NVIDIA has introduced Dynamo, an open-source AI inference software framework designed to serve and scale reasoning models across AI factories. By orchestrating inference requests across large GPU clusters, Dynamo improves computational efficiency and reduces operational costs, helping AI factories maximize token generation and revenue.
Positioned as the successor to the NVIDIA Triton Inference Server, Dynamo employs disaggregated serving, a technique that separates the prompt-processing (prefill) and token-generation (decode) phases of large language models (LLMs) onto different GPUs. Because each phase can then be optimized and scaled independently, overall resource utilization improves. With these inference optimizations, NVIDIA reports that Dynamo doubles performance for Llama models and boosts token generation per GPU by more than 30 times on DeepSeek-R1 when deployed across large GPU clusters.
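The idea behind disaggregated serving can be sketched in a few lines of Python. The worker names, pool sizes, and handoff logic below are illustrative assumptions rather than Dynamo's actual API: a prefill pool processes the prompt once and hands a cache handle to a separate decode pool that generates tokens.

```python
# Illustrative sketch of disaggregated serving: prefill and decode run on
# separate worker pools instead of sharing one GPU. Names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache_id: str | None = None   # handle produced by prefill, consumed by decode
    output: list[str] = field(default_factory=list)


class PrefillWorker:
    """Compute-bound phase: processes the full prompt once and builds the KV cache."""
    def run(self, req: Request) -> Request:
        req.kv_cache_id = f"kv-{hash(req.prompt) & 0xFFFF:04x}"
        return req


class DecodeWorker:
    """Memory-bandwidth-bound phase: generates tokens one at a time from the cache."""
    def run(self, req: Request) -> Request:
        assert req.kv_cache_id is not None, "decode needs a prefilled KV cache"
        for i in range(req.max_new_tokens):
            req.output.append(f"<tok{i}>")
        return req


# Because the two phases have different bottlenecks, they can be scaled
# independently, e.g. two prefill GPUs feeding six decode GPUs.
prefill_pool = [PrefillWorker() for _ in range(2)]
decode_pool = [DecodeWorker() for _ in range(6)]

req = Request(prompt="Explain disaggregated serving.", max_new_tokens=4)
req = prefill_pool[0].run(req)          # phase 1: prompt processing
req = decode_pool[0].run(req)           # phase 2: token generation
print(req.kv_cache_id, req.output)
```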
Key features of Dynamo include real-time GPU allocation adjustments, intelligent request routing that minimizes costly recomputation, and offloading of inference data, such as the KV cache, to lower-cost memory and storage while keeping retrieval fast. By dynamically managing GPU workloads, Dynamo raises inference throughput and lowers latency, making it a valuable tool for AI service providers.
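One of these features, offloading inference data to cheaper storage while keeping retrieval fast, can be pictured as a tiered cache. The two-tier store below is a simplified stand-in for GPU memory backed by host memory or SSD; the class name, capacities, and eviction policy are assumptions for illustration, not Dynamo's implementation.

```python
# Toy two-tier KV-cache store: a small "GPU" tier backed by a larger,
# cheaper "host" tier. Capacities and names are illustrative assumptions.
from collections import OrderedDict


class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu_tier: OrderedDict[str, bytes] = OrderedDict()  # fast, scarce
        self.host_tier: dict[str, bytes] = {}                   # slower, cheap

    def put(self, key: str, blocks: bytes) -> None:
        self.gpu_tier[key] = blocks
        self.gpu_tier.move_to_end(key)
        # Evict least-recently-used entries to the host tier instead of dropping them.
        while len(self.gpu_tier) > self.gpu_capacity:
            old_key, old_blocks = self.gpu_tier.popitem(last=False)
            self.host_tier[old_key] = old_blocks

    def get(self, key: str):
        if key in self.gpu_tier:
            self.gpu_tier.move_to_end(key)       # hit on the fast tier
            return self.gpu_tier[key]
        if key in self.host_tier:
            # Promote back to GPU memory on reuse; retrieval is still far
            # cheaper than recomputing the prefill from scratch.
            blocks = self.host_tier.pop(key)
            self.put(key, blocks)
            return blocks
        return None                              # miss: caller must recompute


cache = TieredKVCache(gpu_capacity=2)
for session in ("user-a", "user-b", "user-c"):
    cache.put(session, b"kv-blocks")
print("user-a" in cache.gpu_tier, "user-a" in cache.host_tier)  # False True
```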
Dynamo is fully open source and supports popular frameworks such as PyTorch, NVIDIA TensorRT-LLM, and vLLM, making integration straightforward for enterprises and researchers. Major AI players such as AWS, Google Cloud, Microsoft Azure, and Meta are among the organizations expected to benefit from Dynamo’s ability to scale inference workloads efficiently.
A core innovation of Dynamo is its mapping of the inference data (KV cache) already held on GPUs across the cluster, so that new requests are routed to the nodes holding the most relevant cached data, avoiding redundant recomputation and shortening response times. With its advanced scheduling, low-latency communication, and modular architecture, Dynamo is positioned to become a key driver in optimizing AI inference at scale.
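A minimal sketch of this cache-aware routing idea is shown below. The scoring rule, worker state, and function names are hypothetical simplifications: the router picks the worker whose cached token sequences share the longest prefix with the incoming request, so only the remainder needs fresh computation.

```python
# Sketch of cache-aware routing: send a request to the worker that already
# holds the longest matching prefix of its tokens. Names are illustrative.
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request_tokens: list[int], worker_caches: dict[str, list[list[int]]]):
    """Pick the worker whose cached sequences overlap most with the new request."""
    best_worker, best_overlap = None, -1
    for worker, cached_sequences in worker_caches.items():
        overlap = max(
            (shared_prefix_len(request_tokens, seq) for seq in cached_sequences),
            default=0,
        )
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker


workers = {
    "gpu-0": [[1, 2, 3, 4, 5]],   # already holds a long matching prefix
    "gpu-1": [[9, 9, 9]],
    "gpu-2": [],
}
print(route([1, 2, 3, 4, 6, 7], workers))  # -> "gpu-0"
```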