NVIDIA and DeepInfra Achieved 300 t/s on a 550B Parameter Giant.

- June 01, 2026

NVIDIA Unveils Nemotron 3 Ultra: A Massive 550B MoE Titan Clocking Record-Breaking 300 t/s via DeepInfra

During his highly anticipated Computex 2026 keynote, NVIDIA CEO Jensen Huang officially announced the launch of Nemotron 3 Ultra. Standing as the largest and most sophisticated open-weights artificial intelligence model NVIDIA has ever developed, Nemotron 3 Ultra features a colossal 550 billion total parameters (550B).

Utilizing a highly advanced hybrid architecture, the model operates with exceptional compute efficiency by maintaining roughly 90% sparsity, activating only 55 billion parameters (55B active) per token.
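The sparsity figure follows directly from the two parameter counts quoted above. A minimal sketch of the arithmetic, where the function name `moe_activation` is purely illustrative and not part of any NVIDIA tooling:

```python
def moe_activation(total_params_b: float, active_params_b: float) -> tuple[float, float]:
    """Return (active_fraction, sparsity) for a mixture-of-experts model.

    Both arguments are parameter counts in billions.
    """
    active_fraction = active_params_b / total_params_b
    return active_fraction, 1.0 - active_fraction


# Figures quoted in the article: 550B total, 55B active per token.
active, sparsity = moe_activation(550, 55)
print(f"active: {active:.0%}, sparsity: {sparsity:.0%}")
```

In other words, each token touches about one tenth of the weights, which is where the "roughly 90% sparsity" description comes from.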

Benchmark Evaluation: The Most Capable US Open Model

While the complete open-weights repository is slated for full public availability in the first half of 2026, benchmarking platform Artificial Analysis partnered with NVIDIA to evaluate a pre-release version running in native BF16 precision.

The results mark a major milestone for Western open-source development:

Intelligence Metric: Nemotron 3 Ultra achieved a score of 48 on the Artificial Analysis Intelligence Index.
The Competitive Landscape: This score makes it the leader among US-based open-weights models, beating DeepSeek V4 Flash and comfortably outclassing domestic peers like Gemma 4 31B (39) and Nemotron 3 Super (36). However, it still narrowly trails elite Chinese frontier models such as Moonshot's Kimi K2.6 and MiniMax-M2.7.

Speed Supremacy: Breaking the Throughput Ceiling

The defining competitive moat of Nemotron 3 Ultra lies in its deployment performance. Testing conducted on a pre-release endpoint hosted by bare-metal inference infrastructure provider DeepInfra demonstrated a blistering generation speed exceeding 300 tokens per second (t/s).

This throughput upends the enterprise AI landscape, as comparable frontier models in this size class typically top out at just 50 to 100 tokens per second. That makes Nemotron 3 Ultra the fastest AI model in its intelligence tier.
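To put those rates in perspective, here is a rough sketch of how long a single response would take to stream at each speed. The 3,000-token response length is an assumed example, not a figure from the benchmark, and the calculation ignores time-to-first-token and network latency:

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a steady decode rate (ignores prefill latency)."""
    return output_tokens / tokens_per_second


RESPONSE_TOKENS = 3000  # hypothetical response length, chosen for illustration

for speed in (50, 100, 300):
    seconds = generation_seconds(RESPONSE_TOKENS, speed)
    print(f"{speed:>3} t/s -> {seconds:.0f} s")
```

Under these assumptions, a response that takes 30 to 60 seconds at typical speeds would finish in about 10 seconds at 300 t/s.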

What enables this massive 550B model to run at 300 t/s is not just the power of the Blackwell chip but also its Hybrid Latent MoE architecture. NVIDIA combines Mamba-2 state space layers with Transformer attention layers. The Mamba layers process long sequences in linear time, stably supporting a context window of up to 1 million tokens (1M) without the speed degradation seen in older models.
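The difference between linear and quadratic scaling can be illustrated with a toy example. The sketch below is not Mamba-2, which uses input-dependent parameters and hardware-optimized kernels. It is a fixed scalar state space recurrence, with made-up coefficients `a`, `b`, and `c`, set next to a count of the token pairs that causal self-attention has to score:

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One constant-cost update per token, so total work grows linearly with length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pair_count(n: int) -> int:
    """Number of (query, key) pairs scored by causal self-attention over n tokens."""
    return n * (n + 1) // 2


print(ssm_scan([1.0, 0.0, 0.0], a=0.5))

n = 1_000_000
print(f"recurrence steps: {n:,}")
print(f"attention pairs:  {attention_pair_count(n):,}")
```

At a 1M-token context the recurrence takes one million steps, while full causal attention would score on the order of 500 billion pairs, which is why hybrid designs lean on state space layers for long inputs.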

Another factor contributing to this speed is Multi-Token Prediction (MTP), which is built into the main model. This allows the model to predict and emit multiple tokens at a time in a single forward pass, essentially providing a built-in form of speculative decoding without relying on external draft models.
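How much MTP helps depends on how often the extra predicted tokens turn out to be right. The article does not describe NVIDIA's exact mechanism, so the sketch below borrows the standard speculative-decoding estimate and assumes each extra token is accepted independently with probability `accept_prob`, with everything after the first rejection discarded. The function name and the sample values are hypothetical:

```python
def expected_tokens_per_pass(extra_tokens: int, accept_prob: float) -> float:
    """Expected tokens emitted per forward pass.

    One token is always produced. The i-th extra token survives only if it
    and all earlier extra tokens are accepted, which happens with
    probability accept_prob ** i.
    """
    return sum(accept_prob ** i for i in range(extra_tokens + 1))


# Illustrative values only; these are not measured acceptance rates.
for p in (0.5, 0.8):
    print(f"accept_prob={p}: {expected_tokens_per_pass(2, p):.2f} tokens/pass")
```

With two extra tokens and a 50% acceptance rate, for example, the model would average 1.75 tokens per pass instead of 1, and the gain grows as predictions become more reliable.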

With lower running costs from activating only the necessary experts (55B active), coupled with speeds of 300 t/s, this model is not built just for general-purpose chatbots. Rather, it is designed to be the infrastructure for "Agentic AI": industrial bots that need to perform complex tasks involving hundreds of steps across large documents (multi-step task planning and cross-document reasoning). This level of speed helps minimize lag while the bot is "thinking and working."

Source: Artificial Analysis
