
Xiaomi Teams Up with TileRT to Unlock Mind-Blowing 1,000 Token/s Speeds on a Single Node.


Xiaomi has officially announced a landmark technical partnership with TileRT, a runtime developer specializing in high-speed Large Language Model (LLM) execution platforms. By optimizing Xiaomi's proprietary MiMo-V2.5-Pro model, the joint framework has reached a performance milestone: an inference speed of 1,000 tokens per second on a single machine, without relying on specialized or custom AI accelerators.

To unlock this throughput, the engineering teams did not deploy the standard full-size model weights. Instead, they used a heavily compressed variant that targets the model's Mixture-of-Experts (MoE) architecture.

The MoE layers were quantized down to the lean MXFP4 format, with near-lossless accuracy compared to the full-precision base model, while the remaining neural network layers run in FP8.
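To make the MXFP4 idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It assumes the layout commonly described for the Microscaling formats: blocks of 32 values sharing one power-of-two scale, with each value stored as an FP4 (E2M1) number. The function names are illustrative only, and this is not Xiaomi's or TileRT's actual quantization pipeline.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # assumed number of elements sharing one scale


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats.

    Each code is the signed E2M1 value that, multiplied by
    2**shared_exponent, approximates the original number.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Pick a power-of-two scale so the largest magnitude lands near the
    # top of the E2M1 range (its largest exponent is 2: 6 = 1.5 * 2**2).
    shared_exp = math.floor(math.log2(max_abs)) - 2
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude; values above 6
        # saturate at 6 because it is the closest grid point.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(math.copysign(nearest, x))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]
```

Calling `quantize_block` on a 32-element slice of a weight row and then `dequantize_block` on the result shows the rounding error introduced by the 4-bit grid; the shared scale is what keeps that error proportional to the block's largest value.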

The breakthrough is further accelerated by two core engineering optimizations:

  • DFlash (Dynamic Block Flash Inferences): Rather than predicting text sequentially token-by-token, this specialized decoding mechanism speculatively generates a dense block of candidate tokens ahead of time, subsequently verifying their statistical accuracy in a single parallelized compute cycle.

  • Persistent Engine Kernel: TileRT’s proprietary runtime layer features a persistent compute kernel that remains active in hardware cache memory. This design eliminates repetitive kernel-launch overheads and rigorously optimizes data-shuffling pathways across memory lanes to maximize compute utilization.
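The block-decoding idea described for DFlash resembles generic draft-and-verify (speculative) decoding. The sketch below is a toy greedy version of that general technique, not DFlash itself: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap drafter, and the verification loop stands in for what would be a single batched forward pass on real hardware.

```python
def speculative_decode(target_next, draft_next, prompt, block_size, max_new_tokens):
    """Greedy draft-and-verify decoding.

    target_next / draft_next map a list of tokens to the next token.
    Returns (generated_tokens, number_of_verification_passes).
    """
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft a block of candidate tokens with the cheap model.
        draft = []
        for _ in range(block_size):
            draft.append(draft_next(tokens + draft))
        # 2. Verify every drafted position against the target model.
        #    (Simulated with a loop; a real system batches this step.)
        passes += 1
        accepted = []
        for i in range(block_size):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # Whole block accepted: the same pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    generated = tokens[len(prompt):len(prompt) + max_new_tokens]
    return generated, passes


# Toy models (assumptions for illustration): the target counts upward
# modulo 10; the drafter agrees except right after the token 4.
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10
```

Running `speculative_decode(target_next, draft_next, [0], 3, 12)` produces the same tokens the target model would generate on its own, but in fewer verification passes than tokens, which is where the speedup comes from when the drafter is usually right.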

In the spirit of open-source collaboration, Xiaomi has released the fully optimized model weights, named MiMo-V2.5-Pro-FP4-DFlash, to the developer community. Organizations that need commercial-grade turnkey infrastructure can instead access this throughput through Xiaomi's new enterprise-tier Ultraspeed Model Service. This premium API offering requires prior registration approval and is priced at 3x Xiaomi’s standard model endpoints.

MXFP4, the 4-bit member of the Microscaling (MX) formats, is considered one of the most capable low-bit quantization standards currently available. It pairs 4-bit values with small shared scale factors to control accuracy loss. Using this format specifically for Mixture-of-Experts (MoE) models is a clever strategy, because MoE models typically have very large file sizes due to the multiple experts embedded within them. Compressing this portion to 4-bit while keeping the other parts in FP8 drastically reduces the memory footprint, allowing the model to load and run on common hardware at speeds of up to 1,000 tokens/s.
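A rough back-of-the-envelope calculation shows why the mixed scheme saves so much memory. The sketch assumes the usual MX layout of one 8-bit scale per 32 values, so MXFP4 costs about 4.25 bits per weight. The parameter count and expert share in the usage note are hypothetical placeholders, since the article does not give MiMo-V2.5-Pro's actual sizes.

```python
MX_BLOCK_SIZE = 32   # assumed values per shared scale
MX_SCALE_BITS = 8    # assumed bits for each shared scale
MXFP4_BITS = 4 + MX_SCALE_BITS / MX_BLOCK_SIZE  # 4.25 bits per weight
FP8_BITS = 8.0


def footprint_gb(n_params, bits_per_weight):
    """Storage in gigabytes (1e9 bytes) for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9


def mixed_footprint_gb(total_params, expert_fraction):
    """Experts stored in MXFP4, everything else in FP8."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return (footprint_gb(expert_params, MXFP4_BITS)
            + footprint_gb(other_params, FP8_BITS))
```

For a hypothetical 100-billion-parameter model with 90% of its weights in the experts, `mixed_footprint_gb(100e9, 0.9)` comes to roughly 58 GB, against 100 GB for `footprint_gb(100e9, FP8_BITS)`.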

The obstacle to LLM speed is not just the chip's raw compute limit (compute bound) but also the speed of data transfer from main memory to the processor (memory-bandwidth bound). TileRT's Persistent Engine Kernel addresses this by keeping the main computational code resident in deep cache (L2/L3) at all times. This removes the need for the processor to repeatedly load and clear instruction sets, significantly reducing data transfer latency.
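The bandwidth argument can be approximated with a simple model: each decoded token requires reading the active weights once, and every separate kernel launch adds a fixed overhead. The sketch below is a simplification under those assumptions (it ignores KV-cache traffic, multi-GPU communication, and speculative decoding), and every number in the usage note is hypothetical rather than a measurement of TileRT.

```python
def decode_tokens_per_second(active_bytes_per_token,
                             bandwidth_bytes_per_s,
                             kernel_launches_per_token=0,
                             launch_overhead_s=0.0):
    """Upper-bound decode rate for a memory-bandwidth-bound model.

    active_bytes_per_token: weight bytes read to produce one token
    bandwidth_bytes_per_s:  sustained memory bandwidth
    kernel_launches_per_token, launch_overhead_s: fixed per-launch cost;
        a persistent kernel pushes the launch count toward zero.
    """
    memory_time = active_bytes_per_token / bandwidth_bytes_per_s
    launch_time = kernel_launches_per_token * launch_overhead_s
    return 1.0 / (memory_time + launch_time)
```

With a hypothetical 2 GB of active weights per token and 2 TB/s of bandwidth, `decode_tokens_per_second(2e9, 2e12)` gives about 1,000 tokens/s. Adding 500 launches per token at 5 microseconds each, `decode_tokens_per_second(2e9, 2e12, 500, 5e-6)`, drops that to under 300, which illustrates why removing launch overhead matters at this speed.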

Although the Ultraspeed service costs three times as much, speeds of 1,000 tokens/s significantly reduce user waiting times for enterprise customers running real-time AI agents, instant voice call-center bots, or code analysis across millions of lines. Time-to-first-token (TTFT) can drop to near zero, which helps increase customer satisfaction and saves money on cloud server rental costs in the long run.
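The waiting-time benefit is easy to estimate: total response time is roughly the time to the first token plus the number of generated tokens divided by the decode rate. The helper below is an illustrative sketch, and the token counts and comparison rate in the usage note are hypothetical.

```python
def response_seconds(n_tokens, tokens_per_s, ttft_s=0.0):
    """Approximate wall-clock time to stream a full response."""
    return ttft_s + n_tokens / tokens_per_s
```

A 2,000-token answer takes `response_seconds(2000, 1000)`, about 2 seconds, at 1,000 tokens/s, compared with about 20 seconds at a hypothetical 100 tokens/s.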

 


 

Source - Xiaomi 
