Xiaomi Teams Up with TileRT to Unlock Mind-Blowing 1,000 Token/s Speeds on a Single Node.

- June 09, 2026

Xiaomi Partners with TileRT to Achieve Blazing-Fast 1,000 Token/s On-Single-Node LLM Execution

Xiaomi has officially announced a technical partnership with TileRT, a runtime developer specializing in ultra-high-speed Large Language Model (LLM) execution platforms. By optimizing Xiaomi's proprietary MiMo-V2.5-Pro model, the joint framework has reached a performance milestone: an inference speed of 1,000 tokens per second on a single machine, without relying on specialized or custom AI accelerators.

To reach this throughput, the engineering teams did not deploy the standard full-size model weights. Instead, they used a heavily compressed variant whose compression targets the model's Mixture-of-Experts (MoE) architecture.

The MoE expert layers were quantized down to the lean MXFP4 format, reportedly preserving near-lossless accuracy compared to the full-precision base model, while the remaining neural network layers run in FP8.

The breakthrough is further accelerated by two core engineering optimizations:

DFlash (Dynamic Block Flash Inferences): Rather than predicting text sequentially token-by-token, this specialized decoding mechanism speculatively generates a dense block of candidate tokens ahead of time, subsequently verifying their statistical accuracy in a single parallelized compute cycle.
Persistent Engine Kernel: TileRT’s proprietary runtime layer features a persistent compute kernel that remains active in hardware cache memory. This design eliminates repetitive kernel-launch overheads and rigorously optimizes data-shuffling pathways across memory lanes to maximize compute utilization.
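
The draft-then-verify idea behind block decoding can be sketched in a few lines of Python. This is a toy illustration of generic speculative decoding, not Xiaomi's or TileRT's actual DFlash implementation: the function names, the counting "model", and the deliberately wrong draft guess are all hypothetical. A real system would verify the whole block in one batched forward pass of the large model; the loop below only emulates that step.

```python
from typing import Callable, List, Tuple

Token = int


def speculative_block_decode(
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify decoding; returns (tokens, verify passes)."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        candidates = draft_block(out, block_size)
        verify_passes += 1  # stands in for ONE batched pass of the large model
        accepted: List[Token] = []
        for cand in candidates:
            expected = target_next(out + accepted)
            if cand == expected:
                accepted.append(cand)      # draft token confirmed
            else:
                accepted.append(expected)  # replace the first wrong guess
                break                      # discard the rest of the block
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], verify_passes


# Hypothetical toy models: the "target" counts upward modulo 10.
def toy_target_next(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10


# The "draft" does the same but is deliberately wrong whenever 3 is due.
def toy_draft_block(context: List[Token], block_size: int) -> List[Token]:
    block, last = [], context[-1]
    for _ in range(block_size):
        guess = (last + 1) % 10
        if guess == 3:
            guess = 7  # deliberate draft mistake
        block.append(guess)
        last = guess
    return block


tokens, passes = speculative_block_decode(
    toy_target_next, toy_draft_block, prompt=[0], max_new_tokens=12
)
```

The output is identical to what the target model would produce one token at a time, but the number of verification passes is smaller than the number of generated tokens. That gap is where the speedup comes from.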

In the spirit of open-source collaboration, Xiaomi has released the fully optimized model weights, named MiMo-V2.5-Pro-FP4-DFlash, to the developer community. However, organizations that need commercial-grade turnkey infrastructure can access this throughput through Xiaomi's new enterprise-tier Ultraspeed Model Service. This premium API offering requires prior registration approval and is priced at a 3x premium compared to Xiaomi’s standard model endpoints.

MXFP4, the 4-bit floating-point format from the Microscaling (MX) formats family, is considered one of the most capable quantization standards currently available. It pairs 4-bit values with small scale factors shared across blocks of values to limit quantization loss. Using this format specifically for Mixture-of-Experts (MoE) models is a clever strategy, because MoE models typically have very large file sizes due to the multiple experts embedded within the model. Compressing this portion to 4-bit while keeping the other parts in FP8 drastically reduces the memory footprint, allowing the model to load and run on common hardware at speeds of up to 1,000 tokens/s.
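
To make the block-scaling idea concrete, here is a simplified Python sketch of MXFP4-style quantization. It assumes the commonly described layout of 32 values per block, 4-bit E2M1 elements, and one shared 8-bit power-of-two scale. The helper names and the simple nearest-value rounding are illustrative assumptions, not the code Xiaomi or TileRT use.

```python
import math
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # values sharing one scale
SCALE_BITS = 8    # one power-of-two scale per block
ELEMENT_BITS = 4


def quantize_block(block: List[float]) -> Tuple[int, List[float]]:
    """Return (shared exponent, FP4 values) for one block of weights."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Choose the scale so the largest magnitude lands in the top FP4 binade
    # (the largest power of two in the grid is 4 = 2**2).
    shared_exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])  # saturate at 6
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp: int, codes: List[float]) -> List[float]:
    """Rebuild approximate weights from the shared exponent and FP4 values."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def bits_per_weight() -> float:
    """Average storage cost per weight, including the shared scale."""
    return ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE


weights = [i / 8 - 2 for i in range(BLOCK_SIZE)]  # toy block: -2.0 .. 1.875
exp, codes = quantize_block(weights)
restored = dequantize_block(exp, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions each weight costs 4.25 bits on average, roughly half the 8 bits of an FP8 weight, which is why applying the format to the expert layers shrinks the model so much.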

The bottleneck for LLM speed is not only the chip's compute limit (compute bound) but also the speed of data transfer from main memory to the processor (memory-bandwidth bound). TileRT's Persistent Engine Kernel addresses this by keeping the main computational code resident in deep memory (L2/L3 cache) at all times. This removes the need for the processor to repeatedly "load and clear" instruction sets, significantly reducing data transfer latency.
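
A back-of-the-envelope estimate shows why memory bandwidth matters so much. The sketch below assumes a purely bandwidth-bound decoder in which every forward pass must read all active weights once. It ignores KV-cache reads, activation traffic, and compute limits, and every number in it is a made-up placeholder rather than a MiMo-V2.5-Pro or TileRT specification.

```python
def decode_tokens_per_second(
    active_params: float,
    bits_per_weight: float,
    bandwidth_bytes_per_s: float,
    tokens_per_pass: float = 1.0,
) -> float:
    """Upper-bound decode speed when weight reads dominate.

    active_params          weights touched per forward pass (hypothetical)
    bits_per_weight        storage width, e.g. 8 for FP8
    bandwidth_bytes_per_s  aggregate memory bandwidth of the node (hypothetical)
    tokens_per_pass        average tokens produced per pass; greater than 1
                           when a block of speculative tokens is accepted
    """
    bytes_per_pass = active_params * bits_per_weight / 8
    passes_per_second = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_second * tokens_per_pass


# Placeholder figures: 1e9 active weights and 1e12 bytes/s of bandwidth.
fp8_speed = decode_tokens_per_second(1e9, 8, 1e12)
fp4_speed = decode_tokens_per_second(1e9, 4, 1e12)
fp4_block_speed = decode_tokens_per_second(1e9, 4, 1e12, tokens_per_pass=3.0)
```

In this simplified model, halving the bits per weight doubles the ceiling, and accepting several speculative tokens per pass multiplies it again, which is consistent with the article's combination of quantization and block decoding.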

Although the Ultraspeed Model Service is three times more expensive, for enterprise customers using AI in real-time agents, instant voice call-center bots, or code analysis across millions of lines, speeds of 1,000 tokens/s significantly reduce user waiting times. Time-to-first-token (TTFT) is reduced to near zero, which helps increase customer satisfaction and can save money on cloud server rental costs in the long run.

Source - Xiaomi

