
Google TPU v8 Arrives: A Split Strategy for Training and Inference Supremacy

Google Unveils TPU v8: A Dual-Chip Architecture for the Future of AI

Google has officially unveiled its 8th-generation Tensor Processing Unit (TPU) architecture, marked by a strategic split into two distinct, specialized chips: the TPU v8t (for training) and the TPU v8i (for inference). While they share a brand name, they are built on fundamentally different architectures designed to solve the distinct bottlenecks of model training and large-scale deployment.

TPU v8t: The Training Powerhouse

Engineered for massive-scale training, the v8t is a performance monster.

  • SparseCore & FP4: It introduces a dedicated SparseCore for memory access control and an MXU (Matrix Multiplication Unit) with native FP4 (4-bit floating point) support, doubling down on efficiency.

  • Virgo Network: It features the new Virgo Network, capable of interconnecting up to 134,000 chips at a staggering total bandwidth of 47 Petabits per second, delivering a peak performance of 1.6 ExaFLOPS (a minimal multi-chip training sketch follows this list).

  • Direct Path: With TPU Direct RDMA, the chip can bypass the CPU entirely to access memory and pull data directly from storage.
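
To make the pod-scale numbers above concrete, here is a minimal, hypothetical JAX sketch of a data-parallel training step spread across whatever TPU chips the runtime can see. Nothing in it is v8-specific; the model, loss, and learning rate are toy placeholders, and the cross-chip gradient reduction stands in for the kind of traffic a fabric like Virgo would carry.

```python
# Hypothetical sketch: data-parallel training step across all visible TPU chips.
# Nothing here is v8-specific; the model and loss are toy placeholders.
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"]                   # toy linear model
    return jnp.mean((pred - y) ** 2)

@functools.partial(jax.pmap, axis_name="chips")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # Cross-chip gradient averaging; on a real pod this rides the interconnect fabric.
    grads = jax.lax.pmean(grads, axis_name="chips")
    return jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)

n = jax.local_device_count()
params = jax.device_put_replicated({"w": jnp.zeros((8, 1))}, jax.local_devices())
x = jnp.ones((n, 32, 8))                     # one shard of the global batch per chip
y = jnp.ones((n, 32, 1))
params = train_step(params, x, y)
```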

TPU v8i: The Inference Specialist

Designed to solve the "memory wall" in LLM inference, the v8i is built for speed.

  • Massive SRAM: It boasts 3x the on-chip SRAM of its predecessor, directly addressing memory-bandwidth bottlenecks (see the back-of-envelope sketch after this list).

  • CAE Engine: A new Collectives Acceleration Engine (CAE) is integrated to accelerate the decoding phase of AI models.

  • Boardfly ICI: A new chip-to-chip interconnect architecture specifically optimized for Mixture-of-Experts (MoE) models.
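
The "memory wall" the v8i targets is easy to see with a back-of-envelope calculation: during autoregressive decoding, each generated token has to stream roughly the full set of weights from memory, so tokens per second are capped by bandwidth rather than FLOPS. The figures below (model size, per-chip bandwidth) are illustrative assumptions, not published v8i specs.

```python
# Back-of-envelope sketch of the "memory wall" during autoregressive decoding.
# All numbers below are illustrative assumptions, not published v8i specs.
params_bytes = 70e9 * 0.5      # 70B-parameter model stored in FP4 (~0.5 byte/param)
hbm_bandwidth = 3e12           # assume ~3 TB/s of off-chip memory bandwidth per chip

# Each decoded token must stream (roughly) every weight once, so the ceiling is:
tokens_per_s = hbm_bandwidth / params_bytes
print(f"~{tokens_per_s:.0f} tokens/s per chip, bandwidth-bound")

# Keeping hot weights or KV-cache tiles in a larger on-chip SRAM cuts this traffic,
# which is the bottleneck the v8i's 3x SRAM is meant to relieve.
```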

Google reports that the v8t offers 2.7x better price-performance than the previous Ironwood architecture, while the v8i delivers an 80% boost in inference performance, particularly for massive MoE models.

Hardware-level FP4 (4-bit) support is a major turning point. Previously, we used FP16 or FP8; dropping to 4 bits roughly halves memory use relative to FP8 and quarters it relative to FP16, with minimal accuracy loss for many models. This enables us to run much larger models on the same hardware.
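
For intuition, here is a small, self-contained sketch that simulates FP4 (E2M1) weight quantization in JAX. It only emulates the rounding behavior; on hardware with native FP4, the weights stay packed at 4 bits each, which is where the 2-4x memory saving comes from. The per-tensor scaling scheme is an illustrative choice, not Google's published recipe.

```python
# Minimal sketch of FP4 (E2M1) weight quantization, simulated in JAX.
import jax.numpy as jnp

# The values representable by an E2M1 float: sign x {0, .5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = jnp.array([0., .5, 1., 1.5, 2., 3., 4., 6.])
FP4_GRID = jnp.concatenate([-FP4_GRID[::-1], FP4_GRID])

def quantize_fp4(w):
    scale = jnp.max(jnp.abs(w)) / 6.0      # per-tensor scale onto the FP4 range
    idx = jnp.argmin(jnp.abs(w[..., None] / scale - FP4_GRID), axis=-1)
    return FP4_GRID[idx] * scale           # dequantized values, for comparison

w = jnp.array([[0.12, -0.8, 0.33], [1.9, -0.05, 0.6]])
w_q = quantize_fp4(w)
print(jnp.max(jnp.abs(w - w_q)))           # worst-case rounding error
# Memory: FP16 = 2 bytes/weight, FP8 = 1, FP4 = 0.5 -> a 4x / 2x reduction.
```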

The TPU v8i chip is specifically designed for Mixture-of-Experts (MoE), currently the most popular architecture for top-tier models (such as GPT-4 or Gemini). MoE activates only a few expert sub-networks per token, so a model can match the quality of a much larger dense model while spending far less compute per token. That routing also generates heavy chip-to-chip traffic, and the Boardfly ICI, designed specifically to handle it, gives Google a significant lead over competitors in running such models.
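
A minimal top-2 MoE routing layer, sketched in JAX, shows why the architecture is cheap per token and why it stresses the interconnect. The shapes, expert count, and dense gather used here are illustrative simplifications; production MoE layers add capacity limits and load-balancing losses, and they shard experts across many chips, which turns routing into all-to-all traffic over links like Boardfly ICI.

```python
# Minimal sketch of Mixture-of-Experts routing (top-2 gating), in JAX.
import jax
import jax.numpy as jnp

def moe_layer(x, router_w, expert_w, k=2):
    # x: (tokens, d_model); router_w: (d_model, n_experts);
    # expert_w: (n_experts, d_model, d_model)
    logits = x @ router_w
    topk_val, topk_idx = jax.lax.top_k(logits, k)   # pick k experts per token
    gates = jax.nn.softmax(topk_val, axis=-1)
    out = jnp.zeros_like(x)
    for slot in range(k):
        w = expert_w[topk_idx[:, slot]]             # gather each token's chosen expert
        out += gates[:, slot, None] * jnp.einsum("td,tde->te", x, w)
    return out

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4, 8))                  # 4 tokens, d_model = 8
router_w = jax.random.normal(key, (8, 4))           # 4 experts
expert_w = jax.random.normal(key, (4, 8, 8))
print(moe_layer(x, router_w, expert_w).shape)       # (4, 8)
```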

Google isn't just selling a chip; it is also pushing Pallas, a low-level kernel-programming framework for writing custom TPU kernels directly from JAX. Pallas lets developers push the chips toward their peak TFLOPS, whereas reaching comparable utilization on NVIDIA hardware with CUDA can require more complex manual optimization in some cases. Integrating Pallas with PyTorch will further solidify Google's dominance in the AI infrastructure market.
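
Pallas is exposed through jax.experimental.pallas; the sketch below is about the simplest possible kernel (an element-wise add), just to show the shape of the API: a kernel operating on block references, wrapped by pl.pallas_call. A production matmul kernel would add block specs, a grid, and tiling tuned to the MXU.

```python
# Minimal Pallas kernel: element-wise add of two arrays.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # x_ref/y_ref/o_ref are references to blocks staged in fast on-chip memory.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    # out_shape declares what the kernel produces;
    # pass interpret=True to pallas_call to run on CPU for debugging.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.arange(256, dtype=jnp.float32)
print(add(x, x)[:4])  # [0. 2. 4. 6.]
```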


Source: Google Cloud Blog 

