Cloudflare Workers AI Goes Large: Welcoming Kimi K2.5 with 1.1 Trillion Parameters and 77% Cost Savings
Cloudflare has announced a major strategic shift for its Workers AI platform. Moving beyond its traditional focus on small-to-medium language models (such as GPT-OSS 120B and Nemotron 3 120B), Cloudflare is now embracing "Frontier-scale" models. The flagship addition to this new tier is Kimi K2.5, a powerhouse model boasting a staggering 1.1 trillion parameters.
The Efficiency Leap: Powered by "Infire"
To maintain its edge in cost-effectiveness, Cloudflare is running the model on its proprietary Infire engine, specifically engineered to squeeze maximum performance out of its GPUs. The real-world impact is significant: Cloudflare’s internal code review system processes 7 billion tokens daily. By switching to Kimi K2.5, the company estimates a 77% reduction in costs, slashing a potential $2.4 million annual bill down to a fraction of the price.
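To make those numbers concrete, here is a back-of-envelope check in TypeScript. It assumes the $2.4 million figure is the pre-switch annual bill and that the 7-billion-token daily volume holds year-round; both are readings of the claim above, not confirmed figures.

```typescript
// Back-of-envelope check on the stated figures.
// Assumptions: $2.4M is the pre-switch annual bill; 7B tokens/day, 365 days.
const annualBillUsd = 2_400_000;
const savedUsd = annualBillUsd * 0.77;          // $1,848,000 saved per year
const remainingUsd = annualBillUsd - savedUsd;  // ≈ $552,000 per year

const tokensPerYear = 7_000_000_000 * 365;      // ≈ 2.56 trillion tokens
const impliedPrePriceUsdPerMTok =
  annualBillUsd / (tokensPerYear / 1_000_000);  // ≈ $0.94 per million tokens

console.log({ savedUsd, remainingUsd, impliedPrePriceUsdPerMTok });
```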
Advanced Caching and Session Affinity
With a massive 256k context window, efficient caching is no longer optional. Cloudflare is introducing two key updates for Kimi K2.5:
Transparent Cache Metrics: Users can now see exactly how much cache was utilized during each request.
x-session-affinity Header: Developers can use this new HTTP header to hint that the system should route a session's requests to the same machine, significantly increasing cache hit rates and reducing latency (see the sketch below).
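As a concrete illustration, the sketch below pins a session's requests to one machine via the new header. It targets the standard Workers AI REST route, but the model slug, the session-ID semantics of the header value, and the shape of the cache-usage field in the response are all assumptions, not documented details.

```typescript
// Hypothetical sketch: keeping a session's prefix cache warm with
// x-session-affinity. ACCOUNT_ID, the model slug, and the usage field
// are placeholders, not confirmed API details.
const ACCOUNT_ID = "YOUR_ACCOUNT_ID";
const MODEL = "@cf/moonshotai/kimi-k2.5"; // assumed model slug
const sessionId = crypto.randomUUID();    // reuse this value on every turn

async function chat(messages: { role: string; content: string }[]) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${MODEL}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
        "Content-Type": "application/json",
        // Hint: route every turn of this session to the same machine so the
        // prefix cache built on turn 1 is still warm on turn N.
        "x-session-affinity": sessionId,
      },
      body: JSON.stringify({ messages }),
    }
  );
  const data = await res.json();
  // Transparent cache metrics: the response is said to report how much of
  // the prompt was served from cache (field name assumed here).
  console.log("cached tokens:", data.result?.usage?.cached_tokens ?? "n/a");
  return data;
}
```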
Flexible Rate Limits and Pricing
While standard rate limits apply to synchronous requests, Cloudflare is introducing an Asynchronous API. It lets high-volume, non-urgent tasks be queued and processed with a typical wait time of under 5 minutes, bypassing the traditional rate limit ceilings.
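The article does not spell out the async endpoints, so the submit-and-poll pattern below (reusing ACCOUNT_ID and MODEL from the earlier sketch) is purely illustrative: every path, query flag, and response field here is an assumption about what such an API might look like.

```typescript
// Illustrative submit-and-poll pattern for the Asynchronous API.
// The ?async flag, the /jobs route, and the job fields are assumptions;
// the article only states a typical wait of under 5 minutes.
async function runAsync(prompt: string): Promise<string> {
  const base = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai`;
  const headers = {
    Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
    "Content-Type": "application/json",
  };

  // 1. Submit the job instead of holding a synchronous connection open.
  const submit = await fetch(`${base}/run/${MODEL}?async=true`, {
    method: "POST",
    headers,
    body: JSON.stringify({ prompt }),
  });
  const { result } = await submit.json();

  // 2. Poll until the job completes (typically under 5 minutes).
  while (true) {
    const poll = await fetch(`${base}/jobs/${result.id}`, { headers });
    const job = await poll.json();
    if (job.result?.status === "complete") return job.result.output;
    await new Promise((r) => setTimeout(r, 10_000)); // check every 10s
  }
}
```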
Pricing for Kimi K2.5 on Workers AI:
Input: $0.60 per million tokens
Cached Input: $0.10 per million tokens
Output: $3.00 per million tokens
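These three rates are enough to estimate per-request costs. The helper below is built directly from the listed prices; the 200k-token split in the example is illustrative, chosen to show how heavily a warm cache discounts a long prompt.

```typescript
// Cost estimator using the listed Kimi K2.5 prices (USD per million tokens).
const PRICE = { input: 0.6, cachedInput: 0.1, output: 3.0 };

function requestCostUsd(input: number, cachedInput: number, output: number) {
  return (
    (input * PRICE.input +
      cachedInput * PRICE.cachedInput +
      output * PRICE.output) / 1_000_000
  );
}

// Example: a 200k-token prompt with a 2k-token completion.
// Warm cache (180k of 200k tokens cached): $0.012 + $0.018 + $0.006 = $0.036
// Cold cache (all 200k at full input price): $0.120 + $0.006 = $0.126
console.log(requestCostUsd(20_000, 180_000, 2_000)); // ≈ 0.036
console.log(requestCostUsd(200_000, 0, 2_000));      // ≈ 0.126
```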
Cloudflare's in-house development of the Infire engine is the key here. The competition isn't about who has more GPUs, but about who can extract more performance per watt. The fact that Infire can serve a 1.1-trillion-parameter model at such a low cost means Cloudflare is challenging OpenAI and Anthropic while leveraging its strengths in distributed edge computing.
The Kimi model (from Moonshot AI) excels at long-context handling. Cloudflare's choice of it as the flagship of this tier suggests they are targeting the Enterprise RAG (Retrieval-Augmented Generation) market, where companies need to read hundreds of pages of corporate documents in a single request. A 256k context window backed by low-cost caching will encourage companies to scale up AI without fear of budget overruns.
A major problem with serverless AI is that a request can land on a machine holding none of the session's prior data (a cold cache), forcing a slow and expensive reload. The x-session-affinity header addresses this: it shifts developers' mindset closer to that of a stateful application and significantly improves the fluidity of continuous multi-turn conversations with AI, as the sketch below illustrates.
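Building on the chat() sketch above, a hypothetical multi-turn loop might reuse one session ID so each turn pays the cheaper cached-input rate for the shared conversation prefix. The response field name is assumed.

```typescript
// Hypothetical multi-turn loop reusing the chat() sketch above. Every call
// carries the same x-session-affinity value, so the growing prefix should
// mostly be served from cache; only the new messages pay full input price.
const history: { role: string; content: string }[] = [];

async function turn(userText: string) {
  history.push({ role: "user", content: userText });
  const data = await chat(history);
  const reply = data.result?.response ?? ""; // response field name assumed
  history.push({ role: "assistant", content: reply });
  return reply;
}
```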
Enabling an asynchronous mode is a clever way to manage GPUs. Cloudflare can use whatever spare computing capacity is available at any given moment to work through pending tasks, allowing it to lower prices to near-unbeatable levels while also reducing system load during peak hours.
Source: Cloudflare