Last week, Inception Labs launched Mercury 2, a large language model based on diffusion rather than the autoregressive approach used by every major AI lab. And on this week’s episode of The New Stack Agents, Inception CEO and co-founder Stefano Ermon explains how the diffusion model of generative AI could reshape how we build AI applications.

But first, some background: Traditional LLMs generate text one token at a time, left to right, a system that Ermon calls “fancy autocomplete.” Diffusion models work differently: They start with a rough answer and refine it in parallel, much like image models such as Stable Diffusion crystallize images from noise. The result is a model that produces over 1,000 tokens per second, five to ten times faster than speed-optimized models from OpenAI, Anthropic, and Google, according to Inception’s own testing.

“What we’re seeing is that our Mercury 2 model, which is a reasoning model, is actually able to match the quality of these speed-optimized models from [frontier labs OpenAI, Anthropic, Meta, and Google], while being five to 10x faster in terms of, like, the end-to-end latency, how long you need to wait before it gives you an answer,” Ermon tells TNS Senior Editor for AI Frederic Lardinois.

Autoregressive models are slow because generating each token is bottlenecked by moving data through memory rather than by computation. Diffusion models instead lean on parallel computation, which is what GPUs were built for. And GPU giant Nvidia, an investor in Inception, is helping optimize the serving engine, Ermon says.

Ermon, who pioneered diffusion models for images at Stanford and published the foundational text diffusion paper that won Best Paper at ICML 2024, is candid about the trade-offs: Mercury 2 matches the quality of Claude Haiku and Google Flash-class models, not Claude Opus or OpenAI’s GPT-4. But he argues the economics will win out as models scale.
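The contrast Ermon draws can be sketched in a few lines of toy Python. This is not Mercury’s actual algorithm, just an illustration of why sequential decoding scales with output length while parallel refinement does not: the autoregressive loop needs one sequential step per token, while the diffusion-style loop runs a small, fixed number of whole-sequence refinement passes.

```python
import random

def autoregressive_generate(n_tokens, vocab):
    # Autoregressive decoding: one sequential step per token,
    # each conditioned on everything generated so far.
    out, steps = [], 0
    for _ in range(n_tokens):
        out.append(random.choice(vocab))  # stand-in for a model's next-token choice
        steps += 1
    return out, steps

def diffusion_generate(n_tokens, vocab, refinement_steps=4):
    # Diffusion-style decoding: start from a noisy draft of the full
    # sequence, then refine every position in parallel for a fixed
    # number of passes, independent of sequence length.
    draft = [random.choice(vocab) for _ in range(n_tokens)]
    steps = 0
    for _ in range(refinement_steps):
        draft = [random.choice(vocab) for _ in draft]  # stand-in for one parallel denoising pass
        steps += 1
    return draft, steps
```

For a 1,000-token answer, the first function takes 1,000 sequential steps; the second takes 4, each of which a GPU can execute across all positions at once.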
Reinforcement learning, the technique behind today’s reasoning models, is also naturally faster on diffusion architectures, since its bottleneck is inference.

Inception Labs is the only company shipping a production diffusion LLM; Google’s text diffusion model is still “experimental.” Mercury 2 is now available via an OpenAI-compatible API, with AWS Bedrock integration coming soon.

Listen to the full conversation on The New Stack Agents.

The post Inception Labs says its diffusion LLM is 10x faster than Claude, ChatGPT, Gemini appeared first on The New Stack.
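Because the API is OpenAI-compatible, the standard chat-completions request shape applies. Here is a minimal sketch using only the Python standard library; the endpoint URL and the "mercury" model id are hypothetical placeholders for illustration, not values confirmed by Inception’s documentation.

```python
import json
import urllib.request

# Hypothetical endpoint — consult Inception's docs for the real URL.
API_URL = "https://api.inceptionlabs.ai/v1/chat/completions"

def build_request(prompt, model="mercury", api_key="YOUR_API_KEY"):
    """Assemble an OpenAI-style chat-completions request.

    The model id here is an assumption; only the request shape is the point.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# To actually call the API (requires a valid key and endpoint):
# req = build_request("Write a haiku about GPUs")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Existing OpenAI SDK clients should also work by pointing their base URL at the compatible endpoint, which is the usual draw of OpenAI-compatible APIs.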