Clarifai’s AI Engine Cuts Costs Without Performance Hit


AI agents are predicted to overtake the enterprise, and possibly the internet, but there’s one unnerving challenge that doesn’t get a lot of attention: paying for the tokens that underlie all the large language models behind AI agents.

“With agents and agentic workloads, they just chew through tokens all day long,” Matt Zeiler, CEO of AI company Clarifai, told The New Stack. “Things like GitHub Copilot, OpenAI Codex, all these different coding tools now process work asynchronously, and that just means you can fire off ten different tasks and have them all chewing through tokens at the same time.”

Clarifai announced a new tool today that’s designed to address this potential cost creep by optimizing the inference performance of models. The Clarifai Reasoning Engine is a collection of optimizations that leverages how reasoning models “think” to improve performance without sacrificing quality or accuracy, Zeiler said. [Editor’s Note: Atlas also offers a reasoning engine, but its goal is to help models reason about tasks, breaking them into subtasks.]

‘Two-Times Faster at 40% the Cost’

“These models think through step-by-step, and because of that, there are certain optimizations we can do to make them accelerate that,” he said. One such method is optimizing kernels for performance. By improving latency and speed, the reasoning engine makes the AI model more economical to run, Zeiler said.

“It’s two-times faster than competitors at 40% the cost,” he said. “Because of our efficiency of running the model with our reasoning engine, we could price it low, as well, to make it much more appealing for all your agentic AI use cases.”

Organizations can use Clarifai’s platform and reasoning engine to optimize custom AI models as well.

“They can actually wrap their own models in a very simple Python class,” he said. “Then they can implement whatever they want their model to be.” (A sketch of what such a wrapper might look like appears below.)

The Clarifai platform also enables developers to write MCP tools that they can define and deploy to the platform with a single line of code; a sketch of that, too, appears below.

A recent benchmark by Artificial Analysis shows that Clarifai’s optimized OpenAI gpt-oss-120b model delivered output at approximately 650 tokens per second, at a price of 10 cents per 1 million tokens. The next fastest offering was SambaNova, at approximately 600 tokens per second and a cost of 30 cents per 1 million tokens. Price-wise, CompactAI also comes in at 10 cents per 1 million tokens, but its speed was much lower, at 200 tokens per second. At those rates, an agentic workload that consumes a billion tokens a month would cost roughly $100 on Clarifai, versus about $300 on SambaNova.

“There’s some companies that build custom chips, like Groq, SambaNova, et cetera,” he said. “Our results are even competitive with some of those custom chips — not just the others that are GPU providers.”

[Chart provided by Clarifai, from Artificial Analysis.]

In a benchmark by Artificial Analysis, Clarifai’s hosted gpt-oss-120b model achieved record speeds, serving over 500 tokens per second with a time to first token of 0.3 seconds. In a subsequent round of tests, the Clarifai Reasoning Engine outperformed all GPU-based inference implementations, as well as specialized non-GPU accelerators, proving for the first time that GPU performance can match, and in some cases surpass, non-GPU architectures, the company said in a statement.

Lots of companies just do inference, but not optimizations, Zeiler added.

“It’s a choice: Do you want low latency? Do you want high throughput? And do you want low prices?” he said. “With Clarifai, you’re getting all three of those without sacrificing any quality or sacrificing the flexibility of deploying across any cloud or even on premises.”
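The article doesn’t show what that Python wrapper looks like. A minimal sketch of the pattern Zeiler describes might resemble the following; the class and method names here are illustrative assumptions, not Clarifai’s actual SDK interface.

```python
# Illustrative sketch of wrapping a custom model in a simple Python class,
# per Zeiler's description. Class and method names are assumptions, not
# Clarifai's actual SDK surface.
from transformers import pipeline


class MyWrappedModel:
    """A user-defined model exposed through load and predict hooks."""

    def load_model(self) -> None:
        # Runs once at startup: load weights onto the target hardware.
        self.generator = pipeline("text-generation", model="gpt2")

    def predict(self, prompt: str) -> str:
        # Runs per request: implement whatever the model should do.
        outputs = self.generator(prompt, max_new_tokens=64)
        return outputs[0]["generated_text"]
```

The appeal of this pattern is the division of labor: the platform handles serving, scaling and protocol plumbing, while the class contains only model logic.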
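The single-line deploy step is Clarifai’s own mechanism and isn’t detailed in the article. As a point of reference, defining an MCP tool in Python with the open source `mcp` SDK’s FastMCP helper looks roughly like this:

```python
# Defining a simple MCP tool with the open source Python `mcp` SDK.
# The single-line deploy to Clarifai mentioned above is not shown here.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("unit-tools")


@mcp.tool()
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```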
The reasoning engine fits into Clarifai’s compute orchestration product, which was announced earlier this year, he said.

“Those models get orchestrated across these different kinds of compute planes at the click of a button,” he said. “You can provision them in our cloud VPCs, on these different cloud providers. You can also connect your own compute from all the different cloud providers in your VPC, as well as bare metal. So the model can get the Clarifai Reasoning Engine and then deploy to any of these compute environments.”

The platform also enables routing traffic across those environments dynamically, a feature launched in preview earlier this year.

Clarifai launched local runners a few weeks ago as well. A local runner is an agent or process that executes jobs, such as tests or builds. These also help enhance performance.

“We actually use local runners heavily internally to develop the optimizations behind our Clarifai Reasoning Engine, because it allows you to even put breakpoints in while benchmarking, while testing out your model, and that’s a game-changer for AI teams,” Zeiler said.

A local runner essentially lets you run a model on a MacBook or a gamer-level PC just as if it were running in the cloud or an on-premises cluster. It opens up that compute behind the Clarifai API, and the API “speaks” all the common protocols, including MCP and gRPC.

“Because of that, you can now have a model running in your laptop, actually being used in your coding tools or your agent development kit, or whatever your favorite MCP client is,” he said.
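To make that concrete, here is a minimal sketch of an MCP client session against a locally running server, again using the open source `mcp` Python SDK. The endpoint URL is hypothetical; a Clarifai local runner would expose its own address.

```python
# Sketch of an MCP client talking to a locally hosted server via the
# open source `mcp` Python SDK. The endpoint URL below is hypothetical.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main() -> None:
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])


asyncio.run(main())
```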