Cut your AI search costs without sacrificing quality

The cost that’s driving your AI search bill

Every organization running AI-powered search faces the same hidden cost driver: query embeddings. Documents are embedded once. Queries are embedded continuously for every user, every search, every second. At scale, this quickly becomes one of the largest line items in your AI infrastructure budget.

Together, Vespa AI and Voyage AI have solved this problem with a technique called asymmetric retrieval. Use the best embedding model available for your documents (once, at indexing time), then embed queries for free using a tiny, locally running model. Voyage AI’s voyage-4 model family is built for exactly this. All four models share a common vector space, making the split practical without any reindexing or architectural changes.

Bottom line for decision-makers: Your query embedding bill effectively goes to zero and your search path becomes more resilient, all without replacing your existing search infrastructure.

The problem: Symmetry is expensive

The conventional approach uses the same embedding model for both documents and queries. It’s simple, but it ignores a critical asymmetry in how those two operations work.

| | Document embedding | Query embedding |
|---|---|---|
| Frequency | Once per document | Every single request |
| Latency sensitivity | None, no user is waiting | On the critical path, 24/7 |
| Cost @ 10K QPS | Amortized, negligible | ~$15,500/month |

At 10,000 queries per second with ~30-token queries, you generate roughly 777 billion tokens per month, all routed through an external API at real cost.

The solution: Asymmetric retrieval with Voyage AI + Vespa

Voyage AI’s voyage-4 family introduces four models (voyage-4-large, voyage-4, voyage-4-lite, and voyage-4-nano) that all produce embeddings in a shared vector space.
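As a back-of-the-envelope check on the figures quoted for the symmetric setup, the sketch below recomputes the monthly token volume from 10,000 QPS and ~30-token queries. The text does not state a per-token price; the $0.02 per million tokens used here is an assumption, chosen only because it reproduces the ~$15,500/month figure. The 30-day month and the function names are likewise illustrative.

```python
# Rough monthly cost of embedding every query through a paid API.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def monthly_query_tokens(qps: int, tokens_per_query: int) -> int:
    """Total query tokens embedded in one month at a constant query rate."""
    return qps * tokens_per_query * SECONDS_PER_MONTH


def monthly_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """API cost for a token volume at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Assumption, not stated in the text: a price that matches the quoted total.
ASSUMED_USD_PER_MILLION_TOKENS = 0.02

tokens = monthly_query_tokens(qps=10_000, tokens_per_query=30)
cost = monthly_cost_usd(tokens, ASSUMED_USD_PER_MILLION_TOKENS)

print(f"{tokens / 1e9:.1f} billion tokens/month")  # 777.6 billion tokens/month
print(f"${cost:,.0f}/month")                       # $15,552/month
```

Because the cost is a plain product of query rate, query length, and price, halving or doubling the QPS changes the bill by the same factor.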
You can embed documents with the most powerful model and query with the smallest, and they remain fully compatible.

Vespa now has native support for this workflow, running voyage-4-nano locally inside its container nodes, with no API calls, no rate limits, and no additional cost.

How it works

Step 1, index time: documents → voyage-4-large (API)
Embed each document once with Voyage AI’s top-tier model. This gives the highest accuracy, with no latency pressure. The cost is fully amortized over the document’s lifetime.

Step 2, query time: queries → voyage-4-nano (local)
Embed every user query with a tiny model running inside Vespa. It runs in single-digit milliseconds on CPU, with zero external API dependency and zero API cost.

Business impact at a glance

| Metric | Symmetric (traditional) | Asymmetric (Vespa + Voyage AI) |
|---|---|---|
| Query embedding cost @ 10K QPS | ❌ ~$15,500 / month | ✅ $0 / month |
| Query embedding latency | ❌ API round-trip (10–80ms) | ✅ […] |

[…] 1,000 queries/sec): ✅ Strong fit, savings scale linearly
Large document corpus: ✅ Strong fit, document embedding cost is amortized
Latency-sensitive applications: ✅ Strong fit, local inference eliminates network round-trips
Multi-tenant platforms: ✅ Strong fit, per-tier quality/cost control
Low volume (