Pinecone announced Tuesday the next-generation version of its serverless architecture, which the company says is designed to better support a wide variety of AI applications.

With the advent of AI, the cloud-based vector database provider has noticed a shift in how its databases are used, explained chief technology officer Ram Sriharsha. In a recent post announcing the architecture changes, Sriharsha said broader use of AI applications has led to a rise in demand for:

- Recommender systems requiring thousands of queries per second;
- Semantic search across billions of documents; and
- AI agentic systems that require millions of independent agents operating simultaneously.

In short, Pinecone is trying to serve diverse and sometimes opposing customer needs. Among the differences is that retrieval-augmented generation (RAG) and agentic AI workflows tend to be more sporadic than semantic search, the company noted.

“They look very different from semantic search use cases,” Sriharsha told The New Stack. “In these emerging use cases, you see that actual workloads are very spiky, so it’s the opposite of predictable workload.”

Also, the corpus of information might actually be quite small, from a few documents to a few hundred documents. Even larger loads are broken up into what Pinecone calls “namespaces” or “tenants.” Within each tenant, the number of documents might be small, he said. (A brief code sketch of this namespace model appears below.)

That requires a very different sort of system to serve cost-effectively, he added.

A Pod-Based Architecture

About four years ago, Pinecone began to ship the public version of its vector database in a pod-based architecture.

A pod-based architecture is a way of organizing computing resources in which a “pod” is a group of dedicated machines tightly linked together to function as a single unit. It’s often used for cloud computing, high-performance computing (HPC) and other scenarios where scalability and resource management are the primary concerns.

That worked because traditionally, recommender systems used a “build once and serve many” form of indexing, Sriharsha explained.

“Often, vector indexes for recommender workloads would be built in batch mode, taking hours,” he wrote in the blog. “This means such indexes will be hours stale, but it also allows for heavy optimization of the serving index since it can be treated as static.”

Serverless Architecture

Semantic search workloads bring different requirements, he continued. They generally have a larger corpus and require predictable low latency, even though their throughput isn’t very high.
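For readers who have not used the product, the namespace model Sriharsha describes is easiest to see from Pinecone's Python SDK. The sketch below is illustrative only: the index name, dimension, cloud region and tenant ID are placeholders, and a real application would generate its vectors with an embedding model.

```python
# Minimal sketch of per-tenant namespaces in a Pinecone serverless index.
# All names and values here are placeholders, not taken from the announcement.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# One serverless index can hold many tenants, each isolated in its own namespace.
pc.create_index(
    name="agent-memory",
    dimension=4,  # toy dimension; production indexes match the embedding model
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("agent-memory")

# Each agent or customer writes into its own, often very small, namespace.
index.upsert(
    vectors=[
        {"id": "doc-1", "values": [0.1, 0.2, 0.3, 0.4], "metadata": {"source": "faq"}},
    ],
    namespace="tenant-42",
)

# Queries are scoped to a single namespace, so one tenant's handful of
# documents never has to be searched alongside everyone else's data.
results = index.query(
    vector=[0.1, 0.2, 0.3, 0.4],
    top_k=3,
    namespace="tenant-42",
    include_metadata=True,
)
```

The claim in the announcement, discussed below, is that a single index can now carry millions of such namespaces without the operator resizing or resharding anything.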
Semantic search workloads also tend to make heavy use of metadata filters, and they care more about freshness, which is whether the database indexes reflect the most recent inserts and deletes.

Agentic workloads are different still, with small-to-moderate-sized corpora of fewer than a million vectors, but lots of namespaces or tenants.

He noted that customers running agentic workloads want:

- Highly accurate vector search out of the box, without becoming vector search experts;
- Freshness, elasticity and the ability to ingest data without hitting system limits, resharding and resizing; and
- Predictable, low latencies.

Supporting that requires a serverless architecture, Sriharsha said.

“That has been highly successful for these RAG and agentic use cases and so on, and it’s driven a lot of cost savings to customers, and it’s also allowed people to run things at large scale in a way that they couldn’t do before,” he said.

Convergence on One Approach

But now Pinecone was supporting two systems: the pod-based architecture and the serverless architecture. The cloud provider began to look at how it could converge the two in a way that offered customers the best of both.

“They still don’t want to have to deal with sizing all these systems and all of this complexity, so they can benefit from all the niceties of serverless, but they need something that allows them to do massive scale workloads,” Sriharsha said. “That meant we had to figure out how to converge pod architecture into serverless and have all the benefits of serverless, but at the same time do something that allows people to run these very different sort of workloads.”

Tuesday’s announcement was the culmination of months of work to create one architecture to serve all needs. This next-generation approach allows Pinecone to support cost-effective scaling to 1,000+ QPS through provisioned read capacity, high-performance sparse indexing for higher retrieval quality, and millions of namespaces per index to support massively multitenant use cases.

Image via Ram Sriharsha’s blog post

The new architecture involves the following key innovations to Pinecone’s vector databases, according to Sriharsha’s post:

- Log-structured indexing (LSI), a data storage technique that prioritizes write speed and efficiency, which Pinecone has adapted and applied to its vector database;
- A new freshness approach that routes all reads through the memtable, an in-memory structure that holds the most recently written data (a simplified sketch of this pattern appears after this list);
- Predictable caching, in which the index portion of the file (Pinecone calls these slabs) is always cached between local SSD and memory, which enables Pinecone “to serve queries immediately, without having to wait for a warm up period for cold queries”;
- Cost-effective serving at high QPS; and
- Disk-based metadata filtering, another new feature in this update of Pinecone’s serverless architecture.
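To make the memtable bullet concrete, here is a deliberately simplified, generic sketch of the log-structured pattern the post describes. It is not Pinecone's implementation: the class names, the flush threshold and the use of plain dictionaries are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """An immutable, flushed batch of vectors (a stand-in for a 'slab')."""
    data: dict  # vector id -> vector

@dataclass
class LogStructuredIndex:
    memtable: dict = field(default_factory=dict)  # holds the most recent writes
    segments: list = field(default_factory=list)  # older, flushed data
    memtable_limit: int = 10_000

    def upsert(self, vec_id: str, vector: list) -> None:
        # Writes land in the in-memory memtable, so they are visible immediately.
        self.memtable[vec_id] = vector
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the current memtable into an immutable segment and start fresh.
        self.segments.append(Segment(data=dict(self.memtable)))
        self.memtable.clear()

    def get(self, vec_id: str):
        # Route reads through the memtable first, so the newest write wins,
        # then fall back to flushed segments from newest to oldest.
        if vec_id in self.memtable:
            return self.memtable[vec_id]
        for segment in reversed(self.segments):
            if vec_id in segment.data:
                return segment.data[vec_id]
        return None
```

The trade-off is the familiar log-structured one: absorb writes in memory so reads see them right away, and pay the cost of organizing the data later, when segments are flushed and reorganized in the background.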