I built a natural-language chatbot reporting system powered by Retrieval-Augmented Generation (RAG). It provided a great user experience: users could query the data and get reports sliced and diced by whichever dimensions they wanted. It was intuitive and efficient. It turned tons of data into liquid gold. Then the LLM API bill arrived. That's when I realized that in my quest to build this intelligence, I had created a financial black hole.

In today's IT world, every team is looking to build an agentic AI system. These systems make hundreds of thousands of LLM calls in every environment, starting with development and evaluation, through user acceptance testing, and into production. To generate grounded answers, we implement Retrieval-Augmented Generation (RAG): we query vector stores for similar content and pass the retrieved data, along with the user's question, prompts, examples, guardrail instructions, structured response schemas, and so on, to the LLM. For a simple question, we therefore add hundreds of extra tokens in the name of context, guardrails, and few-shot prompting. The LLM parses these tokens, does its job, and generates the response. That response can be an answer to the user's question, or a structured response for one of the steps in a multi-agent workflow. Executing a single workflow with multiple agents invoking multiple tools can consume thousands of tokens. All of this boils down to high token usage and high cost, which raises questions about Return on Investment (ROI).

In my system, the agent pulls data from a vector DB and a knowledge graph to ground the answer. I observed around ~2,000 input tokens in the context window. The LLM then generates a thoughtful, 500-token response. At first glance, that's just a few cents. But with 200,000 inquiries a day, those "cents" transform into a monster.
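As a back-of-envelope check, here is that math as a quick sketch. The per-token prices below are hypothetical placeholders, not the actual rates of any provider:

```python
# Back-of-envelope daily LLM cost estimate.
# Prices are hypothetical placeholders; substitute your provider's real rates.
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens (assumed)

input_tokens = 2_000        # observed context per request
output_tokens = 500         # observed response per request
requests_per_day = 200_000

cost_per_request = (input_tokens * INPUT_PRICE_PER_M
                    + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
daily_cost = cost_per_request * requests_per_day

print(f"${cost_per_request:.4f} per request, ${daily_cost:,.0f} per day")
```

At these assumed rates, roughly a cent per request turns into thousands of dollars a day, before you even count development and evaluation traffic.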
Now add the usage from multiple users during the development and evaluation phases as well. Today's cost per million input and output tokens looks cheap because we are in the early stages of LLM API adoption and service providers have deliberately offered heavy discounts. As these prices rise, the overall cost of these systems, from development to production to support and maintenance, will increase drastically. A couple of articles cite the challenges around LLM cost:

- The LLM Pricing War Is Hurting Education—and Startups
- How scaling enterprise AI with the wrong LLM could cost you

I built RAG-based systems using LangGraph, Neo4j, ChromaDB, and Redis. In this article I present an architecture pattern named Semantic Cache, with implementation details, to reduce token burn in production. This pattern can cut your LLM usage cost by caching and answering similar questions instead of invoking the LLM every single time. You can apply it across multiple use cases and at multiple touch points in your agentic AI workflow.

## What is Cache

Cache has been part of IT system architecture for ages. It started in-memory, then became distributed. Initially there were only a few offerings for on-prem or VM-based deployment; with the advent of the cloud, many managed offerings evolved, and today we have quite mature caching systems. However, usage has always been based on key-value pairs. Keys are hashed and stored in a hash-table-like data structure, and every lookup is an exact-key lookup: if your key is present, the cache returns its value; if it is not, it was either never stored or it was evicted. Caching was never meant for similarity-based lookup.

## What is Semantic Cache

Vectors with multiple dimensions, and finding similar vectors using cosine similarity or related mechanisms, have also existed for a long time.
However, with LLMs and RAG, vector embeddings and similarity search have become quite popular. Vector RAG is one of the primary solution choices for RAG-based systems: it helps find similar content. This is not an exact key lookup; instead, it finds embedding vectors that are semantically similar. This capability was initially developed in vector databases, but it is now also available in caching systems like Redis, and cloud service providers have introduced it in their cache offerings too. We can define a Semantic Cache as a cache that fetches entries based on the meaning of a piece of content, instead of the hash of a key.

## How Semantic Cache is different

Let's do a quick differentiation between Cache, Semantic Cache, and RAG.

The following table clarifies the differences between a Traditional Cache and a Semantic Cache.

| Factors | Traditional Cache | Semantic Cache |
|:---:|:---:|:---:|
| Approach | Key based | Meaning based |
| Key Search | Exact match | Similar match |
| Best For | APIs | LLMs |
| Example | Key, Value (StudentId, StudentObj): `2025081520 → StudentObject1`, `2025091640 → StudentObject2`. If the key used at lookup differs at all, no result is returned. | "Where did Alex graduate from?" and "Which university did Alex graduate from?" both fetch the same entry from the cache. |

The following table clarifies the differences between RAG and a Semantic Cache.

| Factors | RAG | Semantic Cache |
|:---:|:---:|:---:|
| Use | To generate a new response | To return a previously generated and cached answer |
| Data | External data sources like vector databases | Previously generated questions and answers |
| Cost | Medium to High | Low |
| Best For | Grounded response generation | Repeated but similar questions |

We can see that the difference between RAG and Semantic Cache is simple, but the financial impact is big. RAG retrieves raw information and generates responses.
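To make the meaning-based lookup concrete, here is a minimal, self-contained sketch of the idea. It uses a toy bag-of-words "embedding" purely for illustration; a real system would use an embedding model:

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words 'embedding' -- a stand-in for a real embedding model."""
    vec = {}
    for w in text.lower().replace("?", "").split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# A traditional cache keyed on the exact string misses on any rephrasing:
cache = {"where did alex graduate from?": "MIT"}
print(cache.get("which university did alex graduate from?"))  # None

# A semantic lookup compares meanings instead of exact keys:
cached_q = "Where did Alex graduate from?"
new_q = "Which university did Alex graduate from?"
sim = cosine_similarity(embed(cached_q), embed(new_q))
print(f"similarity = {sim:.2f}")  # → similarity = 0.73
```

Even this crude vector puts the rephrased question close to the cached one, while the exact-key lookup finds nothing; real embedding models push semantically equivalent phrasings much closer together.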
If your questions are repetitive, caching answers wins because it bypasses retrieval, tool orchestration, and generation.

## Semantic Cache based RAG Architecture

I've designed the following architecture for a semantic cache on top of a standard RAG architecture.

### Components of the architecture

- **User:** From any client, such as a browser or mobile application, a Q&A API is invoked once a user enters a question in natural language.
- **API Gateway:** The API request is authenticated and authorised in the gateway.
- **GraphRAG + VectorRAG Agent System (FastAPI):** A LangGraph ReAct agent that orchestrates the workflow. It decides whether to check the cache, query the graph store or vector store, or respond to the user.
- **Guardrails:** Implemented via system prompts and specific instruction sets to ensure safe and accurate responses.
- **MCP (Model Context Protocol):**
  - **Semantic Cache MCP Client:** Connects to the cache server using Server-Sent Events (SSE).
  - **Semantic Cache MCP Server:** A FastMCP server that exposes tools (`cache_lookup`, `cache_store`) and manages cache server (Redis) interactions.
- **Semantic Cache Server (Redis):** Acts as the semantic cache.
It stores vector embeddings of questions and the corresponding answers to provide fast retrieval for similar queries.
- **LLM:** The reasoning engine for the agent, and the generator of Cypher queries for GraphRAG and of natural language responses.
- **Observability (LangSmith):** Provides tracing, monitoring, and debugging for the agent's execution steps.
- **Knowledge Graph (Neo4j):** Stores structured relationships (e.g., `(Person)-[:WORKS_AT]->(Organization)`) and is queried via Cypher generated by the LLM.
- **Vector Store (ChromaDB):** Stores the vector embeddings generated from the knowledge documents.

To implement semantic caching, the main addition to the standard RAG-based agent system is the new Semantic Cache MCP component.

1. The agent first does a cache lookup: the Semantic Cache MCP client uses the `cache_lookup` tool to check whether the question has already been answered and cached.
2. The Semantic Cache MCP server uses a `CACHE_SIMILARITY_THRESHOLD` of >95% to decide whether a similar question is cached. I tested similarity thresholds ranging from 80% to 99%. Below 90%, false positives increased and I started getting incorrect cached answers for loosely similar phrases. Above 97%, cache hit rates dropped significantly. In my use case, 95% provided the best trade-off between precision and reuse. The right balance, however, depends on your specific use case and data distribution.
3. On a cache hit, the answer is returned without making any RAG call: no vector DB call, no knowledge graph call, and, most importantly, no LLM call.
4. On a cache miss, the standard RAG flow executes and generates the response, then stores the question together with the answer in the cache server for use by subsequent calls.

The following code snippet shows the MCP server code for cache lookup. I've used KNN-based similarity search and return only the single closest match. I use the distance as the score and (1 − score) as the similarity.
One more important point: the use of DIALECT 2 is a must for vector queries in Redis; otherwise the query will fail.

```python
@mcp.tool()
def cache_lookup(question: str) -> str:
    """
    Look up a semantically similar question in the cache.

    Args:
        question: The question to look up

    Returns:
        JSON with cached answer if found (similarity >= 95%), or cache miss indication
    """
    global _hits, _misses
    client = get_redis_client()
    ensure_index_exists(client)

    # Compute embedding for the query
    query_embedding = compute_embedding(question)
    query_vector = np.array(query_embedding, dtype=np.float32).tobytes()

    # Vector similarity search
    try:
        q = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .return_fields("question", "answer", "expires_at", "score")
            .sort_by("score")
            .dialect(2)
        )
        results = client.ft(INDEX_NAME).search(q, query_params={"vec": query_vector})

        if results.total > 0:
            doc = results.docs[0]
            score = float(doc.score)
            similarity = 1 - score  # Convert distance to similarity
            expires_at = float(doc.expires_at)

            # Check expiration
            if time.time() > expires_at:
                logger.info(f"Cache entry expired for: '{question[:50]}...'")
                # Delete expired entry
                client.delete(doc.id)
                _misses += 1
                return json.dumps({"found": False, "reason": "expired"})

            # Check similarity threshold
            if similarity >= CACHE_SIMILARITY_THRESHOLD:
                _hits += 1
                logger.info(f"Cache HIT: similarity={similarity:.4f} for '{question[:50]}...'")
                return json.dumps({
                    "found": True,
                    "answer": doc.answer.decode() if isinstance(doc.answer, bytes) else doc.answer,
                    "similarity": round(similarity, 4),
                    "original_question": doc.question.decode() if isinstance(doc.question, bytes) else doc.question
                })

        _misses += 1
        logger.info(f"Cache MISS for: '{question[:50]}...'")
        return json.dumps({"found": False, "reason": "no_similar_question"})

    except Exception as e:
        logger.error(f"Cache lookup error: {e}")
        _misses += 1
        return json.dumps({"found": False, "reason": f"error: {str(e)}"})
```

The following code snippet shows the MCP server code for cache store.
I first calculate the embedding of the question, which is stored along with the raw question and the answer as the value; the key is the standard MD5 hash of the question.

```python
@mcp.tool()
def cache_store(question: str, answer: str) -> str:
    """
    Store a question-answer pair in the semantic cache.

    Args:
        question: The original question
        answer: The answer to cache

    Returns:
        JSON confirmation of storage
    """
    client = get_redis_client()
    ensure_index_exists(client)

    try:
        # Compute embedding
        embedding = compute_embedding(question)
        embedding_bytes = np.array(embedding, dtype=np.float32).tobytes()

        # Generate unique key
        key_hash = hashlib.md5(question.encode()).hexdigest()[:12]
        cache_key = f"cache:{key_hash}"
        current_time = time.time()

        # Store in Redis hash
        client.hset(cache_key, mapping={
            "question": question,
            "answer": answer,
            "embedding": embedding_bytes,
            "created_at": current_time,
            "expires_at": current_time + CACHE_TTL_SECONDS
        })

        # Set TTL on the key
        client.expire(cache_key, CACHE_TTL_SECONDS)

        logger.info(f"Cached answer for: '{question[:50]}...' (TTL: {CACHE_TTL_SECONDS}s)")
        return json.dumps({
            "stored": True,
            "key": cache_key,
            "ttl_seconds": CACHE_TTL_SECONDS
        })

    except Exception as e:
        logger.error(f"Cache store error: {e}")
        return json.dumps({"stored": False, "error": str(e)})
```

The following snapshot shows the caching of questions and answers in Redis along with their embeddings.

The following snippet shows the MCP client code to invoke `cache_lookup`:

```python
def cache_lookup(self, question: str) -> dict:
    return _run_async(self._req_with_connection("cache_lookup", question))
```

The following snippet shows the MCP client code to invoke `cache_store`:

```python
def cache_store(self, question: str, answer: str) -> dict:
    return _run_async(self._req_with_connection("cache_store", question, answer))
```

The following trace shows us the execution flow. The trace is highlighted at the `cache_lookup` call, and you can see the output with `cache_hit` as false and the reason `no_similar_question`. Thus, in this flow, the standard RAG pipeline is executed.
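Putting the two tools together, the agent-side decision on each question can be sketched roughly as below. This is a simplified illustration: `FakeCacheClient` and `run_rag_pipeline` are stand-ins for the real MCP client calls and the actual RAG chain:

```python
class FakeCacheClient:
    """In-memory stand-in for the Semantic Cache MCP client (illustration only)."""
    def __init__(self):
        self.store = {}

    def cache_lookup(self, question):
        # Exact match stands in for the real vector similarity search.
        if question in self.store:
            return {"found": True, "answer": self.store[question]}
        return {"found": False}

    def cache_store(self, question, answer):
        self.store[question] = answer

def answer_question(question, client, run_rag_pipeline):
    """Check the cache first; fall back to the RAG pipeline only on a miss."""
    result = client.cache_lookup(question)
    if result.get("found"):
        return result["answer"]          # cache hit: no RAG, no LLM call
    answer = run_rag_pipeline(question)  # cache miss: full RAG flow
    client.cache_store(question, answer)
    return answer

# Demo: the second identical question never reaches the RAG pipeline.
calls = []
def fake_rag(q):
    calls.append(q)
    return f"answer to: {q}"

client = FakeCacheClient()
answer_question("Where did Alex graduate from?", client, fake_rag)
answer_question("Where did Alex graduate from?", client, fake_rag)
print(len(calls))  # → 1
```

The real lookup replaces the exact-match check with the KNN query and threshold shown earlier, but the control flow of the agent is the same.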
You can see the `graphrag_query` node in the trace, followed by an LLM call.

Now it's the moment of truth: a similar question is asked again. We can see in the following snapshot that `cache_hit` is true. There is a slight change in the wording of the question, but it is similar to the previous one, so the similarity score is 0.9866, as shown below. The answer is therefore returned directly from the semantic cache, and you will not see the RAG flow execute.

## Result

Adding a semantic cache to a GraphRAG or VectorRAG system provided measurable performance and cost benefits. Previously, the system took 5 to 6 seconds per request, which included the vector search, knowledge graph Cypher query generation and execution, and final answer generation by the LLM. On a cache hit, answers were on the user's screen in 900 ms to 1.2 seconds. On average, this provided a 24% reduction in daily LLM calls, which slashed the LLM API bills.

## Semantic Cache Strategy

### Where Semantic Cache Wins

When a system handles high-volume, predictable traffic, a cache layer adds a performance benefit. It wins when many users ask about the same knowledge:

- **Customer Support & Helpdesks:** questions like "How do I reset my password?" or "What is your refund policy?"
- **Product Documentation:** users querying the same product manuals, and developers querying the same technical documentation and API docs.
- **Internal HR & Onboarding:** questions like "How do I enroll in health insurance?" and "What is the holiday schedule?" that pop up from every employee.
- **Compliance & Policy Q&A:** systems that provide standardized, vetted answers to regulatory or company policy inquiries.

### When to Avoid Semantic Caching

It fails specifically in systems that provide hyper-personalization.
If two users ask the same question but require different answers based on their medical history, profile, or account balance, a semantic cache can serve incorrect answers. It also fails when your documents, knowledge base, or data are volatile or real-time: when you need answers about stock prices or live inventory, a cache is your enemy. You need the RAG pipeline to see the current state of the world.

## Semantic Cache Invalidation Strategy

A cache needs to be refreshed, and a semantic cache needs an additional invalidation strategy. Every cache entry must be tagged with a `created_at` timestamp to manage its time-to-live (TTL). In a semantic cache, the `document_version` and the `embedding_model` used during generation should additionally be tagged. This is critical because if you upgrade the embedding model, your vector space shifts, which makes the old cache entries obsolete. Tracking these variables lets us trigger cache invalidation whenever a source document is updated or the embedding model is changed. This ensures that our system does not respond with inaccurate answers.

## Authorisation and Security Consideration

In any caching system, you must ensure that the user is authorised before data is served. You must also ensure that cached data is secured at rest. An implementation of a semantic cache must adhere to these principles; it should not be seen merely as a layer added for performance or cost benefits.

In a multi-tenant system, tenant isolation keys must be implemented in the cache. Similarly, in a hyper-personalised system, user-level namespace partitioning must be done in the cache. This guarantees that the vector similarity search is restricted to the specific bucket belonging to that specific user of the corresponding organisation.

The most important point about authorisation is the creation of permission-aware cache keys, where the hash of the key includes the user's specific roles or access levels.
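One simple way to build such a key is to fold the tenant and the user's role set into the key prefix, so entries from different permission scopes can never collide. This is a hypothetical sketch; the `tenant_id` and `roles` parameters are illustrative additions, not part of the `cache_store` code shown earlier:

```python
import hashlib

def permission_aware_cache_key(question: str, tenant_id: str, roles: list) -> str:
    """Build a cache key scoped to a tenant and role set (illustrative sketch)."""
    # Sort roles so ["admin", "viewer"] and ["viewer", "admin"] map to one scope.
    scope = f"{tenant_id}:{','.join(sorted(roles))}"
    scope_hash = hashlib.md5(scope.encode()).hexdigest()[:8]
    question_hash = hashlib.md5(question.encode()).hexdigest()[:12]
    return f"cache:{scope_hash}:{question_hash}"

# Same question, different privileges -> different cache buckets.
k1 = permission_aware_cache_key("What is the refund policy?", "acme", ["admin"])
k2 = permission_aware_cache_key("What is the refund policy?", "acme", ["viewer"])
print(k1 != k2)  # → True
```

Since the similarity search runs over the index rather than over keys, the scope would also need to be stored as a filterable field (for example a Redis TAG field added to the KNN query) so that vector search itself never crosses buckets.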
This ensures that a user with lower privileges does not get a cached answer generated for a more privileged user. A cache hit must not become an unauthorised data fetch mechanism: if a user does not have permission to fetch a particular document or database row via the RAG pipeline, they must not be able to pull the generated answer from the cache either.

Finally, to meet compliance requirements like HIPAA or SOC 2, your answers must be encrypted before they are cached. This is not negotiable. The semantic cache must follow the same cryptographic practices as your primary data store.

## Conclusion

The architecture blueprint, code snippets, and snapshots above show that it is possible to semantically cache questions and run a vector-embedding-based similarity search to return previously generated answers, avoiding RAG-based LLM calls for similar questions. The example shown here is fairly simple, but this architecture choice can be extended to the many LLM touch points in an agentic AI system, not only end-user questions and answers.

Semantic caching is a simple way to lower token costs while maintaining response quality. If your system gets similar queries to answer, adding a cache layer before the retrieval and generation steps is an advantage. The best part is that you don't have to overhaul your RAG architecture; you just optimize the flow by handling repetitive intent earlier in the chain.