Debugging the undebuggable: building observability into probabilistic AI systems

Debugging used to be straightforward: a service failed, you checked the logs, followed the stack trace, and fixed the bug. With AI systems, especially those powered by LLMs and agent workflows, that approach breaks down quickly.

The problem is not only that the system is more complex. Failures are no longer deterministic. The system may return a different answer for the same input. A tool might silently fail. Retrieval might return low-quality or noisy context. Nothing overtly “crashes,” but something is clearly wrong.

This tutorial focuses on a practical question: How do we debug a system that doesn’t fail in obvious ways? To tackle this question, we’ll build a small AI service and, more importantly, instrument it so that we can actually understand what’s happening inside it.

Why debugging AI systems feels different

Traditional debugging relies on three assumptions:

- Inputs lead to predictable outputs
- Failures throw errors
- Logs tell the full story

None of these holds for AI systems. Instead, we deal with:

- Non-deterministic outputs
- Hidden reasoning steps
- External dependencies (retrieval, APIs, tools)
- Large, dynamic prompts

This means debugging must shift from log-based thinking to observability-driven engineering.

What we’re building

We’ll create a simple AI question-answering service with:

- Retrieval (vector search + reranking)
- External tool calls
- LLM reasoning
- Structured output validation
- Observability (tracing + logging + token estimation)

The focus is not just on building it, but on making it debuggable.

Architecture overview: a debuggable AI system

This architecture highlights a key shift in modern AI systems: Observability is a core component rather than an afterthought. Each stage of the workflow, from retrieval to tool execution to model reasoning, is instrumented, enabling engineers to trace decision-making. This makes it possible to debug not just failures but also unexpected behaviors, which are far more common in AI systems than in traditional software.

Step 1: install dependencies

```bash
pip install fastapi uvicorn \
  langchain langchain-openai langchain-community \
  faiss-cpu rank-bm25 \
  httpx tenacity \
  opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  tiktoken pydantic
```

We explicitly include OpenTelemetry because debugging AI systems without tracing is like flying blind.

Step 2: initialize the model (with production controls)

```python
import os

from langchain_openai import ChatOpenAI

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY must be set")

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    # JSON mode: the model must return a single JSON object.
    model_kwargs={"response_format": {"type": "json_object"}},
    openai_api_key=api_key,
    request_timeout=30,
    max_retries=2,
)
```

Timeouts and retries are not optional. When something fails, you need to know if it’s your system or the model provider.

Step 3: add retrieval (and make it observable)

```python
# Document lives in langchain_core; the older langchain.docstore path is deprecated.
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(page_content="Observability helps debug AI systems."),
    Document(page_content="Retrieval quality impacts model output."),
    Document(page_content="Tracing reveals hidden execution paths."),
]

embeddings = OpenAIEmbeddings()
index = FAISS.from_documents(docs, embeddings)
```

Now, retrieval:

```python
from rank_bm25 import BM25Okapi


def retrieve(query: str):
    results = index.similarity_search(query, k=5)
    if not results:
        # BM25 cannot be built over an empty corpus.
        return []

    # Add lexical reranking
    corpus = [doc.page_content.split() for doc in results]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.split())

    # Sort on the score only. Without a key, tied scores would fall back to
    # comparing Document objects, which raises a TypeError.
    ranked = sorted(
        zip(scores, results), key=lambda pair: pair[0], reverse=True
    )

    return [
        {
            "text": doc.page_content,
            "source": doc.metadata.get("source", "internal"),
        }
        for _, doc in ranked[:3]
    ]
```

Debugging insight

If the retrieval is wrong, everything downstream is wrong as well. Always log what documents were retrieved.

Step 4: safe tool execution

```python
from urllib.parse import urlparse

import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Restrict outbound requests to trusted domains only.
ALLOWED_DOMAINS = {
    "api.trusted-source.com",
    "documentation.org",
}


@retry(
    # Retry transport and HTTP status errors only. An allowlist rejection
    # (ValueError) is permanent and should fail immediately.
    retry=retry_if_exception_type(httpx.HTTPError),
    stop=stop_after_attempt(2),
    wait=wait_exponential(min=1, max=4),
    reraise=True,
)
async def fetch_external(url: str) -> str:
    parsed_url = urlparse(url)

    # Prevent SSRF and internal network probing. Compare the hostname rather
    # than netloc, which can also carry a port or credentials.
    if parsed_url.hostname not in ALLOWED_DOMAINS:
        raise ValueError(
            f"URL domain '{parsed_url.hostname}' is not allowed."
        )

    async with httpx.AsyncClient(
        timeout=10, follow_redirects=False
    ) as client:
        response = await client.get(url)
        response.raise_for_status()

    # Truncate response to control token usage.
    return response.text[:3000]
```

Production AI systems should never allow unrestricted outbound requests from model-generated inputs. Without domain allowlists, agents can become SSRF vectors that can probe internal services, cloud metadata endpoints, or private infrastructure. Restricting outbound access to trusted domains is a minimal production safeguard.

Debugging insight

Tool failures are silent killers. Without retries and logging, you won’t know if:

- The tool failed
- The tool returned empty data
- The model ignored the tool

Step 5: token visibility (not exact, but useful)

```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o-mini")


def estimate_tokens(messages):
    """
    Approximate token usage for OpenAI-style chat payloads.

    Note: This is still an estimate. Real usage depends on:
    - system prompts
    - retrieved context
    - tool call arguments
    - provider-specific formatting
    - output tokens
    """
    # Approximate overhead used by OpenAI chat formatting.
    tokens_per_message = 3
    tokens_per_name = 1

    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            if isinstance(value, str):
                total += len(encoder.encode(value))
            if key == "name":
                total += tokens_per_name

    # Assistant reply priming tokens.
    total += 3
    return total
```

Token counting should be treated as an operational estimate, not an exact billing mechanism. Real request cost depends on the full message payloads, retrieved context, tool-call arguments, system prompts, provider-side formatting, and generated output tokens. Even approximate tracking, however, is extremely useful for debugging runaway agents and monitoring cost regressions in production systems.

Debugging insight

Unexpected cost spikes often come from:

- Large retrieved context
- Repeated loops
- Oversized prompts

Step 6: build the agent workflow (deterministic)

```python
def run_workflow(question: str):
    # Step 1: Retrieve
    context = retrieve(question)
    context_text = "\n".join([c["text"] for c in context])

    # Step 2: Build the prompt. JSON mode (enabled in Step 2) requires the
    # prompt itself to mention JSON, so the expected shape is spelled out.
    messages = [
        {
            "role": "system",
            "content": (
                "Answer clearly using the provided context. "
                'Respond with a JSON object of the form {"answer": "..."}.'
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context_text}\n\nQuestion: {question}",
        },
    ]

    # Step 3: Estimate prompt size, then call the model.
    tokens = estimate_tokens(messages)
    response = llm.invoke(messages)

    return {
        "raw_output": response.content,
        "sources": [c["source"] for c in context],
        "token_estimate": tokens,
    }
```

Debugging insight

We’re sticking to a deterministic flow: the steps always run in the same order, so we don’t have to deal with tools or agents acting on their own.

Step 7: validate output (guardrails)

```python
import json

from pydantic import BaseModel


class OutputSchema(BaseModel):
    answer: str
    sources: list[str]
    token_estimate: int


def validate_output(raw: dict) -> dict:
    try:
        parsed = json.loads(raw["raw_output"])
    except json.JSONDecodeError:
        parsed = None

    # Fall back to the raw text if the model did not return a JSON object.
    if not isinstance(parsed, dict):
        parsed = {"answer": raw["raw_output"]}

    # Sources and token estimate come from the pipeline, not from the model.
    # A missing or non-string "answer" raises a pydantic ValidationError.
    validated = OutputSchema(
        **{
            **parsed,
            "sources": raw["sources"],
            "token_estimate": raw["token_estimate"],
        }
    )
    return validated.model_dump()  # pydantic v2; use .dict() on v1
```

Debugging insight

Failures here tell you:

- The model ignored instructions
- The output format changed
- Something upstream corrupted the context

Step 8: add observability

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import (
    HTTPXClientInstrumentor,
)

# Configure tracer provider.
provider = TracerProvider()

# For local debugging:
# export traces to console.
#
# In production, prefer:
# OTLPSpanExporter -> OpenTelemetry Collector -> Jaeger/Grafana/etc.
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Create application.
app = FastAPI()

# Instrument FastAPI and outbound HTTP calls.
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
```

Instrumentation alone does not make traces visible. OpenTelemetry requires a tracer provider, span processor, and exporter to record and emit telemetry data. For local debugging, a console exporter is sufficient. In production systems, traces are typically exported over OTLP to a collector and from there to platforms such as Jaeger, Grafana Tempo, Datadog, or Honeycomb.

Add endpoint:

```python
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel


class Query(BaseModel):
    question: str


@app.post("/ask")
async def ask(q: Query):
    # run_workflow is synchronous, so run it off the event loop.
    result = await run_in_threadpool(run_workflow, q.question)
    output = validate_output(result)
    return output
```

What you can now debug

With this setup, you can answer questions like:

- “Why was the answer wrong?” Check retrieved documents.
- “Why did the output change?” Compare context and token size.
- “Why is latency high?” Trace LLM vs. tool vs. retrieval.
- “Why is the cost increasing?” Inspect token estimates and context size.

Engineering principle: make AI systems observable

AI systems are not just models. They are pipelines made up of:

- Retrieval
- Reasoning
- Tools
- Validation

Each part can fail independently. You need visibility at every step, or you’re essentially debugging in the dark.

Production lessons

- Logs are not enough: you need traces that show the full execution path.
- Retrieval errors look like model errors: always inspect the context first.
- Tool failures are often silent: add retries and instrumentation.
- Token growth is a hidden risk: monitor prompt size continuously.
- A deterministic workflow simplifies debugging: fewer moving parts means fewer unknowns.

Conclusion

The takeaway is that, since you’re dealing with a probabilistic system, your debugging tools and approach have to change. By introducing observability, deterministic workflows, structured validation, and proper tracing, you’re setting yourself up to see where the logic goes sideways (because it will).

The post Debugging the undebuggable: building observability into probabilistic AI systems appeared first on The New Stack.
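The production lessons above call for visibility at every pipeline stage, including a record of which documents were retrieved. A minimal sketch of that idea follows, using only the standard library as a stand-in for manual tracing spans. The names `stage`, `STAGE_RECORDS`, and `answer_with_stages`, along with the stubbed retrieval and model steps, are hypothetical and not part of the service built in this tutorial.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")

# In-memory list of stage records, kept only so the sketch is easy to inspect.
STAGE_RECORDS: list[dict] = []


@contextmanager
def stage(name: str, **attributes):
    """Record duration, status, and attributes for one pipeline stage."""
    record = {"stage": name, "status": "ok", **attributes}
    start = time.perf_counter()
    try:
        # The caller can attach more attributes to the record inside the block.
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        record["duration_ms"] = round(elapsed_ms, 2)
        STAGE_RECORDS.append(record)
        logger.info(json.dumps(record))


def answer_with_stages(question: str) -> str:
    """Hypothetical workflow with stubbed retrieval and model steps."""
    with stage("retrieve", question=question) as rec:
        # Stub: a real service would call its retriever here.
        context = [
            {"text": "Tracing reveals hidden execution paths.",
             "source": "internal"}
        ]
        rec["sources"] = [c["source"] for c in context]
        rec["num_docs"] = len(context)

    with stage("llm") as rec:
        prompt = f"Context: {context[0]['text']}\nQuestion: {question}"
        rec["prompt_chars"] = len(prompt)
        # Stub: a real service would call the model here.
        return "stub answer"
```

Each stage emits one JSON record with its duration, status, and whatever attributes the stage attached, so a slow or failing step can be told apart from the others. In a service that already configures OpenTelemetry, the same pattern would likely map onto `tracer.start_as_current_span(...)` and `span.set_attribute(...)` rather than a custom context manager.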