A modern agent project depends on at least four services. There is the LLM proxy that fronts your model providers. There is the vector store that holds your embeddings. There is the observability backend that captures traces of every model call and tool invocation. And there is at least one MCP server exposing tools to the agent. Once these services are in place, your local machine is running a small distributed system, and the next immediate priority should be keeping your work reproducible and close to production.

I have done my fair share of developing locally and deploying with Docker, so in this article I will walk through the setup I use. The stack covers the LLM proxy, the vector store, the observability backend, an MCP tool server, and the agent application itself. The example agent is a docs research agent: it indexes a local documents folder, answers questions with citations, and sends every step to a trace backend.

## The agent

The example agent we are going to build does three things. It reads a folder of local documents, embeds the text, and stores the embeddings in a vector index. When a user asks a question, the agent retrieves the most relevant chunks, calls an LLM with those chunks in context, and returns an answer with source citations. Every model call and every retrieval step is logged to a trace backend for observability.

The local stack will have the following services:

- A model provider gateway. The agent calls an LLM, but it should not call OpenAI or Anthropic directly. A proxy in front of the model providers makes it easy to swap providers, track per-call cost, and add fallbacks without changing application code.
- A vector store. The agent needs a place to store embeddings and run similarity searches. For this stack, we will use Pinecone Local, the Docker emulator that mirrors the Pinecone API.
- An observability backend. Without traces, debugging an agent past the second or third tool call is unmanageable. We will use Langfuse, which is self-hostable and runs against its own Postgres image.
- An MCP server. The agent needs to access up-to-date documents from disk. We use the official filesystem MCP server, exposed over HTTP transport.
- The agent application. A Python service that ties the above together.

Optionally, we can add Ollama as a sixth service for completely offline iteration.

The full Compose file is shown below in pieces. Each section explains one block and the design decision behind it.

## LiteLLM proxy

The first service in the file is LiteLLM. It is a proxy that exposes a single OpenAI/Anthropic-compatible API and routes calls to whichever model provider you configure behind it. I used Anthropic for this setup:

```yaml
services:
  # ---- LLM proxy ----
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "127.0.0.1:4000:4000"
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
```

The matching `litellm-config.yaml` looks like this:

```yaml
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
```

The agent application calls `http://litellm:4000/v1/chat/completions` regardless of which underlying model it wants. The model name in the request body chooses the backend.

Putting a proxy in front of model providers lets you swap the agent's model by editing a single config file instead of touching application code. It also gives you a place to configure a fallback chain when a provider returns an error, rather than handling that inside your agent loop.
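From the application's side, calling the proxy is plain OpenAI-client code. A minimal sketch, assuming the `LITELLM_BASE_URL` variable that the agent service defines later in this file (the `api_key` value is a placeholder, since this dev proxy has no master key configured):

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy. The api_key is a
# placeholder: LiteLLM only enforces keys if you configure a master key.
client = OpenAI(base_url=os.environ["LITELLM_BASE_URL"] + "/v1", api_key="dev")

response = client.chat.completions.create(
    model="claude-haiku",  # must match a model_name entry in litellm-config.yaml
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```

Swapping `claude-haiku` for `gpt-4o-mini` reroutes the same request through OpenAI with no other change to the application.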
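What the fallback configuration can look like, as a hedged sketch based on LiteLLM's router settings (check the docs for your LiteLLM version for the exact syntax):

```yaml
# litellm-config.yaml (sketch): if a claude-haiku call fails, retry the
# same request against gpt-4o-mini before surfacing the error to the agent.
router_settings:
  fallbacks: [{"claude-haiku": ["gpt-4o-mini"]}]
```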
## Pinecone Local

For the local vector store, I wanted something that is widely used in production and doesn't require an account. I chose Pinecone Local mostly because I wanted the same client library locally and in production. I tried lighter local vector stores first, but switching APIs later became annoying.

```yaml
  # ---- Vector store ----
  # PINECONE_HOST is set to pinecone.local (with a dot) because the
  # Pinecone Python SDK rejects hostnames that have no dot and aren't
  # 'localhost'. The network alias below makes pinecone.local resolve
  # to this container inside the Compose network.
  pinecone:
    image: ghcr.io/pinecone-io/pinecone-local:latest
    environment:
      PORT: 5080
      PINECONE_HOST: pinecone.local
    networks:
      default:
        aliases:
          - pinecone.local
```

The agent connects to it using the standard Pinecone Python client, with the `host` argument pointing at the Compose alias instead of the hosted endpoint:

```python
import os

from pinecone import Pinecone

pc = Pinecone(api_key="pclocal", host=os.environ["PINECONE_HOST"])

def get_index():
    # Pinecone Local serves the data plane over HTTP. The SDK defaults
    # to HTTPS, which fails with an SSL handshake error against the
    # emulator. Strip whatever scheme describe_index returns and build
    # an http:// URL explicitly.
    desc = pc.describe_index(INDEX_NAME)
    host = desc.host.replace("https://", "").replace("http://", "")
    return pc.Index(name=INDEX_NAME, host=f"http://{host}")
```

While setting up the agent, I ran into two quirks in the SDK.

The first is hostname validation. The Pinecone Python client rejects any host that does not contain a dot and is not literally `localhost`. A bare Compose service name like `pinecone` fails this check, which is why the YAML above gives the container a network alias, `pinecone.local`. The alias resolves to the same container inside the Compose network, but it satisfies the SDK's validator.

The second is the data plane scheme. When the agent calls `pc.Index(name)`, the SDK fetches the index host from the controller and connects to it over HTTPS by default. Pinecone Local only speaks plain HTTP, so the connection fails with an SSL handshake error. The fix is to override the host with an explicit `http://` URL when constructing the `Index` handle, which is what the `get_index()` helper above does. Neither of these quirks is documented prominently, but both are easy to handle once you know they exist.

## Langfuse: traces

The third service is the observability backend. Langfuse is an open-source LLM observability platform that captures traces of model calls, tool calls, and retrieval steps. It runs locally as two containers: the Langfuse server itself, and a Postgres database that backs it.

```yaml
  langfuse-db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - langfuse-data:/var/lib/postgresql/data

  langfuse:
    image: langfuse/langfuse:2
    depends_on:
      - langfuse-db
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:langfuse@langfuse-db:5432/langfuse
      NEXTAUTH_SECRET: dev-secret
      SALT: dev-salt
      NEXTAUTH_URL: http://localhost:3000
      TELEMETRY_ENABLED: "false"
```

After the first boot, you visit http://localhost:3000, create a project, and copy the public and secret keys into your agent's environment. From the agent's side, wiring is two lines:

```python
from langfuse.decorators import observe

@observe()
def answer_question(query: str) -> str:
    ...  # retrieval, model call, response
```

The `@observe()` decorator captures the function call as a trace span. Nested decorated calls become child spans automatically. The Langfuse SDK auto-instruments calls made through the OpenAI client, so calls passing through the LiteLLM proxy show up as model spans inside the trace without any extra code.
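The auto-instrumentation comes from Langfuse's drop-in OpenAI wrapper. A minimal sketch, assuming the v2 SDK's `langfuse.openai` module and the Langfuse keys already present in the environment:

```python
import os

from langfuse.decorators import observe
# Drop-in replacement for the openai package: same client interface, but
# every completion call is recorded as a generation on the active trace.
from langfuse.openai import OpenAI

llm = OpenAI(base_url=os.environ["LITELLM_BASE_URL"] + "/v1", api_key="dev")

@observe()
def generate(prompt: str) -> str:
    # This call shows up as a child span of generate(), with the model name,
    # token counts, and latency attached.
    response = llm.chat.completions.create(
        model="claude-haiku",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```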
The first time you debug an agent that does retrieval, makes two model calls, and invokes a tool, the value of this becomes obvious. Print statements force you to reconstruct what happened from unstructured stdout. A trace gives you the actual call tree, with the inputs, outputs, latency, and token cost of each step.

A typical docs-research call has the following trace shape:

```
ask (1.8s)
├── retrieve (0.3s)
│   └── pinecone.query (0.2s)
└── generate (1.4s)
    └── claude-haiku
```

When the answer is wrong, you do not have to guess whether the bug is in retrieval or in generation. The trace tells you which step took which inputs and what came back.

## MCP filesystem server

The fourth service is an MCP server. MCP, the Model Context Protocol, is the protocol most agent frameworks now use to expose tools to a model. The official filesystem server gives the agent a set of tools for reading and listing files inside a sandboxed directory.

```yaml
  mcp-filesystem:
    image: mcp/filesystem:latest
    volumes:
      - ./docs:/data/docs:ro
    command: ["--transport", "http", "--port", "8080", "/data/docs"]
```

The volume mount is the important part. The MCP server only sees what is under `/data/docs`, which maps to the local `./docs` folder on your laptop. Mounting it read-only also stops the server from accidentally writing back into your source tree.

The agent connects to the MCP server over HTTP:

```python
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async with streamablehttp_client("http://mcp-filesystem:8080") as (read, write, _):
    async with ClientSession(read, write) as session:
        await session.initialize()
        tools = await session.list_tools()
```

Stdio transport is fine when the MCP server runs as a subprocess of the agent. In a Compose-based dev setup, the MCP server is its own container, so HTTP transport (or the newer Streamable HTTP transport) is the only option that fits the network model.

If you write a custom MCP server for your project, you add it to the Compose file the same way: an image, a port, the relevant volumes or environment, and a command that starts it on the chosen transport. The agent does not change. It just learns about a new MCP endpoint through configuration.

With the filesystem MCP server running and the agent configured to use it, the docs research agent can answer questions about a folder of markdown files without any of the file-reading code living inside the agent itself. The MCP server does the file access. The agent calls the tool.
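Calling one of those tools uses the same session object. A sketch of a single tool call; the `read_file` name and its `path` argument are assumptions here, since the authoritative tool names and schemas come back from `list_tools()`:

```python
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def read_doc(path: str) -> str:
    async with streamablehttp_client("http://mcp-filesystem:8080") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # "read_file"/"path" are assumed names; confirm them against the
            # tool schemas returned by session.list_tools().
            result = await session.call_tool("read_file", {"path": path})
            # Tool output arrives as a list of content parts; concatenate the
            # text parts to get the file body.
            return "".join(
                part.text for part in result.content if hasattr(part, "text")
            )
```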
## The agent app

The fifth service is your code. The interesting design decision in this section is the difference between the dev image and the prod image.

```yaml
  agent:
    build:
      context: .
      dockerfile: Dockerfile.dev
    depends_on:
      - litellm
      - pinecone
      - langfuse
      - mcp-filesystem
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./src:/app/src
      - ./docs:/app/docs:ro
    environment:
      LITELLM_BASE_URL: http://litellm:4000
      PINECONE_HOST: http://pinecone.local:5080
      LANGFUSE_HOST: http://langfuse:3000
      MCP_FILESYSTEM_URL: http://mcp-filesystem:8080
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY:-}
      LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY:-}
    command: ["uvicorn", "agent.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
```

The requirements.txt file installs the required dependencies:

```
fastapi
uvicorn[standard]
openai
pinecone
sentence-transformers
langfuse>=2
```

The main module indexes the docs folder on startup and exposes a single `/ask` endpoint:

```python
import os
from pathlib import Path

from fastapi import FastAPI
from langfuse.decorators import observe
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

INDEX_NAME = "docs"
EMBED_DIM = 384  # matches all-MiniLM-L6-v2; adjust for your embedding model

app = FastAPI()
llm = OpenAI(base_url=f"{os.environ['LITELLM_BASE_URL']}/v1", api_key="dev")
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
pc = Pinecone(api_key="pclocal", host=os.environ["PINECONE_HOST"])
# get_index() is the helper from the Pinecone section above.

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    # chunk_size default is illustrative; pick what fits your docs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if len(current) + len(p) > chunk_size and current:
            chunks.append(current.strip())
            current = p
        else:
            current = (current + "\n\n" + p) if current else p
    if current:
        chunks.append(current.strip())
    return chunks

@app.on_event("startup")
def index_docs():
    existing = [i["name"] for i in pc.list_indexes()]
    if INDEX_NAME not in existing:
        pc.create_index(
            name=INDEX_NAME,
            dimension=EMBED_DIM,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )
    index = get_index()
    docs_path = Path("/app/docs")
    vectors = []
    for md_file in sorted(docs_path.glob("*.md")):
        text = md_file.read_text()
        chunks = chunk_text(text)
        embeddings = embed_model.encode(chunks).tolist()
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            vectors.append({
                "id": f"{md_file.name}-{i}",
                "values": emb,
                "metadata": {"file": md_file.name, "text": chunk},
            })
    if vectors:
        index.upsert(vectors=vectors)
        print(f"Indexed {len(vectors)} chunks from {docs_path}")

class Question(BaseModel):
    query: str

@observe()
def retrieve(query: str, top_k: int = 3):
    index = get_index()
    embedding = embed_model.encode([query])[0].tolist()
    results = index.query(vector=embedding, top_k=top_k, include_metadata=True)
    return results.matches

@observe()
def generate(query: str, chunks) -> str:
    context = "\n\n".join(
        f"[{c.metadata['file']}]\n{c.metadata['text']}" for c in chunks
    )
    response = llm.chat.completions.create(
        model="claude-haiku",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite the source file in brackets."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

@app.post("/ask")
@observe()
def ask(question: Question):
    chunks = retrieve(question.query)
    answer = generate(question.query, chunks)
    return {"answer": answer, "sources": [c.metadata["file"] for c in chunks]}
```

Source code comes in through a bind mount at runtime. Combined with `uvicorn --reload`, this gives you sub-second iteration: you edit a file in your editor, the change is visible inside the container immediately, and uvicorn restarts the server.

The production Dockerfile copies the source code into the image and does not depend on bind mounts. This separation lets the dev loop stay fast without compromising the prod image.

The `depends_on` block is also important. It tells Compose to start the dependencies before the agent service. It does not, however, wait for them to be healthy. If you need readiness gating, add a `healthcheck` block to each dependency and use `depends_on` with `condition: service_healthy` in the agent service. For most dev stacks, the simpler form is enough.
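For reference, a sketch of that gating for the LiteLLM dependency. The probe command is an assumption: it presumes the image ships curl and that the proxy answers a cheap liveness route; substitute whatever request your proxy actually serves.

```yaml
  litellm:
    # ...same service definition as above, plus:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health/liveliness"]
      interval: 5s
      timeout: 3s
      retries: 10

  agent:
    depends_on:
      litellm:
        condition: service_healthy
```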
The environment variables follow a single pattern. Each service that the agent talks to has a `*_URL` or `*_HOST` variable. In dev, these point at Compose service names like `litellm` and `pinecone`. In prod, the same variables point at hosted endpoints. The application code reads the variables and does not care about the difference.

A typical iteration cycle looks like this. You edit a function in `src/agent/retrieval.py`. uvicorn detects the change and restarts in about half a second. You re-issue the same request through the API. The new behavior is live, and the trace appears in Langfuse a second later. No image rebuild. No container restart.

## Optional: Ollama

The optional sixth service is Ollama. It runs small models locally and is useful when you want to iterate without burning API credits, or when you are working without a network connection.

```yaml
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
```

To wire Ollama into the LiteLLM proxy, add a model entry to `litellm-config.yaml`:

```yaml
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434
```

The agent then asks for `llama-local` instead of `claude-haiku` when you want to stay offline. Quality will be lower for any non-trivial reasoning, but the loop runs without network calls.

Custom MCP servers follow the same pattern as the filesystem server. Build them as their own image, add a service block, expose the port the agent talks to, and the agent learns about the new tools through configuration. The Compose file is where you compose. The agent does not need to know whether a tool comes from a pre-built MCP server or from one you wrote yesterday.

There are several things this stack does not handle:

- There are no GPUs. This is a CPU-only dev stack. If your agent needs to run a large local model or do its own embedding with a GPU-bound model, you will need a different setup.
- There is no real auth. Langfuse uses dev keys, Pinecone Local accepts any API key, and there is no API gateway in front of the agent. Real auth belongs in the production environment, and the dev stack should make it easy to bypass during iteration.
- Pinecone Local is not a perfect mirror of hosted Pinecone. The differences are documented and usually small, but they exist.
- Observability uses dev keys and a single project. Production Langfuse should be configured with real authentication, retention policies, and probably a managed Postgres instance.

## Conclusion

The full Compose file is around 90 lines. You can take it as written, swap the agent service for your own application, and have a local agent stack that mirrors what your production looks like. Adjust the model list, swap the MCP servers for your own, and keep the rest. The shape of the stack is more useful than any individual choice inside it.