Building a Multi-Agent AI System in Python for More Reliable Outputs


It started with a straightforward goal: build an AI that could research a topic and return a clean, structured answer.

On paper, this is exactly what modern large language models are supposed to do. You give a prompt, you get an answer. Simple.

In practice, it wasn't.

The outputs were inconsistent. Sometimes insightful, sometimes shallow. Sometimes they worked, but they were often flawed in subtle ways:

- Important steps were skipped
- Information felt incomplete
- The structure was inconsistent
- And worst of all, it sometimes confidently hallucinated

The problem wasn't the model's intelligence. It was the expectation that one system could do everything at once.

That assumption had to change.

I Had to Rethink My Approach

Instead of forcing one model to handle everything, I asked a different question:

What if AI worked more like a team instead of a single brain?

In real-world workflows, complex tasks are rarely handled by one person. There's collaboration:

- Someone who plans
- Someone who researches
- Someone who writes or delivers

So I broke the system into roles, with each agent handling a specific responsibility.

Then there is something critical: tools.

Because in reality, people don't just "think"; they use tools.

- Researchers use search engines
- Analysts use databases
- Engineers use calculators
- Writers use references

So why should AI agents be any different?

That decision changed everything.

Multi-Agent Architecture

Here's the structure that replaced the single LLM call:

[Figure: System Architecture Overview]

This architecture introduces a clear flow:

- The Planner decides what needs to be done
- The Researcher gathers information using tools
- The Writer transforms raw input into structured output

The result is not just better outputs; it's a more reliable and grounded system.

Model Flexibility

This architecture is model-agnostic. You can plug in:

- Claude (Anthropic)
- Gemini (Google DeepMind)
- Open-source models via Hugging Face
- Local models via Ollama

Each agent depends on a callable interface, not the model itself.

For this walkthrough, we'll use GPT-4.

Step 1: Setting Up Your Environment

```
pip install openai
```

Set your OpenAI API key (for example, via the OPENAI_API_KEY environment variable).

Step 2: Creating a Base LLM Function

```python
import openai

def call_llm(messages):
    # Note: openai.ChatCompletion is the pre-1.0 SDK interface
    # (pin openai<1.0, or adapt this call to the newer client).
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages
    )
    return response['choices'][0]['message']['content']
```

Step 3: Building the Planner Agent

```python
def planner_agent(task):
    prompt = f"""
    You are a planner agent.
    Break this task into clear, specific, non-overlapping steps.

    Task: {task}

    Output as a numbered list.
    """
    return call_llm([{"role": "user", "content": prompt}])
```

Why this matters:

- Logical sequencing
- Clear scope
- Reduced ambiguity

Step 4: Creating the Research Agent (Now With Tools)

This is where the system becomes powerful. Instead of relying only on the LLM, the Research Agent uses tools to fetch real information.

Example Tools

```python
def search_tool(query):
    # Simulated search (replace with a real API like SerpAPI, Tavily, etc.)
    return f"Search results for: {query}"

def knowledge_base_tool(query):
    # Simulated structured knowledge lookup
    return f"Knowledge base info about: {query}"
```
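If you want to ground the system in real data right away, the simulated search_tool can be swapped for a thin wrapper around a search API. Here's a minimal sketch using SerpAPI's REST endpoint via requests; the endpoint, parameters, and response fields reflect SerpAPI's public docs as I understand them, and the SERPAPI_API_KEY variable name is just my own choice, so double-check both before relying on it. Tavily or an internal index can be wrapped the same way.

```python
import os
import requests

def search_tool(query):
    # Sketch of a real search_tool backed by SerpAPI (endpoint and fields assumed from its docs).
    response = requests.get(
        "https://serpapi.com/search.json",
        params={
            "q": query,
            "engine": "google",
            "api_key": os.environ["SERPAPI_API_KEY"],  # assumed env var name
        },
        timeout=30,
    )
    response.raise_for_status()
    results = response.json().get("organic_results", [])

    # Compact the top hits into a text block the LLM can ground its answer in.
    snippets = [f"- {r.get('title', '')}: {r.get('snippet', '')}" for r in results[:5]]
    return "\n".join(snippets) or f"No search results for: {query}"
```

The knowledge_base_tool can be wrapped around an internal database or a vector store such as FAISS or Pinecone in the same way.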
Research Agent Using Tools

```python
def research_agent(step, context=""):
    # Always use tools first
    search_results = search_tool(step)
    kb_results = knowledge_base_tool(step)

    prompt = f"""
    You are a research agent.

    Context so far:
    {context}

    External data:
    {search_results}
    {kb_results}

    Task:
    {step}

    Provide detailed, factual, and non-repetitive insights.
    Ground your answer strictly in the external data.
    """

    return call_llm([{"role": "user", "content": prompt}])
```

What this fixes:

Previously:

- Repetitive
- Generic
- Sometimes hallucinated

Now:

- Grounded in real data
- More specific
- More trustworthy

Tools Used in This System

The Research Agent relies on external tools to ground its outputs in real information rather than pure model generation. This implementation uses two main types of tools: a web search tool (simulated here, but typically implemented with APIs like SerpAPI or Tavily) and a knowledge base retrieval tool for structured, domain-specific lookups. These tools let the agent fetch real-world context before generating responses, improving accuracy and reducing hallucinations. To use real versions, you can obtain API keys from services such as SerpAPI (for web search results) or Tavily (for LLM-optimized search), or connect to internal databases and vector stores like FAISS or Pinecone for knowledge retrieval. These APIs are then wrapped as simple Python functions that the agent calls before passing data into the LLM for reasoning and summarisation.

Step 5: Designing the Writer Agent

```python
def writer_agent(research_notes):
    prompt = f"""
    You are a writer agent.
    Create a clear, well-structured report using ONLY the information below:

    {research_notes}

    Use headings, subheadings, and bullet points.
    Do not add new information.
    Ensure clarity and flow.
    """

    return call_llm([{"role": "user", "content": prompt}])
```

Why this matters:

- Works with curated input
- No guessing
- No hallucination

Step 6: The Orchestrator, Where Everything Connects

```python
def validate_output(output):
    # Reject empty or trivially short research results
    return output and len(output.strip()) > 50


def run_multi_agent_system(task):
    print("Planning...")
    plan = planner_agent(task)

    steps = plan.split("\n")

    research_notes = []
    context = ""

    for step in steps:
        if step.strip():
            print(f"Researching: {step}")
            result = research_agent(step, context)

            if validate_output(result):
                research_notes.append(result)
                context += "\n" + result

    print("Writing final output...")
    final_output = writer_agent("\n".join(research_notes))

    return final_output
```
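One caveat: plan.split("\n") hands the Research Agent raw lines from the planner, numbering and all. A small helper like this (my own illustrative addition, not part of the original system) can clean the plan first; the numbering pattern it strips is an assumption about how GPT-4 formats the list:

```python
import re

def parse_plan(plan):
    # Illustrative helper: turn the planner's numbered list into clean step strings.
    steps = []
    for line in plan.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Strip leading "1." / "1)" style numbering (assumed plan format).
        steps.append(re.sub(r"^\d+[.)]\s*", "", line))
    return steps
```

In run_multi_agent_system you would then loop over parse_plan(plan) instead of plan.split("\n").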
Step 7: Running the System

```python
result = run_multi_agent_system("Explain how electric cars work")
print(result)
```

What Happened When I Tested It (And Why It Still Wasn't "Perfect")

After wiring everything together, I expected something close to a fully reliable AI research assistant. So I ran the first real test:

"Explain how electric cars work."

The system kicked off exactly as designed. The Planner broke it into steps like:

1. What is an electric car?
2. How does the battery work?
3. How does the motor convert energy?
4. Charging systems
5. Advantages vs combustion engines

So far, so good.

Then the Research Agent started working through each step using the tools. It pulled structured-looking outputs like:

- "Search results for: battery in electric car"
- "Knowledge base info about: electric motor operation"

And the Writer Agent finally assembled everything into a clean report.

At first glance, it looked impressive. But when I read it properly, something was off.

Not broken. Just not sharp enough.

Where the System Still Fell Short

The output had structure, but not depth.

For example, in the electric motor section, it said something like:

"Electric cars use motors to convert electrical energy into mechanical energy to move the vehicle."

Technically correct. But also painfully generic. It felt like a textbook summary that avoided all the interesting parts:

- No mention of torque curves
- No explanation of instant acceleration
- No clarity on regenerative braking mechanics
- No comparison of AC and DC motor types in EVs

And worse: even though the Research Agent was using tools, the results were still being flattened by the Writer Agent.

So I had unintentionally built a system that was well-structured but low-resolution.

That was a key realization. Structure alone doesn't guarantee intelligence.

The Real Problem Wasn't the Agents, It Was the Information Flow

At this point, I noticed something subtle but important: each agent was doing its job correctly in isolation, but the handoffs between them were losing value.

- The Planner created abstract steps
- The Research Agent produced fragmented tool outputs
- The Writer compressed everything too aggressively

So by the time the final output was generated, the system had effectively "averaged out" the knowledge. It wasn't hallucinating. But it also wasn't thinking deeply.

The First Improvement: Forcing Evidence-Rich Research

The first fix I introduced was surprisingly simple: instead of allowing the Research Agent to return loose summaries, I forced it to produce evidence-dense structured notes.

So I changed the prompt slightly:

Provide:
- Key facts
- Mechanisms
- Edge cases
- Examples
- Any numerical data if available

Do not summarize. Preserve detail.
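In code, the change amounts to swapping the closing instructions of research_agent's prompt. Here's a minimal sketch of the revised version; the surrounding structure is unchanged, and the exact prompt wording is my paraphrase of the change described above rather than the verbatim final prompt:

```python
def research_agent(step, context=""):
    search_results = search_tool(step)
    kb_results = knowledge_base_tool(step)

    # Same structure as before; only the closing instructions change.
    prompt = f"""
    You are a research agent.

    Context so far:
    {context}

    External data:
    {search_results}
    {kb_results}

    Task:
    {step}

    Provide:
    - Key facts
    - Mechanisms
    - Edge cases
    - Examples
    - Any numerical data if available

    Do not summarize. Preserve detail.
    Ground your answer strictly in the external data.
    """

    return call_llm([{"role": "user", "content": prompt}])
```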
This single change shifted everything.

Now, instead of:

"Electric motors convert energy into motion"

I started getting outputs like:

"Electric motors generate torque instantly because they do not rely on combustion cycles. Most EVs use AC induction or permanent magnet synchronous motors. Regenerative braking converts kinetic energy back into electrical energy, improving efficiency by 10–25% depending on driving conditions."

That was already a big jump. But something still felt missing.

The Second Issue: Context Was Not Being Preserved Properly

Even with better research output, each step was still being treated independently. The system didn't truly accumulate understanding.

It was more like:

10 separate notes → compressed into 1 final essay

Not:

continuous reasoning building toward an explanation

So I added a small but powerful change:

Persistent Context Memory Between Steps

I modified the orchestrator:

```python
context += "\n[STEP INSIGHT]\n" + result
```

Now every new research step could see what came before it. This meant the system stopped repeating itself and started building layered understanding.

The Third Improvement: Introducing "Depth Prompts"

The biggest breakthrough came when I stopped treating the Writer Agent as a passive formatter. Instead, I made it actively question the research.

I updated the writer prompt:

Before writing the final report, identify:
- Missing explanations
- Weak reasoning
- Overly generic statements
Then refine the structure before writing.

This effectively turned the Writer Agent into a lightweight critic, not just a summarizer. (A sketch of the refined Writer Agent appears at the end of this post.)

Running the Improved System Again

I ran the same prompt again:

"Explain how electric cars work."

This time, the output changed dramatically. The explanation now included:

- Why EV torque delivery feels instant (and how gear reduction plays a role)
- How battery thermal management prevents performance degradation
- Why regenerative braking efficiency drops at low speeds
- Trade-offs between battery weight and vehicle range
- Real-world inefficiencies in charging cycles

It even included a comparison:

"While internal combustion engines typically operate at 20–30% efficiency, modern EV drivetrains can exceed 85–90% efficiency under optimal conditions."

That was the moment it clicked. The system wasn't just generating answers anymore. It was reasoning through a pipeline of structured cognition.

What I Learned From This Experiment

Building this system taught me something important: it's not enough to make AI agents exist. You have to design:

- How they think individually
- How they pass information
- And how that information evolves

Most failures in multi-agent systems don't come from the model. They come from:

- Weak interfaces between agents
- Loss of context
- Over-compression of information
- Lack of iterative refinement

The Bigger Insight

A single LLM is like a generalist. But a multi-agent system is only powerful if it behaves like a workflow, not a collection of prompts.

Once I stopped thinking in terms of "agents that answer" and started thinking in terms of "agents that transform information step-by-step", everything changed.

Final Result

After refinement, the system became:

- Structured → Every step has a clear purpose
- Consistent → Outputs follow a predictable pattern
- Reliable → Grounded in tools, with fewer hallucinations

It no longer feels like querying a model. It feels like coordinating a team.

Instead of asking one system to do everything, you distribute responsibility and equip each agent with the tools to do its job properly.

That's where the real power lies.
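For reference, here's a rough sketch of the refined Writer Agent with the depth prompt folded in, plus the updated context line from the orchestrator. The prompt wording is assembled from the changes described above rather than copied from my final code, so treat it as a sketch:

```python
def writer_agent(research_notes):
    # Writer-as-critic: review the research before formatting it (sketch; wording paraphrased).
    prompt = f"""
    You are a writer agent.

    Before writing the final report, identify:
    - Missing explanations
    - Weak reasoning
    - Overly generic statements
    Then refine the structure before writing.

    Create a clear, well-structured report using ONLY the information below:

    {research_notes}

    Use headings, subheadings, and bullet points.
    Do not add new information.
    Ensure clarity and flow.
    """

    return call_llm([{"role": "user", "content": prompt}])


# And inside run_multi_agent_system, the context accumulation line becomes:
# context += "\n[STEP INSIGHT]\n" + result
```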