The Moment One Model Wasn’t Enough
There’s a thought experiment worth starting with. Imagine you’re running a company and you need a comprehensive market analysis report. You could hire one extraordinarily brilliant person and ask them to do everything: design the research strategy, scour dozens of databases, analyze the financials, write the executive summary, proofread every paragraph, and double-check every number. That person would need a superhuman attention span, zero fatigue, and the ability to simultaneously hold an analyst’s eye for data, a writer’s command of prose, and an editor’s ruthless skepticism of their own output.
Or you could hire a team: a strategist who designs the approach, two researchers who divide and conquer the data gathering, an analyst who synthesizes findings, a writer who drafts, and an editor who challenges every claim. The total work output isn’t just the sum of their individual contributions — it’s qualitatively different from what any one person could produce, because the structure enables specialization, parallelism, and oversight that a single mind cannot replicate.
The AI industry hit this same realization in 2023, and the reverberations are still reshaping how every serious AI system is built. A single language model, even a very capable one, struggles to do everything well at once: plan a complex task, execute ten subtasks in parallel, remember what it learned three hours ago, verify its own outputs critically, and maintain a consistent voice in a long document — all simultaneously, in one giant prompt. The architectural answer, which by 2026 has matured from a research curiosity into a production requirement, is the multi-agent system: a network of collaborating AI agents, each with a defined role, communicating with one another, sharing memory, and orchestrated toward a shared goal.
This article builds multi-agent systems from the ground up. We’ll start with the precise failure modes of single-agent systems that forced this architecture into existence, build the conceptual vocabulary, go deep on how agents communicate, how memory is structured, how planners and specialists interact, how the major frameworks (CrewAI, AutoGen, LangGraph) have made different bets on this problem, and what the leading AI companies are actually running in production.
Phase 1: The Problem — Why One Agent Was Never Going to Be Enough
The Context Window Ceiling
Every language model has a finite context window — the total amount of text it can read and reason over at one time. For years, researchers treated this as a temporary engineering limitation, assuming it would simply be scaled away. And context windows did grow dramatically: from a few thousand tokens in early GPT-3-era models to hundreds of thousands, and experimentally into the millions. But the problem didn’t disappear; it transformed.
Even a model with a million-token context window hits a hard wall when a task genuinely requires more information than that window can hold, and many real tasks do. Analyze every customer support ticket from the past year. Audit an entire multi-repository codebase. Research a topic by reading three hundred papers and synthesizing conflicting findings. Compress enough of that into one model’s context and you don’t just hit a size limit; you hit a quality limit. Research on long-context models has documented what practitioners call the “lost in the middle” effect: models attend less reliably to information buried in the center of a very long context, even when it’s technically present and relevant. A context window isn’t a filing cabinet where every item is equally easy to retrieve; it’s more like a cluttered desk, where whatever sits in the middle of the pile gets overlooked.
More fundamentally, there’s a philosophical reason the single-context approach will always struggle with genuinely hard tasks: the same cognitive process that retrieves a fact should probably not be the same process responsible for critically evaluating whether that fact is reliable, because conflicted attention leads to lower-quality output on both fronts. Humans long ago learned to solve this by assigning different people to different cognitive tasks. The same principle applies to AI.
The Quality Problem: Simultaneous vs. Sequential Attention
Ask a single model to both write a piece of analysis and fact-check it, simultaneously in one pass, and you get something weaker than having two different agents do those jobs separately in sequence. Writing and critical reviewing are cognitively opposed orientations — creative construction versus skeptical deconstruction — and a model that’s trying to do both at once tends to produce work that’s too confident where it should be skeptical and too hedged where it should be assertive.
Multi-agent architectures solve this the same way editorial pipelines do: separate the person doing the work from the person critiquing it. One agent generates; a different agent, often with an explicitly different system prompt establishing a critic persona, reviews and challenges the output. The generator doesn’t need to second-guess every sentence it writes; the reviewer doesn’t need to do any generative work at all. Each can excel at its narrower task in a way neither could while trying to do both simultaneously.
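The generate-then-critique split can be sketched in a few lines. This is a minimal illustration, assuming each agent is just a callable; `run_with_review`, `stub_generate`, and `stub_critique` are hypothetical names, and in a real system both roles would be model calls with different system prompts.

```python
from typing import Callable, Optional

# Hypothetical agent signatures: in practice each would wrap a model call
# with its own system prompt (generator persona vs. critic persona).
Generator = Callable[[str, Optional[str]], str]   # (task, feedback) -> draft
Critic = Callable[[str], Optional[str]]           # draft -> feedback, or None if acceptable


def run_with_review(task: str, generate: Generator, critique: Critic,
                    max_rounds: int = 3) -> tuple[str, int]:
    """Alternate generator and critic until the critic accepts or rounds run out."""
    feedback: Optional[str] = None
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        feedback = critique(draft)
        if feedback is None:          # critic found nothing to challenge
            return draft, round_number
    return draft, max_rounds          # best effort after the round budget


# Stub agents so the sketch runs without a model.
def stub_generate(task: str, feedback: Optional[str]) -> str:
    base = f"Draft for: {task}"
    return base + " [with sources]" if feedback else base


def stub_critique(draft: str) -> Optional[str]:
    return None if "[with sources]" in draft else "Claims are unsupported; add sources."


final_draft, rounds_used = run_with_review("market sizing", stub_generate, stub_critique)
```

The round budget matters: without `max_rounds`, a critic that is never satisfied would keep the loop running indefinitely.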
The Parallelism Gap
The deepest structural argument for multi-agent systems is one that single-agent loops simply cannot address at all: parallelism. A single agent is inherently sequential — one thought follows another, one action follows another, each step waiting for the previous one to complete. For a task where ten subtasks are genuinely independent (researching ten different companies, translating a document into ten different languages, running ten different code analyses on different modules of a codebase), waiting for a single agent to work through them one by one is an enormous, unnecessary bottleneck.
A multi-agent system can execute all ten of those subtasks simultaneously, with ten specialized agents working in parallel, their results collected and synthesized by a coordinator. The difference isn’t a modest speed improvement — it can be an order-of-magnitude reduction in total elapsed time. At production scale, where time and cost are both real constraints, this is often the single most compelling practical argument for the architecture.
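A small sketch makes the elapsed-time difference tangible. It assumes the subtasks are I/O-bound (waiting on a model or an API), which is simulated here with `asyncio.sleep`; `research_company` and the 0.05-second delay are stand-ins, not measurements from a real system.

```python
import asyncio
import time


async def research_company(name: str, delay: float = 0.05) -> str:
    """Stand-in for one independent subtask (e.g. a model or API call)."""
    await asyncio.sleep(delay)          # simulates network-bound work
    return f"summary of {name}"


async def run_sequential(names: list[str]) -> list[str]:
    # One subtask at a time: total time is the sum of all delays.
    return [await research_company(n) for n in names]


async def run_parallel(names: list[str]) -> list[str]:
    # gather() runs all coroutines concurrently and preserves input order.
    return list(await asyncio.gather(*(research_company(n) for n in names)))


def timed(coro) -> tuple[list[str], float]:
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


companies = [f"company-{i}" for i in range(10)]
seq_result, seq_seconds = timed(run_sequential(companies))
par_result, par_seconds = timed(run_parallel(companies))
```

Both runs return the same ten summaries; only the wall-clock time differs. The speedup approaches the number of subtasks only when they are truly independent and the provider's rate limits allow that much concurrency.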
The Pre-Agentic Era’s Workarounds (And Why They Failed)
Before multi-agent systems became mainstream, engineers tried two routes around these limitations, and understanding why each fell short directly motivates the architecture that replaced them.
Prompt engineering heroics. If you craft a sufficiently clever, elaborate prompt, can you coax a single model into playing multiple roles sequentially (first a researcher, then an analyst, then an editor) within one conversation? To a degree, yes, and this approach produced genuinely impressive one-off demonstrations. But it doesn’t scale: role-switching within a single prompt is brittle, the roles blur as the context grows, and the model still can’t execute truly parallel work, no matter how cleverly the prompt is structured.
Fine-tuning for specific tasks. Train a dedicated model on a specific task — research synthesis, code review, financial analysis — and it will genuinely perform better on that task than a general-purpose model. But this approach requires expensive training runs for each specialization, can’t dynamically assign the right specialist to an unexpected task type, and still leaves you with a collection of siloed models that don’t communicate or share memory with each other in any structured way.
Both approaches were trying to make a hammer into a Swiss Army knife. Multi-agent systems are the Swiss Army knife — not one blade, but many, each sharp for its purpose, coordinated by a handle that knows when to use which.
Phase 2: Building the Mental Model
Roles, Not Models
The first mental shift required for multi-agent systems is thinking in roles rather than models. Every agent in a well-designed system has a job description: a specific set of responsibilities, a specific set of tools it’s authorized to use, and a specific kind of output it produces. The underlying language model powering that agent is often the same model powering every other agent in the system — or it might be a different, cheaper model for simpler tasks — but the role is what defines the agent’s identity and behavior, not the model choice.
This is important because it decouples the what (the agent’s purpose and responsibilities) from the how (which model, which prompt strategy, which tools are attached). You can swap a faster, cheaper model under a simple routing agent without changing anything else in the system. You can upgrade the model powering your most critical analysis agent independently of everything else. The role-based design is what makes this modularity possible.
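One way to picture the separation of role from model is a small record type in which the model is just one swappable field. This is a sketch under assumptions: `AgentRole` and its fields are hypothetical, and the model identifiers are placeholders rather than real model names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentRole:
    """What the agent is for; the model is just one swappable field."""
    name: str
    responsibilities: tuple[str, ...]
    allowed_tools: frozenset[str]
    output_format: str
    model: str  # hypothetical model identifier


router = AgentRole(
    name="router",
    responsibilities=("classify the request", "pick a specialist"),
    allowed_tools=frozenset(),
    output_format="one specialist name",
    model="large-general-model",      # placeholder name
)

# Swap in a cheaper model; every other part of the role stays identical.
cheap_router = replace(router, model="small-fast-model")
```

Because the role is immutable and the model is a single field, changing the model under one agent cannot accidentally alter its responsibilities or tool grants.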
Planner Agents: The Strategic Layer
Not all agents are created equal in a multi-agent system. Some agents act; others direct. The role most fundamental to how complex multi-agent systems actually function is the planner agent — sometimes called an orchestrator, supervisor, or coordinator depending on the framework — whose job is not to do domain work itself, but to decompose a complex goal into a sequence (or parallel set) of subtasks, determine which specialist agent should handle each subtask, sequence the work correctly (respecting dependencies), and synthesize the results into a coherent final output.
A useful mental model: the planner is a senior project manager who has never written a line of code in their life but is extremely good at understanding what “done” looks like, breaking down the path from here to done, assigning the right person to each piece, and integrating everyone’s contributions into something coherent. The engineers (specialist agents) may be smarter than the PM in their specific domain; the PM’s value comes from seeing the whole picture and keeping everything coordinated.
Specialist Agents: The Execution Layer
Specialist agents have a narrower mandate than planners, but they execute it with more focused competence. A research agent is optimized — via its system prompt, tools, and possibly model selection — to find, retrieve, and summarize information from external sources. A code agent has a code interpreter, a code execution environment, and a system prompt that establishes engineering rigor and correct language conventions. A critic agent has an explicitly adversarial stance toward outputs: its job is to assume a draft is wrong, find the weakest claims, identify what’s missing, and return a structured critique rather than validation.
The cleaner the separation between these roles, the better each agent performs. An agent that’s been given twenty different responsibilities and fifty different tools performs worse than one with five responsibilities and ten tools — not because the model is incapable, but because clarity of purpose is itself a performance variable, both for the model (which has a cleaner reasoning context) and for the system designer (who can test and improve each role independently).
Agent-to-Agent Communication: The Three Patterns
Multi-agent systems communicate in three structurally distinct patterns, and every framework you’ll encounter is making a design choice about which of these it supports and how:
Hierarchical (supervisor-worker). A planner agent dispatches subtasks to worker agents, receives their outputs, and coordinates the next steps — much like a manager and their direct reports. Workers typically don’t communicate with each other directly; everything flows through the supervisor. This pattern is easy to reason about and debug, but becomes a bottleneck if the supervisor has to touch every message.
Sequential pipeline. Agent A produces output and hands it to Agent B, which produces output and hands it to Agent C. The structure is linear: each stage’s output is the next stage’s input. This is natural for workflows with a clear ordering (research → draft → review → edit), but it has no mechanism for parallel work, and a revision to an early stage forces every downstream stage to run again.
Peer-to-peer (collaborative). Agents communicate directly with each other as peers, without a central coordinator — similar to the way an engineering team might use a shared document: multiple people reading and writing the same artifact without everything going through one person. This enables genuine emergent collaboration but is harder to monitor and debug, because the communication graph can be complex and non-linear.
Most production systems are some blend of hierarchical and sequential: a top-level planner that delegates to teams of specialists who work sequentially within their team. Peer-to-peer is still more common in research systems than in production, primarily because the observability and reproducibility requirements of production favor the structured communication of the first two patterns.
Memory: The Four Layers
Memory in a multi-agent system is more nuanced than “remember what was said.” There are four qualitatively different kinds of memory, and each serves a different purpose in the architecture:
In-context (working) memory is the current conversation or task thread in a given agent’s active context window. It’s fast, immediately available, and temporary — it disappears when the session ends. This is what every LLM naturally has.
External short-term memory is a shared, structured store (often a database or document) that multiple agents can read from and write to within a single task execution. It’s how Agent B knows what Agent A learned, even though they’re separate processes running in separate contexts. Think of it as a shared whiteboard that the whole team can see.
External long-term (episodic) memory persists across tasks and across sessions — a log of what this agent (or system) has done before, what worked, what failed, what a given user prefers. This is what allows an agent to learn from experience rather than starting completely fresh on every task.
Semantic memory (a knowledge base) is structured, domain-specific knowledge the system can retrieve via search — similar to a RAG pipeline (covered in depth in our RAG article). This is where proprietary knowledge, domain expertise, and factual reference material live, retrieval-indexed for efficient access.
Production multi-agent systems typically use all four layers, with different agents interacting with different layers depending on their role. A research agent might write findings to external short-term memory; a planner might read previous task performance data from long-term episodic memory when deciding how to decompose a similar task it’s handled before.
Phase 3: Internal Working Deep Dive — What Actually Happens in a Multi-Agent Run
The Full Lifecycle of a Complex Task
Let’s trace exactly what happens from the moment a user submits a genuinely complex request — say, “research our top five competitors and produce a one-page strategic brief for each” — through a well-designed multi-agent system.
Stage 1: Task Intake and Goal Decomposition
The user’s request lands with the planner agent. The planner’s first job isn’t to start doing research; it’s to translate an ambiguous natural-language goal into a structured task graph: a list of subtasks, their dependencies, and which agent role should handle each. For our example, the planner might decompose this into:
- Five parallel research subtasks (one per competitor), each going to a dedicated research agent instance.
- Five parallel first-draft subtasks, each waiting on the corresponding research subtask.
- Five independent review subtasks, each waiting on the corresponding draft.
- One final synthesis task — combining five reviewed briefs into a coherent package — must wait until all five review subtasks are done.
This task graph encodes what must happen, who should do it, and in what order, allowing the execution engine to identify which subtasks can start immediately and which must wait for upstream dependencies. The planner doesn’t write a single word of competitor research; it writes a project plan.
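The task graph for this example can be written down directly. The sketch below uses Python's standard-library `graphlib` to hold the dependencies and to ask which subtasks are ready; the task-naming scheme (`research:A`, `draft:A`, and so on) and the competitor labels are illustrative assumptions.

```python
from graphlib import TopologicalSorter

competitors = ["A", "B", "C", "D", "E"]   # placeholder competitor names

# Map each subtask to the set of subtasks it depends on.
dependencies: dict[str, set[str]] = {}
for c in competitors:
    dependencies[f"research:{c}"] = set()
    dependencies[f"draft:{c}"] = {f"research:{c}"}
    dependencies[f"review:{c}"] = {f"draft:{c}"}
dependencies["synthesis"] = {f"review:{c}" for c in competitors}

sorter = TopologicalSorter(dependencies)
sorter.prepare()                      # also raises CycleError on circular plans

# Everything with no unmet dependencies can start right away.
ready_now = sorted(sorter.get_ready())

# Finishing one research task unblocks only that competitor's draft.
sorter.done("research:C")
unblocked = sorted(sorter.get_ready())
```

Here `ready_now` holds the five research subtasks, and marking one of them done releases exactly one draft. An execution engine would repeat that `get_ready()` / `done()` cycle until the synthesis task has run.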
Stage 2: Delegation and Tool Provisioning
Once the task graph is established, the execution engine dispatches work to the appropriate agents. Each research agent instance is initialized with: a focused system prompt establishing the research persona, the specific competitor it should focus on, the set of tools it’s authorized to use (web search, a vector database of internal market intelligence, a company-data API), and a pointer to the shared external memory where it should write its findings.
This provisioning step is where tool scoping matters: a research agent should have web search; it probably shouldn’t have database write access or code execution. An engineer agent should have a code interpreter; it shouldn’t have access to customer data unless explicitly required. Least-privilege tool assignment isn’t just a security consideration — it also reduces the space of possible actions an agent might take, which reduces the surface area for unexpected behavior.
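Least-privilege tool assignment can be enforced with a simple check at the point where a tool is called. This is a sketch, not a security framework: the tool registry, the role names, and `call_tool` are all hypothetical, and the tools are stubs.

```python
from typing import Callable

# Hypothetical tool registry; real tools would call external services.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"results for {q!r}",
    "run_code": lambda src: f"executed {len(src)} chars",
    "db_write": lambda row: "row written",
}

# Each role is granted only the tools its job needs.
ROLE_TOOLS: dict[str, frozenset[str]] = {
    "researcher": frozenset({"web_search"}),
    "engineer": frozenset({"run_code"}),
}


class ToolNotPermitted(PermissionError):
    pass


def call_tool(role: str, tool: str, argument: str) -> str:
    """Run a tool only if the role's grant includes it."""
    if tool not in ROLE_TOOLS.get(role, frozenset()):
        raise ToolNotPermitted(f"{role!r} may not call {tool!r}")
    return TOOLS[tool](argument)


search_output = call_tool("researcher", "web_search", "competitor pricing")
try:
    call_tool("researcher", "db_write", "{}")
    write_blocked = False
except ToolNotPermitted:
    write_blocked = True
```

Checking the grant in the dispatcher, rather than trusting the prompt to keep an agent in its lane, means a confused or manipulated agent still cannot reach a tool it was never given.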
Stage 3: Parallel Execution and Shared Memory Writes
The five research agents now run in parallel, completely independently. Each one follows its own research loop — formulating search queries, retrieving results, synthesizing findings, iterating — and at the end writes a structured research summary to the shared external memory, tagged with the competitor name and a timestamp.
This shared memory write is the crucial handoff point: it’s how the drafting agents in the next stage will access what the research agents learned without needing to share a context window with them. The research agent’s internal context — the full thread of its search queries, intermediate findings, and reasoning — doesn’t matter to the drafter; only the clean, synthesized output does.
Stage 4: Downstream Execution with Dependency Checks
As each research subtask completes, its corresponding drafting agent becomes unblocked and starts immediately, without waiting for all five research agents to finish. This is a key property of well-designed multi-agent execution: maximum parallelism within dependency constraints. If Competitor 3’s research finishes before Competitors 1, 2, 4, and 5, Competitor 3’s draft can start immediately and possibly finish before some of the other research tasks are even completed.
The drafting agents read from external memory (the research summaries), generate a draft brief, and write their output back to the shared memory. Review agents then pick up the drafts, apply their critic persona — looking for unsupported claims, missing competitive context, weak strategic analysis — and return structured feedback, potentially triggering a revision loop back to the drafter.
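The "start as soon as your own dependency is done" behavior falls out naturally if each competitor is modeled as its own chain. The sketch below simulates this with `asyncio`; the stage functions and all durations are made up for illustration.

```python
import asyncio

events: list[str] = []   # records the order in which stages complete


async def research(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    events.append(f"research:{name}")
    return f"notes on {name}"


async def draft(name: str, notes: str) -> str:
    await asyncio.sleep(0.01)
    events.append(f"draft:{name}")
    return f"brief on {name} from {notes}"


async def brief_for(name: str, research_seconds: float) -> str:
    # Each competitor is its own chain: the draft waits only on its own research.
    notes = await research(name, research_seconds)
    return await draft(name, notes)


async def main() -> list[str]:
    # Competitor 3's research is fast; the others are slow (durations are made up).
    durations = {"1": 0.10, "2": 0.10, "3": 0.01, "4": 0.10, "5": 0.10}
    return list(await asyncio.gather(
        *(brief_for(name, secs) for name, secs in durations.items())
    ))


briefs = asyncio.run(main())
```

In the recorded `events`, Competitor 3's draft completes before the slower research tasks do, which is the property described above: parallelism is limited only by real dependencies, not by stage boundaries.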
Stage 5: Synthesis and Final Output
Once all five reviewed briefs are complete, the synthesis agent (or the planner itself, if the task is small enough) pulls all five from shared memory, applies whatever formatting and coherence-checking the final deliverable requires, and returns the result to the user.
The whole process (five parallel research streams, five parallel drafts, five reviews, one synthesis) can finish in roughly the time a single sequential agent would need to research, draft, and review one competitor, plus the synthesis step, because the five chains run in parallel. The quality ceiling is also higher, because each specialized agent applied focused attention to a narrow task rather than one overloaded agent trying to do everything in sequence.
Agent-to-Agent Communication Protocols
How do agents actually send messages to each other? There are three mechanisms to know: two that dominate production systems today, which sit at different points on the speed-versus-explicitness trade-off, and a third, standardized protocols, that is still emerging.
Shared state + polling is the most common pattern in framework implementations like LangGraph: agents read from and write to a shared state store, and routing logic (conditional edges in LangGraph’s model) determines what the next agent should do based on what’s now in the shared state. Communication isn’t direct — Agent A doesn’t “call” Agent B; Agent A writes to state, and the execution engine determines that B should run next. This pattern is excellent for inspectability (the full shared state is a complete audit log of what every agent wrote) but adds latency proportional to the polling and routing overhead.
Direct messaging/function invocation is more common in frameworks like AutoGen, where one agent can directly call another agent as if calling a function, passing arguments and receiving a return value. This is faster and more intuitive for simple agent-to-agent handoffs, but it can produce complex, hard-to-trace communication graphs in systems with many agents interacting non-linearly.
Standardized protocol messages (A2A) represent where the industry is heading. Google’s Agent2Agent (A2A) protocol, developed in 2025 and now co-governed under the Linux Foundation alongside MCP, defines a standard message format for agents to communicate across framework boundaries — so an agent built with AutoGen can send a well-formed, interoperable task request to an agent built with LangGraph, just as two HTTP services from different frameworks can interoperate because they both speak HTTP. This is in early but real adoption as of mid-2026; the significance is that multi-agent systems are becoming composable across organizational and framework boundaries, not just within one team’s chosen toolset.
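To make the shared-state mechanism concrete, here is a framework-free sketch of the idea: nodes return partial state updates, and a routing function decides what runs next by inspecting the state. It is an illustration of the pattern only, not LangGraph's API; `route`, `run`, and the two node functions are hypothetical.

```python
from typing import Callable, Optional

State = dict[str, object]
Node = Callable[[State], State]          # a node returns a partial state update


def researcher(state: State) -> State:
    return {"notes": f"notes on {state['task']}"}


def writer(state: State) -> State:
    return {"report": f"report based on {state['notes']}"}


NODES: dict[str, Node] = {"researcher": researcher, "writer": writer}


def route(state: State) -> Optional[str]:
    """Routing logic inspects shared state; agents never call each other."""
    if "notes" not in state:
        return "researcher"
    if "report" not in state:
        return "writer"
    return None                           # nothing left to do


def run(task: str) -> tuple[State, list[str]]:
    state: State = {"task": task}
    trace: list[str] = []
    while (next_node := route(state)) is not None:
        state.update(NODES[next_node](state))   # merge the node's update
        trace.append(next_node)
    return state, trace


final_state, trace = run("pricing trends")
```

Neither node references the other: the writer only knows that `notes` exists in state. That indirection is what makes the run easy to inspect, since `trace` and the state together record who wrote what and in which order.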
How Memory Works in Distributed Agent Execution
The memory architecture in a multi-agent system is not simply “give every agent a bigger context window.” It’s a set of deliberately designed storage tiers, each optimized for different access patterns.
The shared external memory store (often a simple key-value store, a relational database, or a structured document store) is where agents deposit and retrieve task-specific information during a single run. It needs to be fast (agents read from it constantly while running, so read latency should be negligible next to a model call) and concurrency-safe (multiple agents writing simultaneously must not be able to corrupt each other’s data). In practice, many teams use Redis for in-flight task state because it’s fast, supports atomic operations, and supports TTL-based expiry so a finished task’s keys are cleaned up automatically.
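The two requirements, safe concurrent writes and automatic expiry, can be illustrated with a tiny in-memory stand-in. This is a sketch of the behavior, not a replacement for Redis: `TaskStore` is a hypothetical class, and it uses a single lock and lazy expiry for simplicity.

```python
import threading
import time
from typing import Optional


class TaskStore:
    """In-memory stand-in for a shared task-state store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, tuple[str, float]] = {}   # key -> (value, expiry time)

    def put(self, key: str, value: str, ttl_seconds: float) -> None:
        with self._lock:                                  # one writer at a time
            self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if time.monotonic() >= expires_at:            # lazily drop expired keys
                del self._data[key]
                return None
            return value


store = TaskStore()


def agent_write(i: int) -> None:
    store.put(f"research:competitor-{i}", f"summary {i}", ttl_seconds=60)


# Twenty "agents" write at once, each under its own key.
threads = [threading.Thread(target=agent_write, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

store.put("scratch", "temporary", ttl_seconds=0.01)
time.sleep(0.05)
expired_value = store.get("scratch")
```

All twenty writes survive because each agent owns its key and the lock serializes access; the short-lived `scratch` entry disappears on its own once its TTL has passed.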
The vector knowledge base is where agents retrieve semantically relevant information that isn’t specific to the current task but is persistently available as background domain knowledge. This is a standard RAG retrieval layer, and its role in a multi-agent system is simply to be one more tool available to whichever agents need it — typically the research and analysis agents.
Long-term episodic memory — the system’s record of what it has done before — is a newer and less standardized area, but one of the most actively researched. The mechanism that works best in practice is storing experience as semantically indexed text (a summary of a past task, what approach was taken, what the outcome was) in a vector store with time-based recency weighting, so the system naturally gravitates toward recent relevant experience rather than arbitrarily distant historical examples. The challenge is preventing this store from growing unboundedly in a way that degrades retrieval quality, which is why most production implementations include a periodic consolidation step — clustering similar episodic memories and reducing them to a smaller set of higher-quality general lessons, similar to how human memory consolidates during sleep.
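Recency weighting can be sketched as a similarity score multiplied by a time decay. The multiplicative form, the 30-day half-life, and the two-dimensional vectors standing in for embeddings are all assumptions chosen for illustration; production systems tune these choices.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    summary: str
    embedding: list[float]
    age_days: float


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(query: list[float], episode: Episode, half_life_days: float = 30.0) -> float:
    """Similarity multiplied by an exponential decay that halves every half-life."""
    decay = 0.5 ** (episode.age_days / half_life_days)
    return cosine(query, episode.embedding) * decay


episodes = [
    Episode("old, near-identical task", [1.0, 0.0], age_days=365),
    Episode("recent, similar task", [0.9, 0.1], age_days=3),
]
query = [1.0, 0.0]
ranked = sorted(episodes, key=lambda e: score(query, e), reverse=True)
```

With these numbers the recent, slightly less similar episode outranks the year-old exact match, which is the intended bias. A longer half-life would shift the balance back toward pure similarity.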
Phase 4: Engineering Implementation — Building a Production Multi-Agent System
Let’s build a minimal but architecturally honest multi-agent system: a competitive research team with a planner, two parallel research agents, and a synthesis agent. We’ll use LangGraph as the execution backbone because it makes the state and routing logic explicit — understanding what’s actually happening is the whole point here.
Defining the Shared State Schema
import operator
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages


class TeamState(TypedDict):
    # The original task, set once and read-only throughout
    task: str
    # Topics assigned by the planner, one per research agent
    assigned_topics: list[str]
    # Research results keyed by topic. The operator.or_ reducer merges the
    # dicts returned by parallel research agents instead of replacing them.
    research_results: Annotated[dict[str, str], operator.or_]
    # Final synthesis
    final_report: str
    # Accumulating messages for the planner's reasoning trace
    messages: Annotated[list, add_messages]
Why the state is a shared ledger, not passed arguments. Every agent in this system reads from and writes to TeamState — it’s the shared whiteboard. No agent needs to know which agent produced a given piece of data, only that the data is there. This decouples agents from each other: you can add a third research agent, change the synthesis logic, or insert a review step between research and synthesis without touching any agent that doesn’t directly participate in that change.
The Planner Agent
import json

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

model = ChatAnthropic(model="claude-sonnet-4-6")

PLANNER_SYSTEM = """You are a research team coordinator. Given a task, break it
into 2-4 focused research subtopics. Return ONLY a JSON array of topic strings,
no preamble. Example: ["Topic A", "Topic B", "Topic C"]"""


def planner_agent(state: TeamState) -> dict:
    """Decompose the task into research topics for downstream agents."""
    response = model.invoke([
        SystemMessage(content=PLANNER_SYSTEM),
        HumanMessage(content=f"Task: {state['task']}"),
    ])
    # Raises json.JSONDecodeError if the model returns anything but JSON
    topics = json.loads(response.content)
    return {
        "assigned_topics": topics,
        "messages": [response],
    }
Why the planner returns structured JSON, not prose. The planner’s output is consumed programmatically by the execution engine — the topics list determines how many research agent invocations get spawned and what each one focuses on. Returning prose requires fragile parsing; returning structured JSON is a direct machine-to-machine interface. Prompting a model to return only valid JSON (with a clear schema and a concrete example) is far more reliable than parsing prose, especially when this output is critical path for the rest of the workflow.
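Because everything downstream depends on this output, it is worth validating it before use rather than trusting the prompt alone. The following is a minimal sketch of such a check; `parse_topics` is a hypothetical helper, and the 2-4 bound simply mirrors the range the planner prompt asks for.

```python
import json

FENCE = "`" * 3   # Markdown code-fence marker, built this way to keep the listing clean


def parse_topics(raw: str, min_topics: int = 2, max_topics: int = 4) -> list[str]:
    """Parse and validate the planner's reply, failing loudly on bad output."""
    text = raw.strip()
    # Models sometimes wrap JSON in a Markdown fence despite instructions.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    topics = json.loads(text)                      # raises json.JSONDecodeError
    if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
        raise ValueError("planner must return a JSON array of strings")
    topics = [t.strip() for t in topics if t.strip()]
    if not min_topics <= len(topics) <= max_topics:
        raise ValueError(f"expected {min_topics}-{max_topics} topics, got {len(topics)}")
    return topics


plain = parse_topics('["Pricing", "Hiring"]')
fenced = parse_topics(FENCE + 'json\n["A", "B", "C"]\n' + FENCE)
```

Failing with a clear error at this boundary is usually preferable to letting a malformed plan spawn the wrong number of research agents; a caller can catch the error and re-prompt the planner.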
Research Agents: Dynamic Invocation
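RESEARCHER_SYSTEM = """You are a focused research analyst. Research the given
topic thoroughly using your available tools. Return a structured 3-5 paragraph
summary of the most important findings. Be specific and cite sources."""

# Assumption: model_with_tools is not defined in this listing. It is taken to
# be the chat model with this agent's research tools bound (for example via
# model.bind_tools with your own tool list). A single invoke() returns the
# model's reply; executing any requested tool calls in a loop is omitted here.


def make_researcher(topic: str):
    """Factory that creates a research agent closure for a specific topic."""

    def research_agent(state: TeamState) -> dict:
        response = model_with_tools.invoke([
            SystemMessage(content=RESEARCHER_SYSTEM),
            HumanMessage(content=(
                f"Research this topic thoroughly: {topic}\n"
                f"Overall task context: {state['task']}"
            )),
        ])
        # Each agent writes under its own topic key
        return {
            "research_results": {topic: response.content},
            "messages": [response],
        }

    # Give each closure a distinct, readable name to use as its node identifier
    research_agent.__name__ = f"research_{topic.lower().replace(' ', '_')}"
    return research_agent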
Why a factory function rather than a single reusable research node. LangGraph (and most graph-based frameworks) registers nodes by name: each node is a callable with a unique string identifier. If we want four research agents running in parallel, one simple approach is four distinct node objects, each bound to its specific topic, and the factory generates as many uniquely named research agents as there are topics rather than hard-coding them one by one. One caveat: nodes have to be registered before the graph is compiled, so this pattern fits cases where the topic list is known at graph construction time. When the planner decides the number of topics at run time, LangGraph’s Send API, returned from a conditional edge, is the usual way to fan out to a single research node once per topic.
Synthesis Agent and Final Assembly
SYNTHESIZER_SYSTEM = """You are a senior analyst synthesizing research from
multiple specialists. Produce a coherent, well-structured report. Integrate
findings across topics, identify cross-cutting themes, and highlight any
tensions or gaps in the research."""


def synthesis_agent(state: TeamState) -> dict:
    # Build one document with a section per topic from the shared state
    all_research = "\n\n".join(
        f"## {topic}\n{findings}"
        for topic, findings in state["research_results"].items()
    )
    response = model.invoke([
        SystemMessage(content=SYNTHESIZER_SYSTEM),
        HumanMessage(content=(
            f"Task: {state['task']}\n\n"
            f"Research from specialist agents:\n{all_research}"
        )),
    ])
    return {"final_report": response.content, "messages": [response]}
Why the synthesizer gets all research concatenated rather than seeing the full message history. The synthesizer only needs the research outputs, not the full internal reasoning trace each research agent went through to produce them. Passing the full history would bloat the synthesizer’s context with irrelevant intermediate steps, reducing the quality of its synthesis and increasing cost. Extracting only the final structured outputs from shared state, then passing just those to the next agent, is the multi-agent equivalent of a clean API interface: each stage gets exactly the information it needs, nothing more.
Common Production Mistakes
Underspecified agent roles. The biggest quality lever in a multi-agent system is the specificity and clarity of each agent’s system prompt. Vague role descriptions such as “you are a helpful research assistant” produce agents that behave inconsistently and overlap poorly with their teammates. Role prompts should define what the agent is responsible for, what it is explicitly NOT responsible for (to prevent agents doing each other’s jobs), what format its output should be in, and what “done well” looks like. Three hours spent crafting role prompts will usually improve end-to-end quality more than the same three hours spent tuning routing logic.
Unscoped tools. Giving every agent access to every tool produces worse results than carefully scoping each agent’s tool set. Models with fewer, more relevant tools make better tool-selection decisions. More practically, a research agent with write access to production databases is a security incident waiting to happen.
No guardrails on parallelism. Spawning hundreds of parallel agent invocations simultaneously can exhaust API rate limits, overwhelm downstream services, and produce costs that spiral quickly at scale. Production systems need explicit concurrency limits — a maximum of N agents running in parallel at any time — with work queued and dispatched as running agents complete. Most teams implement this as a simple semaphore around agent invocations.
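The semaphore approach mentioned above looks roughly like this with `asyncio`. It is a sketch with a simulated agent call; `invoke_agent`, the limit of 3, and the counters used to observe concurrency are illustrative.

```python
import asyncio

MAX_CONCURRENT = 3          # illustrative limit; tune to your rate limits
running = 0
peak_running = 0


async def invoke_agent(task_id: int, limiter: asyncio.Semaphore) -> str:
    """Stand-in for one agent invocation, gated by the shared semaphore."""
    global running, peak_running
    async with limiter:                  # waits here when the limit is reached
        running += 1
        peak_running = max(peak_running, running)
        await asyncio.sleep(0.01)        # simulates the model call
        running -= 1
    return f"result-{task_id}"


async def run_all(n_tasks: int) -> list[str]:
    limiter = asyncio.Semaphore(MAX_CONCURRENT)   # created inside the running loop
    return list(await asyncio.gather(
        *(invoke_agent(i, limiter) for i in range(n_tasks))
    ))


results = asyncio.run(run_all(20))
```

All twenty tasks are submitted at once, but `peak_running` never exceeds the limit: the semaphore acts as the queue, releasing a waiting invocation each time a running one completes.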
Trust boundaries without access control. In a multi-agent system, one agent calling another means one piece of code with certain permissions is spawning another piece of code that may request different (or broader) permissions. Without an explicit model of which agents can request which other agents to do what, a compromised or misbehaving agent can cascade requests through the system with escalating capabilities. Explicit capability declarations per agent role, and validation that requested actions fall within those declarations, are not optional in a production deployment that touches sensitive data or external systems.
No coordination for shared memory writes. If two research agents finish at the same time and both write to the same key in the shared state, a naive implementation has a race condition: one write silently overwrites the other. (LangGraph itself rejects the update with an error when two nodes write to the same key in one step and that key has no reducer.) The dictionary reducer pattern, where each agent writes to a unique key in a dict and a reducer merges the updates rather than overwriting a shared value, is the standard way to avoid this in LangGraph, but it has to be designed in from the start: bolting on atomicity guarantees after the fact is much more painful than getting the state schema right from day one.
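The difference between overwriting and reducing is easy to see in plain Python. This sketch imitates what a state reducer does, without using any framework; `merge_dicts` and the two sample updates are hypothetical.

```python
from functools import reduce

# Two research agents finish in the same step and each returns a partial update.
update_a = {"research_results": {"Pricing": "findings A"}}
update_b = {"research_results": {"Hiring": "findings B"}}

# Naive "last write wins": the second update replaces the first wholesale.
naive_state: dict = {}
for update in (update_a, update_b):
    naive_state.update(update)


# Reducer: fold updates for the same key with a merge function instead.
def merge_dicts(left: dict, right: dict) -> dict:
    return left | right                   # dict union, Python 3.9+


merged_results = reduce(
    merge_dicts,
    (u["research_results"] for u in (update_a, update_b)),
    {},
)
```

After the naive loop only the second agent's findings remain, while the reduced version keeps both. In a LangGraph state schema the same idea is expressed by annotating the field with the merge function, in the same way the `messages` field is annotated with `add_messages`.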
Phase 5: Real-World Systems — Multi-Agent AI at Production Scale
OpenAI: The Deep Research Agent
OpenAI’s Deep Research feature, released in 2025, takes a complex research question, runs many multi-step web search and reading loops, and produces a long-form, cited report of a kind that a single prompt-and-response pass could not produce reliably. OpenAI has described it as powered by a reasoning model trained for browsing and analysis, and has not published a detailed account of its internals, so the mapping here is conceptual rather than confirmed: the workflow follows the same plan, research, and synthesize shape as the system we built above, at dramatically larger scale. Whatever the internal structure, Deep Research signaled in a consumer product that single-shot, single-context generation wasn’t the right fit for knowledge-intensive tasks.
Google: AlphaCode 2 and Code Generation Pipelines
Google DeepMind’s code generation research (continuing the AlphaCode line) uses a generate-and-filter pipeline that maps naturally onto multi-agent ideas: a family of policy models independently samples a large number of candidate solutions to a programming problem, the candidates are filtered by running them against test cases, the survivors are clustered by behavior, and a separate scoring model selects the strongest candidates to submit. The structure enables diversity in the generation stage, because different models explore different solution approaches rather than converging prematurely on one strategy, and that diversity is precisely what improves performance on problems where the optimal solution isn’t obvious from the problem statement alone.
Microsoft: Magentic-One and Enterprise Copilot
Microsoft’s Magentic-One research system is a hierarchical multi-agent architecture with a generalist orchestrator coordinating a team of specialized agents: a WebSurfer for browser-based research, a FileSurfer for document retrieval, a Coder for writing and executing code, and a ComputerTerminal for executing shell commands. The orchestrator dynamically assigns tasks to these specialists based on what the current problem requires, much like a project manager with a versatile team. Microsoft has applied similar architectures within its enterprise Copilot products, where different specialized agents handle document retrieval, data analysis, email drafting, and calendar management — coordinated by an orchestration layer that routes user intent to the right specialist.
Vertical Applications: Legal, Medical, and Financial
Some of the most significant real-world multi-agent deployments are in domains where getting it wrong has serious consequences, and where the depth and accuracy of research genuinely demands parallel, specialized, cross-verified work. Legal AI companies have deployed multi-agent pipelines that separate document retrieval (finding relevant case law), legal analysis (interpreting how that law applies to a given situation), and risk assessment (quantifying exposure and likelihood) into separate, specialized agents, with explicit critic passes between drafting stages. Medical research applications use multi-agent designs where one agent focuses on primary literature, another on clinical trial data, and a third on drug interaction databases — then a synthesis agent produces a clinical summary with conflicts clearly flagged. These aren’t just convenience features; in these domains, the multi-agent structure’s ability to separate concerns and build in critical review is what enables the output to be trusted at all.
Phase 6: Framework Deep Dive — CrewAI, AutoGen, and LangGraph
Three Different Bets on the Same Problem
By 2026, three frameworks dominate the production multi-agent landscape, and they each represent meaningfully different architectural philosophies. Understanding the philosophy behind each one — not just the API — is what allows you to make an informed choice rather than defaulting to whatever has the most GitHub stars.
CrewAI: Role-First, Developer-Friendly
CrewAI’s central metaphor is exactly what it sounds like: a crew of agents, each with a named role, a goal, a backstory, and a set of tools, assembled to work on a task. The framework makes the role-based approach its primary, first-class abstraction — you think first about “what roles do I need,” then you configure agents to fill those roles, then you define a process (sequential, hierarchical) for how those agents collaborate.
from crewai import Agent, Task, Crew, Process

# web_search_tool and document_retriever are assumed to be tool instances defined elsewhere.

researcher = Agent(
    role="Market Research Analyst",
    goal="Find accurate, current market intelligence on the target company",
    backstory="You are a veteran analyst with deep expertise in competitive intelligence",
    tools=[web_search_tool, document_retriever],
    verbose=True,
)

writer = Agent(
    role="Strategic Report Writer",
    goal="Transform research into a compelling, executive-ready strategic brief",
    backstory="You specialize in translating technical findings into business language",
    tools=[],  # No tools: writing is pure generation
)

research_task = Task(
    # {company} is filled in from the inputs passed to kickoff()
    description="Research {company} across financial performance, product roadmap, and market position",
    expected_output="A structured 400-600 word research summary with source citations",
    agent=researcher,
)

brief_task = Task(
    description="Write a one-page strategic brief based on the research",
    expected_output="Executive-ready strategic brief in standard format",
    agent=writer,
    context=[research_task],  # Brief waits for research to complete
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, brief_task],
    process=Process.sequential,  # Tasks run one after another, in list order
)

result = crew.kickoff(inputs={"company": "Salesforce"})
CrewAI’s strengths. The role-based API is genuinely intuitive for teams that think in terms of “who does what” rather than “how does data flow.” Getting a working prototype up and running takes minutes, not days. The framework handles task dependency resolution, context passing between tasks, and basic retry behavior without requiring explicit graph construction. For teams that need a multi-agent system now and can live with some inflexibility in how they structure the execution logic, CrewAI is often the fastest path to something running.
CrewAI’s limitations. The abstraction that makes CrewAI easy to use also makes it harder to customize deeply. The execution engine (especially the sequential and hierarchical process modes) isn’t as configurable as LangGraph’s graph-based structure, which means unusual workflows — non-linear branching, complex retry strategies, fine-grained human-in-the-loop gates — require more effort to express. Debugging is also more challenging because the internal execution flow is less transparent than an explicit graph with named nodes and inspectable checkpoints.
AutoGen: Conversational, Event-Driven, Flexible
AutoGen, developed by Microsoft Research and heavily redesigned in its v0.4 release (previewed in late 2024 and released as stable in early 2025), takes a fundamentally different metaphor from CrewAI: instead of roles in a crew, agents in AutoGen are conversational participants that communicate by passing messages back and forth, with the conversation history itself serving as the primary shared memory. Coordination emerges from the conversation rather than from an explicit task graph.
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat

# model_client and data_query_tool are assumed to be defined elsewhere
# (a chat-completion model client and a tool function, respectively).

analyst = AssistantAgent(
    name="Analyst",
    model_client=model_client,
    system_message="You analyze data and identify key trends. Be precise.",
    tools=[data_query_tool],
)

critic = AssistantAgent(
    name="Critic",
    model_client=model_client,
    system_message=(
        "You challenge claims made by the Analyst. "
        "Identify unsupported assumptions, weak reasoning, or missing evidence."
    ),
)


async def main():
    # Agents take turns in a fixed order; the run stops after six turns in total.
    team = RoundRobinGroupChat([analyst, critic], max_turns=6)
    return await team.run(task="Analyze Q2 revenue trends for the APAC region")


result = asyncio.run(main())
AutoGen’s strengths. The conversational model is flexible and mirrors how humans naturally think about agent collaboration: agents talking to each other. The v0.4 architectural redesign makes AutoGen far more composable and production-ready than the original version, with a cleaner separation between the core framework and the specific agent and team implementations, which makes it easier to extend and test. The framework is also well suited to tasks where the “right” structure of the multi-agent interaction isn’t fully known in advance and can productively emerge from the conversation itself.
AutoGen’s limitations. The conversational model that gives AutoGen its flexibility also makes it harder to reason about execution paths: a conversation-based system can go in many directions, and debugging a failure in a long-running multi-agent conversation is substantially harder than debugging a named-step failure in a LangGraph execution. Checkpointing and state persistence are also less central to AutoGen’s design than LangGraph’s first-class checkpointer concept, which makes long-running task recovery more complex to implement correctly.
LangGraph: Explicit, Inspectable, Production-Grade
LangGraph, covered in depth in our dedicated article, sits at the opposite end of the explicitness spectrum from CrewAI. Rather than hiding the execution model behind a high-level role or conversation abstraction, LangGraph puts the graph structure front and center: you define your state schema, you define your nodes, you define your edges and routing logic, and the execution model is exactly what the graph diagram shows, with no hidden machinery.
For multi-agent systems specifically, LangGraph enables the subgraph composition pattern: an entire agent (with its own state, nodes, edges, and checkpointer) can be nested as a single node inside a larger parent graph. This makes it possible to build genuinely large, team-structured multi-agent systems where one subgraph is a research team, another is an analysis team, and the parent graph is the coordinator that routes between them — all sharing the same execution model, all checkpointed the same way, all observable through the same tooling.
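The nesting idea itself can be shown without any framework. The sketch below is deliberately not LangGraph’s API: `make_graph` is a hypothetical helper that composes nodes into something callable like a node, which is the property that lets a whole team sit inside a parent graph. It leaves out checkpointing, routing, and per-subgraph state schemas.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def make_graph(steps: List[Node]) -> Node:
    """Compose nodes into a linear 'graph' that is itself callable like a node."""
    def run(state: State) -> State:
        for step in steps:
            # Each node returns an update that is merged into the running state.
            state = {**state, **step(state)}
        return state
    return run


def search(state: State) -> State:
    return {"sources": [f"source about {state['topic']}"]}


def summarize(state: State) -> State:
    return {"summary": f"{len(state['sources'])} source(s) reviewed"}


def write_report(state: State) -> State:
    return {"report": f"Report on {state['topic']}: {state['summary']}"}


# The research team is a complete graph in its own right...
research_team = make_graph([search, summarize])
# ...and is nested as a single node inside the parent coordinator graph.
coordinator = make_graph([research_team, write_report])

result = coordinator({"topic": "APAC revenue"})
print(result["report"])  # Report on APAC revenue: 1 source(s) reviewed
```

The parent graph never needs to know how many steps the research team contains, which is what keeps large team-structured systems manageable.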
When to choose which framework? CrewAI for prototyping and role-heavy workflows, where getting something running quickly is the priority. AutoGen for conversational multi-agent patterns and Microsoft ecosystem integrations. LangGraph for production systems where you need full control over execution flow, durable state, fine-grained human-in-the-loop gating, and the ability to debug and replay any execution from its full checkpoint history. Most teams that ship real production multi-agent systems at scale end up on LangGraph or a custom implementation built on similar primitives — not because CrewAI and AutoGen aren’t good, but because the transparency and recoverability requirements of production eventually demand the control that LangGraph’s explicit graph model provides.
Phase 7: Advantages, Limitations, and Trade-offs
Advantages — And Their Real-World Conditions
Parallelism that actually changes what’s achievable. This isn’t a marginal improvement. A research task that would take a single agent forty-five minutes sequentially can be completed in under ten minutes when five specialist research agents run simultaneously. At production scale — handling hundreds or thousands of concurrent tasks — this parallelism is what makes multi-agent systems economically viable for complex tasks, since total elapsed time directly affects infrastructure cost and user experience.
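The effect is easy to reproduce at small scale with `asyncio`. In this sketch, `research_agent` is a hypothetical placeholder whose sleep stands in for model and tool latency; the durations are arbitrary and only the ratio between the two timings matters.

```python
import asyncio
import time


async def research_agent(topic: str, seconds: float) -> str:
    """Hypothetical agent: the sleep stands in for model and tool latency."""
    await asyncio.sleep(seconds)
    return f"findings on {topic}"


async def run_sequential(topics, seconds):
    # Each agent starts only after the previous one has finished.
    return [await research_agent(t, seconds) for t in topics]


async def run_parallel(topics, seconds):
    # All agents start together; total time is roughly that of the slowest one.
    return await asyncio.gather(*(research_agent(t, seconds) for t in topics))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


topics = ["financials", "products", "competitors", "hiring", "press"]
seq_results, seq_time = timed(run_sequential(topics, 0.1))
par_results, par_time = timed(run_parallel(topics, 0.1))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

With five equally slow agents, the parallel run takes roughly one fifth of the sequential time. Real systems see a smaller gain, because agents finish at different times and the synthesis step still has to wait for the slowest one.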
Quality through specialization and critique. This matters most for tasks where the difference between “good enough” and “excellent” is meaningful — professional reports, production code, medical summaries, legal analysis. A specialist agent with a narrowly focused system prompt and the right tools genuinely outperforms a generalist agent trying to do the same task as one of many simultaneous concerns.
Fault isolation. When one agent fails in a well-designed multi-agent system, it doesn’t necessarily bring down the whole task. A research agent that hits a rate limit or a tool error can be retried independently while other research agents continue their work. This isolation is structurally much harder to achieve in a single monolithic agent.
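A minimal sketch of that isolation, assuming each research agent is an independent coroutine: the `RateLimitError` type, the simulated failure, and the backoff values are all hypothetical.

```python
import asyncio


class RateLimitError(Exception):
    """Hypothetical transient failure raised by an agent's tool."""


attempts = {}


async def research_agent(topic: str) -> str:
    attempts[topic] = attempts.get(topic, 0) + 1
    # Simulate one agent hitting a rate limit on its first attempt only.
    if topic == "competitors" and attempts[topic] == 1:
        raise RateLimitError(topic)
    return f"findings on {topic}"


async def with_retry(topic: str, max_attempts: int = 3) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return await research_agent(topic)
        except RateLimitError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(0.01 * attempt)  # short backoff before retrying


async def main(topics):
    # return_exceptions=True returns a permanent failure as a value in the
    # results list instead of raising it and discarding the other results.
    return await asyncio.gather(*(with_retry(t) for t in topics), return_exceptions=True)


results = asyncio.run(main(["financials", "competitors", "press"]))
print(results)
```

Only the failing agent is retried; the other two complete on their first attempt and never see the error.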
Limitations That Need Honest Treatment
Latency overhead. Every agent-to-agent handoff introduces latency: a message must be serialized, written to shared state, read by the next agent, and incorporated into a new context. For tasks that are inherently sequential and don’t benefit from parallelism, a single well-prompted agent will be faster and cheaper than a multi-agent system that adds coordination overhead with no throughput benefit.
Cost amplification. Each agent invocation is a model call with a cost. A five-agent system that runs five parallel agents and then a synthesis agent is spending roughly six model calls where a single-agent system would spend one. For simple tasks, this cost multiple is hard to justify. Multi-agent systems earn their cost at the task complexity level where six focused calls produce dramatically better results than one unfocused one — but that threshold is higher than most initial implementations assume.
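The multiple is simple to estimate before building anything. The prices and token counts below are placeholders rather than real figures for any provider, and the sketch assumes every call uses the same average token counts, which understates the cost of agents that run multi-step tool loops.

```python
def run_cost(calls: int, tokens_in: int, tokens_out: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimated cost of `calls` model calls with the same average token counts."""
    per_call = tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k
    return calls * per_call


# Placeholder numbers: 4,000 input tokens and 1,000 output tokens per call.
single_agent = run_cost(1, 4000, 1000, price_in_per_1k=0.01, price_out_per_1k=0.03)
# Five parallel research agents plus one synthesis agent.
multi_agent = run_cost(6, 4000, 1000, price_in_per_1k=0.01, price_out_per_1k=0.03)

print(f"single: ${single_agent:.2f}, multi: ${multi_agent:.2f}")
```

Under these assumptions the multi-agent run costs six times as much per task, so the quality gain has to be worth that multiple for the architecture to pay off.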
Emergent, hard-to-predict behavior. A single agent is comparatively predictable: for a fixed prompt and a low sampling temperature, its outputs vary within a narrow range. A system of multiple agents communicating through shared state can exhibit emergent behavior that’s genuinely difficult to predict or reproduce, especially with conversational patterns like AutoGen’s, where the exact sequence of agent messages depends on each model’s response to the previous one. This makes testing multi-agent systems substantially harder than testing single-agent systems: unit tests on individual agents are necessary but not sufficient, and production behavior can differ from test behavior in ways that are hard to anticipate.
Debugging complexity. When something goes wrong in a multi-agent system, determining which agent made the error, why it made it, and how that error propagated through subsequent agents requires either a robust observability setup (LangSmith, full checkpoint inspection) or significant time spent reconstructing the execution history from logs. Teams that skip the observability investment early invariably regret it when the first production incident hits.
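Even without a dedicated platform, a thin tracing layer answers the first debugging question: which agent failed, and with what input. The decorator below is a hypothetical sketch that records to an in-memory list; a real system would ship the same records to a tracing backend.

```python
import time
from functools import wraps

TRACE = []  # in-memory trace; stands in for a real tracing backend


def traced(agent_name: str):
    """Record one trace entry per agent call, including failures."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(state: dict) -> dict:
            entry = {"agent": agent_name, "input_keys": sorted(state), "error": None}
            start = time.perf_counter()
            try:
                return fn(state)
            except Exception as exc:
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call is recorded.
                entry["seconds"] = time.perf_counter() - start
                TRACE.append(entry)
        return wrapper
    return decorator


@traced("researcher")
def researcher(state: dict) -> dict:
    return {**state, "findings": "revenue up"}


@traced("writer")
def writer(state: dict) -> dict:
    # Fails loudly if an upstream agent did not supply its output.
    return {**state, "draft": f"Summary: {state['findings']}"}


try:
    writer({"topic": "APAC"})  # researcher was skipped, so this call fails
except KeyError:
    pass

print([(e["agent"], e["input_keys"], e["error"]) for e in TRACE])
```

The recorded input keys show that the writer ran without `findings`, which points at the missing upstream step rather than at the writer itself.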
Phase 8: Career Impact & Future
Where the Hiring Demand Is Actually Concentrated
Multi-agent systems engineering is currently one of the highest-leverage skill clusters in AI engineering, precisely because it sits at the intersection of several traditionally separate disciplines: distributed systems thinking (parallelism, state management, fault tolerance), applied LLM engineering (prompt design, tool use, role specification), and product intuition (knowing which tasks are complex enough to warrant the architecture’s overhead). Engineers who can reason clearly across all three of these are rare, and the demand for them is visible across nearly every serious AI product company.
The specific roles where this shows up most concretely are AI Engineer, Agent Systems Engineer, AI Platform Engineer (building the orchestration infrastructure that product teams build on), and ML Infrastructure Engineer — as well as backend engineers at companies where “build a reliable, observable multi-agent workflow” has become a routine product requirement rather than a research project.
What to Learn Next
The most effective learning path runs in this sequence: build a working single-agent system in LangGraph first (grounding yourself in state, nodes, edges, checkpointing), then extend it to two collaborating agents (getting experience with shared state and delegation patterns), then try one of the higher-level frameworks (CrewAI or AutoGen) to understand the trade-offs firsthand rather than abstractly. Pair this with the LangSmith observability platform — being able to trace and debug multi-agent runs is as important as being able to build them. The A2A protocol and its emerging ecosystem of interoperable agents is worth watching closely; the ability to compose agents across organizational and framework boundaries will likely be the next structural shift in how production multi-agent systems are built.
The Real Argument for Multiple Minds
There’s a deeper reason multi-agent systems work beyond the technical arguments about context windows and parallelism. A single agent, no matter how capable, has a coherence problem: it needs to simultaneously be the person generating an idea and the person evaluating whether that idea is good, the entity taking an action and the entity deciding whether that action was wise. These aren’t just competing cognitive tasks; they’re opposing epistemic stances, and collapsing them into one model in one context predictably produces output that’s less critically examined and less reliable than it could be.
Multi-agent systems restore the epistemic separation that makes human intellectual work reliable: the person who writes the first draft is not the person who critiques it. The person who designs the experiment is not the person who validates the results. The planner who sets the task isn’t the researcher who executes it. This isn’t just an organizational nicety — it’s an epistemic necessity for work that needs to be trustworthy.
The frameworks we’ve covered, the architectures we’ve analyzed, and the production systems we’ve traced are all ultimately engineering implementations of that one insight. As AI systems take on tasks with higher stakes — decisions that affect money, health, code that ships to millions of users — the industry’s move toward multi-agent architectures that build in critique, separation of concerns, and human oversight isn’t a trend. It’s an acknowledgment that trustworthy AI systems are built more like well-functioning teams than like extremely smart individuals, and that the architecture should reflect that from the beginning.




