Your Slack lights up at 9:14 AM: Conversion in Germany dropped 8% last week. What happened?
You know the drill. Pull up the experiment tracker. Cross-reference the incident log. Scan three weeks of release notes. Check whether a competitor launched something. Each source lives in a different tool, requires different context, and takes 30-90 minutes of focused attention. By the time you have a working hypothesis, half the day is gone.
Now picture a different workflow. You type one question into a system that decomposes it into four parallel research threads, dispatches a specialized subagent to each, waits for all four to report back, and synthesizes a weighted hypothesis brief with confidence levels. Total elapsed time: roughly 20 minutes in well-tuned implementations.
This is the orchestrator-subagent pattern for knowledge work. Anthropic's multi-agent research found that multi-agent systems can outperform single-agent approaches by roughly 90% on specific complex research task types[1], while cutting elapsed time substantially — though actual gains depend heavily on how well the decomposition matches the problem structure. Those improvements stop being abstract the first time you watch four subagents work a problem simultaneously.
The Mental Model: Orchestrator, Subagents, Synthesis
Understanding the three-phase architecture behind parallel research machines
Before diving into implementation, you need a clear mental model. The orchestrator-subagent pattern has three distinct phases, and confusing them is where most teams stumble.
Phase 1: Decomposition. The orchestrator receives a question and breaks it into independent research threads. The key word is independent — if thread B depends on the output of thread A, they cannot run in parallel. Good decomposition produces threads that can execute simultaneously without coordination.
Phase 2: Parallel execution. Each subagent runs its assigned thread using specialized tools and domain knowledge. Subagents operate in isolation. They do not communicate with each other during execution. This constraint is a feature, not a limitation — it eliminates coordination overhead and keeps each agent's context window clean.[2]
Phase 3: Synthesis. The orchestrator collects all subagent reports and produces a unified brief. This is where confidence levels, weighted hypotheses, and cross-thread patterns emerge. Synthesis is the phase that transforms raw findings into decisions.
For a deeper dive into how orchestrator-subagent patterns are evaluated, Microsoft's multi-agent orchestration guidance and Anthropic's multi-agent research are worth reading alongside this article.
Decomposition Design: Breaking Questions Into Parallel Threads
How to split a research question so subagents can work simultaneously
Decomposition is where the orchestrator earns its keep. A poorly decomposed question creates subagents that duplicate effort or miss critical angles. A well-decomposed question produces a brief that no single analyst could assemble in the same timeframe.
Consider the original question: Why did conversion drop 8% in Germany? A skilled orchestrator decomposes this into four independent threads, each targeting a different causal category.
| Subagent | Thread Focus | Data Sources | Output Format |
|---|---|---|---|
| Experiment Tracker | Active and recently concluded A/B tests affecting the DE funnel | Experiment platform API, feature flag system | List of experiments with traffic allocation and measured impact |
| Incident Log Analyst | Production incidents, latency spikes, or payment errors in DE region | PagerDuty, Datadog, payment gateway logs | Timeline of incidents with duration and affected user count |
| Release Notes Parser | Code deployments that touched checkout, pricing, or localization | GitHub releases, deploy logs, changelog | Annotated list of relevant releases with change descriptions |
| Competitive Monitor | Competitor launches, pricing changes, or campaigns in the DE market | Web search, press releases, app store updates | Summary of competitive events with estimated timing and relevance |
Notice how each subagent receives four elements: an objective (what to find), an output format (how to structure findings), tool guidance (where to look), and clear boundaries (what falls outside its scope). This specificity is not optional. Vague task descriptions produce vague results and waste tokens on irrelevant exploration.
The decomposition prompt itself follows a predictable structure. Here is a simplified version of what the orchestrator generates internally.
orchestrator-decomposition.ts

interface SubagentTask {
id: string;
objective: string;
outputFormat: string;
tools: string[];
boundaries: string;
timeoutMs: number;
fallbackStrategy: "skip" | "retry-once" | "use-cached";
}
function decomposeQuestion(question: string): SubagentTask[] {
// The orchestrator LLM generates these from the question
return [
{
id: "experiment-tracker",
objective: `Find all A/B tests that ran in DE market during the
affected period. Report traffic allocation, variant performance,
and whether any test concluded with a winner deployed.`,
outputFormat: "JSON array of { testName, status, deImpact, confidence }",
tools: ["experiment-api", "feature-flags"],
boundaries: "Only tests affecting DE users. Ignore global tests with <1% DE traffic.",
timeoutMs: 120_000,
fallbackStrategy: "retry-once"
},
{
id: "incident-log",
objective: `Identify production incidents in EU-west region during
the affected period. Focus on checkout, payment, and page-load
degradation.`,
outputFormat: "JSON array of { incident, severity, startTime, duration, usersAffected }",
tools: ["pagerduty-api", "datadog-api"],
boundaries: "EU-west region only. Ignore incidents resolved in under 2 minutes.",
timeoutMs: 90_000,
fallbackStrategy: "use-cached"
},
// ... release-notes-parser, competitive-monitor
];
}

Parallel Execution: Running Subagents Without Collision
Practical patterns for launching and managing concurrent research threads
Once the orchestrator has its task list, execution is straightforward but requires discipline around three concerns: isolation, resource limits, and progress tracking.
Isolation means each subagent gets its own context window and tool sessions. Subagents never share state during execution. If subagent A discovers something relevant to subagent B, that connection gets made during synthesis, not mid-flight. Shared state introduces race conditions and debugging nightmares.[4]
Resource limits prevent any single subagent from consuming disproportionate compute. Set token budgets and wall-clock timeouts for each thread. Simple fact-finding gets 3-10 tool calls per subagent; deep analysis might allow 30+. The orchestrator decides the budget at decomposition time.
Progress tracking lets the orchestrator know which subagents have finished, which are still running, and which have failed. The simplest implementation uses Promise.allSettled() in JavaScript or equivalent patterns in other languages, which waits for every promise to complete regardless of individual success or failure.
1. Create isolated subagent instances with task-specific prompts

const subagentPromises = tasks.map(task =>
  spawnSubagent({
    systemPrompt: buildSubagentPrompt(task),
    tools: task.tools,
    tokenBudget: 8_000,
    timeout: task.timeoutMs,
  })
);

2. Execute all subagents concurrently and collect results

const results = await Promise.allSettled(subagentPromises);
// results: Array<{status: 'fulfilled', value} | {status: 'rejected', reason}>

3. Classify outcomes and apply fallback strategies

const classified = results.map((result, i) => ({
  taskId: tasks[i].id,
  status: result.status,
  data: result.status === 'fulfilled' ? result.value : null,
  error: result.status === 'rejected' ? result.reason : null,
  fallback: tasks[i].fallbackStrategy,
}));
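The classification step above leaves one thing implicit: actually applying each task's fallbackStrategy to the rejected entries. Here is one way that step might look — a sketch, not a prescribed implementation; the retry hook and cache parameter stand in for whatever execution and caching layers you already have.

```typescript
type Fallback = "skip" | "retry-once" | "use-cached";

interface ClassifiedResult {
  taskId: string;
  status: "fulfilled" | "rejected";
  data: unknown;
  fallback: Fallback;
}

// Apply a task's fallback strategy to a rejected result. Fulfilled results
// pass through untouched; anything still rejected afterward is reported to
// synthesis, which lowers confidence accordingly.
async function applyFallback(
  result: ClassifiedResult,
  retry: (taskId: string) => Promise<unknown>, // hypothetical re-run hook
  cache: Map<string, unknown>                  // hypothetical result cache
): Promise<ClassifiedResult> {
  if (result.status === "fulfilled") return result;

  switch (result.fallback) {
    case "retry-once":
      try {
        return { ...result, status: "fulfilled", data: await retry(result.taskId) };
      } catch {
        return result; // retry also failed; report the failure as-is
      }
    case "use-cached": {
      const cached = cache.get(result.taskId);
      return cached !== undefined
        ? { ...result, status: "fulfilled", data: cached }
        : result;
    }
    case "skip":
      return result; // synthesis sees the gap and adjusts confidence
  }
}
```

Keeping the strategy on the task (decided at decomposition time) rather than hard-coding it in the executor means the orchestrator can choose cheaper fallbacks for low-stakes threads.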
Graceful Failure Handling: When Subagents Break
Designing for partial success rather than all-or-nothing outcomes
Subagents will fail. APIs go down. Rate limits get hit. Context windows overflow on unexpectedly large datasets. The question is never whether something will fail, but how the system behaves when it does.[4]
The worst possible design treats failure as a binary: either all subagents succeed and you get a brief, or any failure aborts the entire run. Real analysis tolerates partial information all the time. A senior analyst who cannot access the incident log will still produce a useful brief from the other three sources — they will just note lower confidence around the infrastructure angle.
Your parallel research machine should work the same way.
| Fragile design | Resilient design |
|---|---|
| Promise.all() — one failure kills everything | Promise.allSettled() — collect all outcomes |
| Retry indefinitely until success | Retry once, then fall back to cached data or mark as unavailable |
| Silently omit failed threads from the brief | Explicitly flag missing threads with impact on confidence |
| Fixed timeout for all subagents regardless of task complexity | Per-task timeouts calibrated to expected data volume |
| No confidence adjustment when data is missing | Confidence scores decrease proportionally to missing sources |
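Per-task timeouts are easy to get by racing each subagent promise against its own timer, so a slow thread shows up as a rejected entry in Promise.allSettled() instead of stalling the whole run. This is a sketch; the withTimeout name is invented here, and it assumes your subagent call returns a plain promise.

```typescript
// Race a subagent promise against a per-task timer. On timeout the wrapper
// rejects with a descriptive error; on settlement the timer is cleared so
// the process is not kept alive by stray timeouts.
function withTimeout<T>(promise: Promise<T>, ms: number, taskId: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`subagent ${taskId} timed out after ${ms}ms`)),
      ms
    );
    promise.then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage: per-task budgets, not one global deadline.
// const results = await Promise.allSettled(
//   tasks.map((task, i) => withTimeout(subagentPromises[i], task.timeoutMs, task.id))
// );
```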
Each subagent report should include a structured status field. This gives the synthesis phase the information it needs to weight findings appropriately.
subagent-report.ts

interface SubagentReport {
taskId: string;
status: "complete" | "partial" | "failed";
completeness: number; // 0.0 to 1.0
findings: Finding[];
sourcesConsulted: string[];
sourcesUnavailable: string[];
executionTimeMs: number;
tokenUsage: number;
notes: string; // Free-text context about limitations
}

Synthesis Prompts That Produce Confidence Levels
Turning raw subagent findings into a weighted hypothesis brief
Synthesis is where the orchestrator-subagent pattern diverges from simple parallelization. A naive implementation just concatenates subagent reports and asks the LLM to summarize. A sophisticated implementation asks the LLM to reason across reports, identify correlations, weigh evidence, and assign confidence levels to competing hypotheses.
The synthesis prompt needs to accomplish five things: ingest all reports with their completeness metadata, identify convergent evidence across threads, generate ranked hypotheses, assign confidence scores, and flag gaps that reduce certainty.
synthesis-prompt.ts

function buildSynthesisPrompt(reports: SubagentReport[]): string {
const reportSummaries = reports.map(r => `
## ${r.taskId} (${r.status}, ${Math.round(r.completeness * 100)}% complete)
${r.findings.map(f => `- ${f.summary}`).join('\n')}
Sources unavailable: ${r.sourcesUnavailable.join(', ') || 'none'}
`).join('\n');
return `You are a senior research analyst synthesizing findings from
${reports.length} parallel investigation threads.
Here are the reports:
${reportSummaries}
Produce a hypothesis brief with the following structure:
1. **Top hypothesis** with confidence level (0-100%) and supporting evidence
2. **Alternative hypotheses** ranked by likelihood
3. **Evidence gaps** — what data was missing or incomplete
4. **Recommended next steps** to increase confidence
Rules for confidence scoring:
- Start at 50% (prior) and adjust based on evidence
- Convergent evidence from 2+ threads: +15-25%
- Single-thread evidence only: cap at 60%
- Each failed/partial subagent: -10% ceiling reduction
- Contradictory evidence: flag explicitly, do not average away`;
}

Worked Example: Diagnosing the Germany Conversion Drop
Walking through the full pipeline from question to hypothesis brief
Let's trace the full pipeline for our running example. The product lead asks: Why did conversion drop 8% in Germany last week?
The orchestrator decomposes this into four threads and dispatches subagents. Here is what comes back after 18 minutes of parallel execution.
1. Experiment Tracker reports back (status: complete, 100%)
Found two active experiments. Experiment DE-checkout-v3 allocated 30% of DE traffic to a new checkout flow. The variant showed a 12% drop in conversion, but the experiment was still running and had not been stopped.

2. Incident Log Analyst reports back (status: partial, 70%)
Found one P2 incident: a payment provider timeout affecting SEPA direct debit between Tuesday and Thursday. Approximately 3,200 users hit the error. Could not access Datadog metrics due to an API rate limit — duration estimate is approximate.

3. Release Notes Parser reports back (status: complete, 100%)
Identified release v4.12.0 on Monday that updated VAT calculation logic for EU countries. Changelog mentions 'corrected edge case for German reverse-charge invoices.' No rollback was issued.

4. Competitive Monitor reports back (status: complete, 95%)
Main competitor launched a 15%-off spring promotion in DACH markets on Wednesday. Campaign ran across Google Shopping and social channels. No pricing changes detected from other competitors.
The Synthesized Hypothesis Brief
How the orchestrator turns four reports into a decision-ready document
The orchestrator feeds all four reports into its synthesis prompt. Here is the output — a structured brief that a product lead can act on immediately.
Primary Hypothesis (Confidence: 72%)
- The checkout experiment DE-checkout-v3 is the dominant cause, contributing an estimated 3.6% of the 8% drop based on 30% traffic allocation and 12% variant underperformance
- Converging evidence: experiment data (thread 1) + VAT logic change (thread 3) both affect the checkout funnel in the same timeframe
- The VAT update may have introduced a display bug that compounds the experiment's negative variant

Secondary Hypothesis (Confidence: 45%)
- Competitor spring promotion siphoned intent-stage traffic starting Wednesday
- Timing aligns with the steepest part of the conversion decline
- Single-thread evidence only — capped at 60% before partial-data penalty

Contributing Factor (Confidence: 35%)
- SEPA payment timeout affected ~3,200 users, but this alone explains only 0.5-1% of the drop
- Confidence reduced because Datadog metrics were only partially available (thread 2 at 70% completeness)
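The numbers in the brief are not arbitrary; the primary hypothesis's impact estimate and the secondary hypothesis's ceiling both fall out of simple arithmetic. A sketch of that math, assuming the -10% penalty applies to the cap (the prompt's rules imply this but do not spell it out):

```typescript
// Primary hypothesis impact: 30% of DE traffic saw a variant converting 12%
// worse, so its expected contribution to the overall rate is the product.
const trafficShare = 0.30;
const variantDrop = 0.12;
const estimatedContribution = trafficShare * variantDrop; // 0.036 → 3.6 points of the 8-point drop

// Secondary hypothesis ceiling: single-thread evidence caps at 60%, and the
// one partial subagent (incident log at 70%) lowers the ceiling by 10 points.
const singleThreadCap = 60;
const partialSubagents = 1;
const ceiling = singleThreadCap - 10 * partialSubagents; // 50 — the reported 45% sits under it

console.log(`${(estimatedContribution * 100).toFixed(1)} points, ceiling ${ceiling}%`);
```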
Building Your Own Parallel Research Machine
A practical checklist for implementing the orchestrator-subagent pattern
You do not need a massive infrastructure investment to start. The pattern works with any LLM that supports tool use, and the orchestration logic fits in a few hundred lines of TypeScript or Python. Here is what matters most when building your first implementation.
Implementation Readiness Checklist
- Identify 3-5 recurring research questions your team spends >2 hours answering
- Map the data sources each question requires (APIs, databases, web search)
- Verify each data source has programmatic access (API keys, query endpoints)
- Design decomposition templates for your most common question patterns
- Set token budgets and timeouts per subagent type
- Implement Promise.allSettled() or equivalent for parallel execution
- Build structured report format with status and completeness fields
- Write synthesis prompts with explicit confidence-scoring rules
- Add logging for execution time, token usage, and failure rates per thread
- Test with intentional failures to verify graceful degradation
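The logging item on the checklist can start as something very small — a sketch that rolls per-thread numbers up from the SubagentReport fields defined earlier; the aggregate shape here is an assumption, not a prescribed format.

```typescript
interface ReportStats {
  taskId: string;
  status: "complete" | "partial" | "failed";
  executionTimeMs: number;
  tokenUsage: number;
}

// One summary object per pipeline run, so decomposition templates can be
// compared over time on failure rate, cost, and worst-case latency.
function summarizeRun(reports: ReportStats[]) {
  const failed = reports.filter(r => r.status === "failed").length;
  return {
    threads: reports.length,
    failureRate: failed / reports.length,
    totalTokens: reports.reduce((sum, r) => sum + r.tokenUsage, 0),
    slowestThreadMs: Math.max(...reports.map(r => r.executionTimeMs)),
  };
}
```

Emitting JSON.stringify(summarizeRun(reports)) after every run is enough to spot the "subagent times out repeatedly" decomposition failure mode described later.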
Decomposition Design Rules
- Each subagent thread must be executable without output from any other thread. Dependencies between threads force serial execution and negate the speed advantage of parallelization.
- Every subagent task must specify an output format, not just an objective. Structured output formats make synthesis reliable; free-form responses create unpredictable inputs for the synthesis prompt.
- Subagents must never share context windows during execution. Shared context introduces race conditions and context pollution; cross-thread insights belong in the synthesis phase.
- Timeouts must be per-task, not global. A single slow subagent should not delay the entire pipeline; per-task timeouts with fallback strategies preserve overall throughput.
- Failed subagents must report why they failed, not just that they failed. The synthesis phase needs failure context to adjust confidence scores and recommend follow-up actions accurately.
Scaling Considerations and Cost Management
What changes when you move from prototyping to production
Multi-agent systems consume roughly 15x more tokens than a single chat interaction — a realistic rough estimate that varies significantly by subagent complexity. That cost is justified when the alternative is 4-8 hours of senior analyst time, but it demands attention to efficiency.
The primary variables driving performance variance are token usage, tool call frequency, and model selection. These give you clear optimization levers. Use a capable but efficient model for subagents — Claude Sonnet 4 handles most research threads well — while reserving the most capable model (Claude Opus 4) for orchestration and synthesis where reasoning depth matters most.[2]
Caching is another high-impact optimization. Many research questions share common subqueries. If your competitive monitor subagent already scanned the market yesterday, today's run can start from cached results and only check for updates. Gartner reported a surge in multi-agent system inquiries between 2024 and 2025[6], which signals that tooling and best practices are maturing rapidly — though enterprise adoption is still early and patterns are still settling.
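A minimal version of that caching idea is a TTL-checked lookup keyed by task. This is a sketch under assumed names (CacheEntry, getOrRun are illustrative, not from any particular library):

```typescript
interface CacheEntry<T> {
  value: T;
  fetchedAt: number; // epoch ms
}

// Return a cached subagent result if it is fresh enough; otherwise run the
// subagent and cache what comes back. Yesterday's competitive scan can then
// seed today's run instead of being repeated from scratch.
async function getOrRun<T>(
  key: string,
  ttlMs: number,
  cache: Map<string, CacheEntry<T>>,
  run: () => Promise<T>
): Promise<T> {
  const hit = cache.get(key);
  if (hit && Date.now() - hit.fetchedAt < ttlMs) return hit.value;
  const value = await run();
  cache.set(key, { value, fetchedAt: Date.now() });
  return value;
}
```

A per-task TTL fits naturally on the SubagentTask interface: competitive scans might tolerate a 24-hour TTL while incident data needs minutes.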
We switched our weekly market analysis from a single long-running agent to four parallel subagents with an orchestrator. The quality of insights went up because each subagent could focus deeply on its domain, and total wall-clock time dropped from 45 minutes to under 12.
Common Pitfalls and How to Avoid Them
My subagents keep producing overlapping findings. How do I fix this?
Tighten your decomposition boundaries. Each subagent task should specify not just what to investigate, but what falls outside its scope. If the experiment tracker and release notes parser both surface the same checkout change, add explicit exclusion rules: the experiment tracker covers A/B test impacts, the release parser covers code changes and their intended behavior. Overlap in the findings is fine — overlap in the investigation wastes tokens.
How many subagents should I use? Is more always better?
Not always. Each subagent adds coordination overhead and token cost. For most business research questions, 3-5 subagents hit the sweet spot. Beyond 5, you see diminishing returns because the synthesis prompt struggles to integrate too many threads coherently. Start with fewer and add more only when you identify distinct data sources that current threads miss.
What happens when the orchestrator's decomposition is bad?
Bad decomposition is the single most common failure mode. Signs include subagents that finish instantly with no findings (too narrow), subagents that time out repeatedly (too broad), or synthesis that cannot form any hypothesis above 30% confidence. The fix: log every decomposition, review the ones that produced weak briefs, and refine your templates over time.
Can I use different models for different subagents?
Yes, and you should. Route simple data-retrieval tasks to faster, cheaper models and reserve capable models for threads requiring judgment like competitive analysis. Anthropic's own system uses Claude Opus 4 for orchestration and Claude Sonnet 4 for subagents.
How do I test this system before connecting real data sources?
Build mock subagents that return canned responses with varying status and completeness levels. This lets you stress-test synthesis and failure handling without burning API credits. Include at least one mock that returns partial results, one that fails entirely, and one with contradictory findings.
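A set of mocks matching that advice might look like the following — canned data invented purely for illustration, shaped to the SubagentReport interface from earlier (findings reduced to bare summaries):

```typescript
interface MockReport {
  taskId: string;
  status: "complete" | "partial" | "failed";
  completeness: number; // 0.0 to 1.0
  findings: { summary: string }[];
  sourcesUnavailable: string[];
}

// One clean report, one partial, one total failure, and one that contradicts
// the first — enough to exercise synthesis weighting and graceful degradation.
const mockReports: MockReport[] = [
  { taskId: "experiment-tracker", status: "complete", completeness: 1.0,
    findings: [{ summary: "Variant B conversion down 12% since Monday" }],
    sourcesUnavailable: [] },
  { taskId: "incident-log", status: "partial", completeness: 0.7,
    findings: [{ summary: "Payment timeouts Tuesday through Thursday" }],
    sourcesUnavailable: ["datadog"] },
  { taskId: "release-notes", status: "failed", completeness: 0.0,
    findings: [], sourcesUnavailable: ["github"] },
  { taskId: "competitive-monitor", status: "complete", completeness: 1.0,
    // Deliberately contradicts thread 1's Monday onset
    findings: [{ summary: "No conversion-relevant market events before Wednesday" }],
    sourcesUnavailable: [] },
];
```

Feed these into your synthesis prompt and check three things: the failed thread lowers the confidence ceiling, the partial thread is flagged, and the contradiction surfaces explicitly rather than being averaged away.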
Start With One Question, Then Expand
The orchestrator-subagent pattern is not something you adopt wholesale on a Monday morning. Pick one recurring research question that currently takes your team more than two hours. Map its data sources. Write the decomposition template. Build a minimal orchestrator that spawns subagents, collects reports, and runs a synthesis prompt.
That first implementation will be rough. The decomposition will not be perfect. A subagent will fail in a way you did not anticipate. The synthesis prompt will produce confidence levels that feel arbitrary. All of that is normal, and all of it improves quickly with iteration.
What will not feel rough is the speed. The first time you get a weighted hypothesis brief in 20 minutes that would have taken a full morning of manual research, the pattern sells itself. From there, you will find yourself decomposing every complex question into parallel threads — not because you have to, but because serial research starts feeling like unnecessary friction.
- [1] Anthropic — Building a Multi-Agent Research System (anthropic.com)
- [2] Microsoft — Multi-Agent Orchestrator and Sub-Agent Architecture (learn.microsoft.com)
- [3] Eesel — Subagent Orchestration Patterns (eesel.ai)
- [4] Maxim AI — Multi-Agent System Reliability: Failure Patterns, Root Causes, and Production Validation Strategies (getmaxim.ai)
- [5] Kanerika — AI Agent Orchestration (kanerika.com)
- [6] Machine Learning Mastery — 7 Agentic AI Trends to Watch in 2026 (machinelearningmastery.com)