Single vs. Multi-Agent: The Cognition-Anthropic Schism and Why Both Are Right

I recently argued that multi-agent AI is a trap. The information theory is clean, the benchmarks are damning, and the cost data makes the case on its own. But while I was building that argument, I kept running into a problem: Anthropic's production data tells a different story, and it's not wrong.

Two production teams at the frontier of agent engineering have published conclusions that directly contradict each other. Cognition says multi-agent architectures are structurally fragile: compounding errors across agent boundaries cause systematic information loss, and single-threaded linear execution is the fix. Anthropic says multi-agent delegation is a context-compression primitive: an orchestrator-worker setup with Claude Opus 4 leading Claude Sonnet 4 subagents beat single-agent Claude Opus 4 by 90.2% on their internal research eval, and parallel search cut research time by up to 90% on complex queries. The cost is real too: Anthropic reports that its multi-agent systems burn roughly 15x the tokens of a chat interaction, against roughly 4x for a single agent. But the performance lift is also real.

Both cite production metrics. Both are credible. And both are right, for the model version they tested on.

This isn't a debate that needs a winner. It's a category error that needs a reframe. The optimal agent architecture is model-version-specific, with a knowledge half-life measured in months. The teams defending their position as universal truth are optimizing for a snapshot that's already expiring.

Two Failure Modes, Not Two Opinions

The schism dissolves the moment you notice that Cognition and Anthropic are protecting against different failure modes.

Cognition's primary concern is context fidelity loss at handoffs. Every time work crosses an agent boundary, the receiving agent gets a compressed version of what the sending agent knew. That compression is lossy. In stateful coding tasks, the lost details compound: a variable name drops out of a summary, a constraint gets paraphrased into ambiguity, and three handoffs later the system is solving a different problem than the one it started with. Cognition's prescription, single-threaded linear execution with aggressive history compression, eliminates handoff points entirely. No boundaries, no boundary errors.
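A toy model makes the compounding visible. The sketch below assumes, purely for illustration, that every handoff independently preserves the same fraction of task-relevant detail; the 90% retention rate and the function name are hypothetical, not figures from Cognition.

```python
def retained_after_handoffs(per_handoff_retention: float, handoffs: int) -> float:
    """Fraction of task-relevant detail surviving a chain of lossy handoffs.

    Toy assumption: each handoff keeps the same fraction, independently.
    """
    if not 0.0 <= per_handoff_retention <= 1.0:
        raise ValueError("per_handoff_retention must be between 0 and 1")
    if handoffs < 0:
        raise ValueError("handoffs must be non-negative")
    return per_handoff_retention ** handoffs


if __name__ == "__main__":
    # Hypothetical 90% retention per handoff.
    for n in (0, 1, 3, 6):
        print(n, round(retained_after_handoffs(0.9, n), 3))
```

Under that assumption, three handoffs at 90% retention leave about 73% of the original detail, and six leave about 53%. Real loss is neither uniform nor independent, but the multiplicative shape is why a single dropped constraint matters more the deeper the chain goes.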

Anthropic's primary concern is context exhaustion in long-horizon tasks. A single agent working a complex problem accumulates tens of thousands of intermediate tokens: tool outputs, reasoning traces, partial results. Without delegation, those tokens crowd the context window, dilute attention, and cause semantic drift. Anthropic's subagents solve this by acting as compression functions: they explore a subtask, consume 10,000+ tokens of intermediate state, and return a focused 1,000-to-2,000-token summary. The primary agent never sees the noise. Its context stays clean.
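The compression-function idea can be sketched in a few lines. This is not Anthropic's implementation: the `Orchestrator` class, the word-count tokenizer, and the truncating `compress` function are stand-ins I've assumed for illustration, with truncation standing in for a model-written summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List

SUMMARY_BUDGET = 2_000  # upper end of the summary size quoted above


def count_tokens(text: str) -> int:
    """Crude tokenizer stand-in: one whitespace-separated word is one token."""
    return len(text.split())


def compress(text: str, budget: int = SUMMARY_BUDGET) -> str:
    """Placeholder for model-written summarization: keep the first `budget` tokens."""
    return " ".join(text.split()[:budget])


@dataclass
class Orchestrator:
    context: List[str] = field(default_factory=list)

    def delegate(self, subtask: str, subagent: Callable[[str], str]) -> None:
        # The subagent's intermediate state exists only inside this call.
        intermediate = subagent(subtask)
        # Only the bounded summary is appended to the orchestrator's context.
        self.context.append(compress(intermediate))

    def context_tokens(self) -> int:
        return sum(count_tokens(chunk) for chunk in self.context)


def noisy_subagent(subtask: str) -> str:
    """Hypothetical subagent emitting 10,000 tokens of intermediate state."""
    return " ".join(f"{subtask}-note-{i}" for i in range(10_000))


if __name__ == "__main__":
    lead = Orchestrator()
    for subtask in ("search", "verify", "compare"):
        lead.delegate(subtask, noisy_subagent)
    print(lead.context_tokens())  # three summaries of 2,000 tokens each
```

The orchestrator's context grows by the summary budget per subtask, not by the subagent's working set. The same sketch also shows Cognition's objection: `compress` discards 8,000 tokens per call, and nothing guarantees the surviving 2,000 are the ones the next step needs.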

Dimension	Cognition (Single-Agent)	Anthropic (Multi-Agent)
Failure mode avoided	Information loss at agent boundaries	Context exhaustion and attention dilution
Failure mode accepted	Risk of context window overload	Risk of lossy inter-agent compression
Architecture	Single-threaded linear, aggressive history compression	Hierarchical delegation, bounded subagent summaries
Implicit assumption	The model can handle long contexts reliably	The model can delegate and reintegrate reliably

Neither architecture is wrong. Each one trades away the failure mode the other prevents. The question isn't "which is better?" It's "which failure mode is more dangerous for your task, your context length, and your model version?"

Architectural Best Practices Have a Six-Month Half-Life

I've spent enough time on both sides of the monolith-vs-microservices argument in traditional software to know that architectural debates usually settle into "it depends" and then stay there. This one is different. The correct answer changes with every major model release.

Consider the trajectory. In 2024, models struggled with long contexts. "Lost in the middle" effects were measurable and significant. Multi-agent decomposition was a rational workaround: smaller contexts per agent meant less attention dilution. The architectural advice was sound: break the work up.

By mid-2025, context windows had grown and utilization had improved. In controlled experiments with matched thinking-token budgets, single-agent systems consistently matched or outperformed multi-agent systems on multi-hop reasoning tasks. In controlled tests, a single agent succeeded 28 out of 28 times, while multi-agent setups failed 36 to 100 percent of the time. The advice flipped: keep it simple.

Then Anthropic published data from their code-execution patterns showing a 98.7% token reduction (150,000 tokens down to 2,000) by delegating tool-use subtasks to code-generating subagents. The advice didn't just flip again. It forked. For research-heavy parallel search, multi-agent won, at ~15x the token cost. For stateful sequential coding, single-agent won. Same model family, different task shape, different correct architecture, and now a cost dimension that determines which fork a given team should take.

This isn't oscillation. It's a direct consequence of how model capabilities evolve. Each new model version changes the capability floor, which changes which failure mode is binding, which changes the optimal architecture. Best practices from six months ago are empirically questionable for current frontier models. The knowledge half-life isn't a metaphor; it's a measurable property of a discipline where the substrate changes faster than the consensus can stabilize.
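If you take the half-life literally, it behaves like any other exponential decay. The sketch below is my own framing, not a published measurement: it assumes architectural guidance loses validity at a constant rate with a six-month half-life, and both function names are hypothetical.

```python
import math


def fraction_still_valid(months_elapsed: float, half_life_months: float = 6.0) -> float:
    """Share of architectural guidance expected to still hold after some months."""
    if months_elapsed < 0 or half_life_months <= 0:
        raise ValueError("months_elapsed must be >= 0 and half_life_months > 0")
    return 0.5 ** (months_elapsed / half_life_months)


def months_until(threshold: float, half_life_months: float = 6.0) -> float:
    """Months until the still-valid share decays to `threshold` (0 < threshold <= 1)."""
    if not 0.0 < threshold <= 1.0 or half_life_months <= 0:
        raise ValueError("threshold must be in (0, 1] and half_life_months > 0")
    return half_life_months * math.log2(1.0 / threshold)


if __name__ == "__main__":
    for months in (3, 6, 12, 18):
        print(months, round(fraction_still_valid(months), 3))
    print(months_until(0.25))
```

Under this assumption roughly 71% of guidance still holds after three months, half after six, and a quarter after a year. The constant-rate assumption is the weak point: in practice validity drops in steps at model releases, not smoothly.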

The Bitter Lesson, applied to agent scaffolding, makes this trajectory predictable. No-code workflow builders that teams were deploying as recently as early 2025 are already obsolete, replaced by single long-horizon agents that plan, act, and reflect in continuous loops. Planner-executor scaffolds that decomposed reasoning into explicit stages have merged into interleaved execution. Multi-agent orchestrator-critic graphs survive only for true parallelism or strict context isolation. Every piece of scaffolding you build is a bet on a specific model capability gap, and those gaps close on a quarterly cadence. Martian's ARES framework takes this further: by fine-tuning model weights to specific harness configurations, it couples architecture and model so tightly that switching either one invalidates the other. The half-life isn't just about model upgrades anymore; it's about the entire model-harness pairing.

The Code-Beats-Prompts Catalyst

If the half-life claim sounds abstract, one recent shift makes it concrete. It also happens to directly change the single-vs-multi-agent calculus.

In 2024, the standard tool-use pattern was direct MCP calling: agents invoked tools through JSON schema definitions, passing structured parameters and receiving structured responses. This was presented as a stability and auditability win. The consensus was clear: many narrow, well-defined tools.

By 2026, Anthropic's production data showed that replacing direct MCP tool calls with filesystem-based TypeScript API generation achieved a 98.7% token reduction. Cloudflare validated this independently with their Code Mode implementation, pointing to a simple explanation: pretraining corpora contain orders of magnitude more programming code than JSON specification documents. Models are natively better at generating and reasoning about code than about schema definitions.
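The token arithmetic behind that shift can be sketched as follows. This is not Anthropic's or Cloudflare's implementation; the tool names and per-definition token counts in the registry are invented for illustration. It contrasts loading every tool definition into context up front with reading only the definitions a task actually needs, then checks the percentage quoted above.

```python
from typing import Dict, Iterable

# Hypothetical registry: tool name -> tokens needed to describe it in context.
TOOL_DEFINITION_TOKENS: Dict[str, int] = {
    "crm.get_account": 900,
    "crm.update_record": 1_100,
    "drive.get_document": 800,
    "drive.search": 1_000,
    "mail.send": 700,
}


def upfront_cost(registry: Dict[str, int]) -> int:
    """Tokens spent when every tool definition is loaded before the task starts."""
    return sum(registry.values())


def on_demand_cost(registry: Dict[str, int], needed: Iterable[str]) -> int:
    """Tokens spent when only the definitions the task uses are read."""
    return sum(registry[name] for name in needed)


def reduction_percent(before: int, after: int) -> float:
    """Percentage of tokens saved going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    needed = ["drive.get_document", "crm.update_record"]
    before = upfront_cost(TOOL_DEFINITION_TOKENS)
    after = on_demand_cost(TOOL_DEFINITION_TOKENS, needed)
    print(before, after, round(reduction_percent(before, after), 1))
    # The figures quoted in the text: 150,000 tokens down to 2,000.
    print(round(reduction_percent(150_000, 2_000), 1))
```

With five toy tools the saving is modest. The quoted 98.7% comes from the same formula applied to a catalog large enough that the unused definitions dominate the context.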

This reversal matters for the schism because it changes the economics of delegation. When tool-use meant 150,000 tokens of JSON context, delegating to a subagent was expensive and the inter-agent message overhead was a real cost. When tool-use means 2,000 tokens of generated code, subagent delegation becomes cheap and context-clean. The shift tips the calculus toward Anthropic's position, but only for models with strong code generation capabilities. A model that can't reliably generate correct TypeScript API calls doesn't benefit from this pattern at all.

Notice what happened to the schism's tradeoff table. When tool-use tokens drop 98.7%, context exhaustion becomes far less likely in single-agent flows. That weakens the case for Anthropic's delegation-as-compression pattern on those workloads. But the same code-generation capability makes subagent coordination cheaper and cleaner, which strengthens the case for delegation on parallel-search workloads. The same capability shift helps both sides and hurts both sides, depending on the task shape.

The takeaway isn't that code-generation is "better" than JSON schemas. It's that a seemingly settled tool-use pattern flipped within 18 months, and the flip changed which side of the single-vs-multi-agent debate is correct for a given workload. Architectural recommendations don't degrade gracefully; they break when underlying capabilities cross thresholds.

What This Means for Your Architecture Decisions

If the optimal architecture is model-version-dependent, the most expensive mistake isn't picking the wrong side of the schism. It's picking a side at all and then failing to re-evaluate.

This is the framework I've been using, and it's the only one I've seen hold up across model versions. It's an evaluation sequence, not an architecture diagram:

1. Baseline with a single agent. Measure its performance on your actual tasks. You need a concrete number, not a feeling. Many teams skip this step entirely and jump to multi-agent because it feels more sophisticated.

2. Optimize retrieval and tool design. Most performance problems that look like architecture problems are actually context problems. Better retrieval, structured context files, progressive disclosure, tighter tool definitions. Each of these is cheaper than adding agents, and many teams discover the "need" for multi-agent disappears after this step.

3. Re-measure. If there's still a quantified gap (say, 89% accuracy against a 97% requirement), you now have a measured delta that justifies adding complexity.

4. If you go multi-agent, go thin. Two agents with a strict state machine outperform six debating agents. Deterministic handoffs beat free-form collaboration. Production data backs this up: one team reduced from six debating agents to two with a state machine and cut costs by 95% and latency by 83% while losing less than 1% accuracy.

5. Re-evaluate on every model upgrade. This is the step the schism teaches. When you move from Opus 4.7 to 4.8, your architecture decision needs retesting. Harness-only changes can swing performance by nearly 14 percentage points, and what constitutes the right harness changes with the model. If the knowledge half-life is six months, re-evaluate architecture with every major model release, and at least quarterly even when no release forces it. Not every sprint, but not every year either.
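The five steps above reduce to a small decision gate. Here's a minimal sketch, assuming accuracy is the metric you care about; the `Decision` record, the action labels, and the model version strings are hypothetical names, and the 82% baseline in the usage example is invented (the 89% and 97% figures come from step 3).

```python
from dataclasses import dataclass

KEEP_SINGLE = "keep single agent"
KEEP_OPTIMIZED = "keep single agent with improved retrieval and tools"
GO_THIN_MULTI = "consider a thin multi-agent design"


@dataclass(frozen=True)
class Decision:
    action: str
    measured_gap: float  # required accuracy minus best single-agent accuracy
    tested_on: str       # model version the decision was measured against


def decide(baseline_acc: float, optimized_acc: float,
           required_acc: float, model_version: str) -> Decision:
    # Step 1: a single-agent baseline that already meets the bar ends the search.
    if baseline_acc >= required_acc:
        return Decision(KEEP_SINGLE, 0.0, model_version)
    # Steps 2-3: re-measure after retrieval and tool-design fixes.
    if optimized_acc >= required_acc:
        return Decision(KEEP_OPTIMIZED, 0.0, model_version)
    # Step 4: only a measured gap justifies adding agents.
    return Decision(GO_THIN_MULTI, required_acc - optimized_acc, model_version)


def is_stale(decision: Decision, current_model_version: str) -> bool:
    # Step 5: a decision is trusted only for the model it was measured on.
    return decision.tested_on != current_model_version


if __name__ == "__main__":
    decision = decide(0.82, 0.89, 0.97, model_version="model-a")
    print(decision.action, f"{decision.measured_gap:.0%}")
    print(is_stale(decision, "model-b"))
```

Storing `tested_on` next to the decision is the point of the exercise: it turns "we chose multi-agent" into "we chose multi-agent for this model", which is the only version of the claim that stays true.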

The build-to-delete principle applies here with force. Every multi-agent handoff, every subagent delegation pattern, every orchestration layer should be designed for removal. Not because multi-agent is wrong, but because the specific multi-agent pattern that's right today may be unnecessary in six months. Models are internalizing capabilities that currently require explicit scaffolding. The trajectory is predictable even if the timeline isn't: thinner harnesses, fewer explicit orchestration layers, more capability pushed into the model itself.
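One way to design for removal is to put the single-agent path and the two-agent path behind the same call signature, so that deleting the orchestration layer is a one-line configuration change. The sketch below assumes agents are plain text-in, text-out callables; `Agent`, `State`, and `build_pipeline` are illustrative names, and the stub agents stand in for real model calls.

```python
from enum import Enum
from typing import Callable, Dict

Agent = Callable[[str], str]  # hypothetical interface: text in, text out


class State(Enum):
    DRAFT = "draft"
    REVIEW = "review"
    DONE = "done"


# Deterministic handoffs: every non-terminal state has exactly one successor.
TRANSITIONS: Dict[State, State] = {
    State.DRAFT: State.REVIEW,
    State.REVIEW: State.DONE,
}


def run_single(task: str, worker: Agent) -> str:
    return worker(task)


def run_two_stage(task: str, worker: Agent, reviewer: Agent) -> str:
    handlers: Dict[State, Agent] = {State.DRAFT: worker, State.REVIEW: reviewer}
    state, artifact = State.DRAFT, task
    while state is not State.DONE:
        artifact = handlers[state](artifact)  # run the agent that owns this state
        state = TRANSITIONS[state]            # fixed transition, no negotiation
    return artifact


def build_pipeline(use_reviewer: bool, worker: Agent, reviewer: Agent) -> Agent:
    """Both variants share one signature, so callers never see the difference."""
    if use_reviewer:
        return lambda task: run_two_stage(task, worker, reviewer)
    return lambda task: run_single(task, worker)


if __name__ == "__main__":
    worker = lambda text: text + " [drafted]"
    reviewer = lambda text: text + " [reviewed]"
    print(build_pipeline(True, worker, reviewer)("fix bug"))
    print(build_pipeline(False, worker, reviewer)("fix bug"))
```

When a model upgrade makes the reviewer redundant, flipping `use_reviewer` and re-running the evaluation tells you whether the second agent still earns its tokens. Nothing downstream has to change.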

The Only Durable Position

Cognition will be right again. Anthropic will be right again. Both will be wrong again. The next model generation will change the context-utilization characteristics that determine which failure mode is binding, and the architectural advice will shift accordingly.

This is uncomfortable for teams that want a settled answer. It means agent architecture is not a design decision you make once. It's a continuous empirical discipline, closer to performance tuning than to system design. You measure, you architect for the current model's capabilities, and you build the evaluation infrastructure to know when the ground shifts.

The position worth defending isn't single-agent or multi-agent. It's that architectural choices in agent systems are timestamps, not truths. The teams that treat them that way will spend less time debating and more time measuring. And when the next model drops and the calculus changes, they'll be the ones who notice first.