Context Engineering: The Missing Layer

A recent paper tested a dead-simple technique: repeat the entire input prompt before the model generates a response. Across seven models and seven benchmarks (70 comparisons in total), prompt repetition won 47, lost zero, and tied 23. On one task, accuracy jumped from 21% to 97%. The researchers filed this under "prompting." It isn't. The technique works because, during the prefill pass, every token in the second copy of the prompt can attend to the entire first copy, which sidesteps a limitation of causal attention. The improvement has nothing to do with what you say and everything to do with how information is structured in the context window.

Why repetition helps: LLMs use causal attention, so each token can attend only to the tokens before it, never to those after. In a single copy of the prompt, token 500 can't directly use information from token 800. When the prompt is repeated, every token in the second copy can attend to the complete first copy during the prefill pass, which approximates bidirectional context within a unidirectional architecture.
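To make the mechanism concrete, here is a minimal sketch that models only the causal visibility rule (position i can see positions 0 through i) and checks which positions can see at least one copy of every prompt token. It is a toy illustration under that single assumption, not the paper's code; the function names are made up for this example.

```python
def causal_visibility(seq_len: int) -> list[set[int]]:
    """For each position, the set of positions it may attend to under causal attention."""
    return [set(range(i + 1)) for i in range(seq_len)]


def sees_whole_prompt(prompt_len: int, copies: int) -> list[bool]:
    """For each position in the (possibly repeated) input, report whether it can
    attend to at least one copy of every prompt token."""
    total = prompt_len * copies
    visible = causal_visibility(total)
    results = []
    for pos in range(total):
        # Map absolute positions back to the prompt token they are a copy of.
        prompt_tokens_seen = {p % prompt_len for p in visible[pos]}
        results.append(len(prompt_tokens_seen) == prompt_len)
    return results


single = sees_whole_prompt(prompt_len=4, copies=1)
doubled = sees_whole_prompt(prompt_len=4, copies=2)
print(single)   # only the final position sees every prompt token
print(doubled)  # every position in the second copy sees every prompt token
```

With one copy, only the last position has the whole prompt in view. With two copies, every position in the second copy does, which is the structural change the technique relies on.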

That distinction, between crafting clever instructions and engineering the information environment, is the fault line between prompt engineering and context engineering. And most teams building AI agents are still on the wrong side of it.

No Amount of Prompt Refinement Overcomes Context Poverty

Prompt engineering treats the model as a question-answering system: if you phrase the question correctly, you get the right answer. Context engineering treats it as a reasoning engine operating inside an information environment. The question shifts from "what do I ask?" to "what does the system know, and when does it know it?"

This reframing matters because of the prompt engineering ceiling. A perfectly worded prompt with incomplete information will almost always produce worse results than a mediocre prompt with complete context. Teams investing in prompt optimization are often solving the wrong problem. The model doesn't need better instructions; it needs better information.

The prompt repetition result illustrates this precisely. Everyone who encountered that technique classified it as a prompting trick. But the mechanism is architectural: the second copy of the prompt creates bidirectional attention across all tokens, something causal attention in a single pass can't achieve. It's a context engineering insight that got filed in the wrong drawer.

This distinction has organizational consequences. "Prompt engineer" as a job title is already declining. The roles replacing it ("context architect," "agent engineer") reflect the shift from wordsmithing to systems design. The most valuable AI system design decisions are invisible to users: they're about information architecture, not instruction phrasing.

Context Engineering Is Five Distinct Architectural Concerns

Context engineering isn't a single technique. It's five independent dimensions, each requiring different engineering investment:

  1. System prompt design — establishing the right altitude of guidance without over-constraining the model
  2. Tool and skill definitions — the agent's interface to the world, where schema bloat directly taxes the context budget. Skills (reusable, composable instruction sets that agents load on demand) have rapidly overtaken monolithic tool schemas as the preferred pattern here, precisely because they're more context-efficient
  3. Retrieval strategy — what information to fetch, when to fetch it, and at what granularity
  4. Memory management — how to persist and recall information across turns and sessions
  5. Context window hygiene — compaction, structured notes, fresh subagent contexts, and active noise management

This taxonomy isn't one vendor's framework. LangChain's Deep Agents documentation independently converges on a similar five-category model: input context, runtime context, compression, isolation, and long-term memory. When two teams building production agent frameworks independently arrive at closely matching decompositions, that's a signal. The structure is emergent, not imposed.

Most teams I've talked to have invested seriously in one or two of these dimensions (usually retrieval and system prompts) and treat the others as afterthoughts. The practical audit: score your system 1-5 on each dimension. If three of them are below a 2, your "prompting problems" are actually context architecture problems.
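The audit is simple enough to write down. The sketch below is one possible encoding of it, assuming a 1-5 score per dimension and the "three below a 2" rule from the paragraph above; the dimension keys and the `audit` function are hypothetical names chosen for this example.

```python
DIMENSIONS = (
    "system_prompt",
    "tools_and_skills",
    "retrieval",
    "memory",
    "window_hygiene",
)


def audit(scores: dict[str, int], weak_below: int = 2, weak_count: int = 3) -> dict:
    """Flag dimensions scoring below `weak_below` on a 1-5 scale."""
    for name in DIMENSIONS:
        if name not in scores:
            raise ValueError(f"missing score for {name}")
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be scored 1-5")
    weak = [name for name in DIMENSIONS if scores[name] < weak_below]
    # Three or more weak dimensions suggests an architecture problem, not a prompting one.
    return {"weak": weak, "context_architecture_problem": len(weak) >= weak_count}


report = audit({
    "system_prompt": 4,
    "tools_and_skills": 1,
    "retrieval": 4,
    "memory": 1,
    "window_hygiene": 1,
})
print(report)
```

The example scores describe the common case: solid retrieval and system prompts, with everything else neglected.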

Progressive disclosure deserves specific attention, and skills are the clearest example. Rather than injecting every tool schema and instruction set into the system prompt at startup, agent frameworks increasingly load skills on demand. The Claude Code team found that skills files could point to other files, which point to deeper references, letting the agent discover context through recursive exploration. The agent loads a skill's full instructions only when it determines the skill is relevant. This is a fundamentally different architecture from pre-loading everything into the context window, and it scales because the agent only pays the token cost for capabilities it actually uses.
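A rough sketch of the on-demand pattern follows. It is not how Claude Code or any particular framework implements skills; `Skill`, `SkillRegistry`, and the example skill names are hypothetical. The idea it illustrates is that only a one-line summary per skill sits in the context at startup, and the full body is fetched the first time the agent asks for it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    name: str
    summary: str                  # one line, always present in the context
    load_body: Callable[[], str]  # full instructions, fetched only on demand


@dataclass
class SkillRegistry:
    skills: dict[str, Skill] = field(default_factory=dict)
    loaded: dict[str, str] = field(default_factory=dict)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def index(self) -> str:
        """Compact listing injected at startup instead of every full skill."""
        return "\n".join(f"- {s.name}: {s.summary}" for s in self.skills.values())

    def load(self, name: str) -> str:
        """Fetch a skill's full instructions the first time the agent asks for it."""
        if name not in self.loaded:
            self.loaded[name] = self.skills[name].load_body()
        return self.loaded[name]


registry = SkillRegistry()
registry.register(Skill("pdf-forms", "Fill and validate PDF forms",
                        lambda: "Full PDF form instructions would go here."))
registry.register(Skill("sql-review", "Review SQL migrations",
                        lambda: "Full SQL review checklist would go here."))

print(registry.index())             # cheap: one short line per skill
print(registry.load("sql-review"))  # pays the token cost for one skill only
```

In a real system `load_body` would read a file or follow a reference to a deeper document, which is what makes the recursive exploration described above possible.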

Context Rot Is the Dominant Failure Mode in Long-Running Agents

As a context window fills with accumulated conversation, tool outputs, and intermediate state, model decision quality degrades measurably. The model doesn't forget. It attends to everything, including noise. This is context rot, and it's the single most common failure mode in production agent systems.

The symptoms are specific and recognizable:

  • Repetitive failure loops — the agent repeats the same incorrect fix after you've already rejected it
  • Instruction bleed — older constraints resurface and conflict with current instructions
  • Accuracy decay — previously correct logic becomes incomplete or wrong
  • Reasoning truncation — the agent skips steps it was handling correctly earlier
  • Permission fatigue — the agent proposes low-value actions due to degraded judgment

If you've run an agent session long enough, you've seen at least three of these. They're not random failures. They're the predictable consequence of a saturated context window where the signal-to-noise ratio has dropped below a usable threshold.

Larger context windows make this worse, not better.

More capacity means more room for noise accumulation before hitting the hard limit, which means the rot progresses further before anyone notices. The agent keeps running, keeps producing output, but the output is progressively less useful. In enterprise environments, the noise extends well beyond conversation history: raw telemetry from dozens of heterogeneous systems, tool schemas consuming tens of thousands of tokens before any task execution begins. A single GitHub MCP server schema, for instance, consumes roughly 55,000 tokens of context budget before the agent has done anything at all.

Production systems that handle this well share one architectural principle: structure work so each execution unit gets a fresh context rather than managing one long session better. The GSD framework targets 30-40% context usage in its orchestrator even through large parallel workloads. That remaining capacity is headroom that prevents rot.
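One way to operationalize that headroom is a budget check in the orchestrator. The sketch below assumes a 200,000-token window and uses 40% as the ceiling, both illustrative choices; `ContextBudget` is a hypothetical helper, not part of GSD or any other framework.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window_tokens: int
    target_ratio: float = 0.40  # upper end of the 30-40% target mentioned above
    used_tokens: int = 0

    def add(self, tokens: int) -> None:
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.used_tokens += tokens

    @property
    def usage(self) -> float:
        return self.used_tokens / self.window_tokens

    def should_delegate(self) -> bool:
        """True once the orchestrator should hand work to a fresh context."""
        return self.usage > self.target_ratio


budget = ContextBudget(window_tokens=200_000)  # assumed window size
budget.add(55_000)  # e.g. a large tool schema loaded before any work starts
print(f"{budget.usage:.1%}", budget.should_delegate())
budget.add(30_000)  # accumulated conversation and tool output
print(f"{budget.usage:.1%}", budget.should_delegate())
```

The point of the check is timing: delegation is triggered well before the hard limit, while the orchestrator's context is still mostly signal.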

The Compaction vs. Retention Schism Is Unresolved

The industry hasn't converged on how to manage context over long agent sessions, and the two leading approaches fail in different ways.

Compaction (Anthropic's approach) applies single-pass summarization when the context window fills. It preserves architectural decisions and recently accessed files while compressing conversational history into a dense summary. The advantage is operational simplicity and predictable token consumption. The failure mode is silent: a specific error code, configuration parameter, or dependency version gets summarized out of existence, and subsequent tool calls revert to previously failed states. Because the compressed representation replaces the original history, diagnosing where information was lost is difficult.
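The silent failure is easy to reproduce in miniature. This sketch is a generic single-pass compactor, not Anthropic's implementation; the `compact` function, the deliberately naive summarizer, and the error code `E4012` are all invented for illustration.

```python
from typing import Callable


def compact(history: list[str], keep_recent: int,
            summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace everything except the most recent messages with one summary line."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(history) <= keep_recent:
        return list(history)
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [f"[summary] {summarize(older)}"] + recent


def naive_summary(messages: list[str]) -> str:
    # Stand-in for a model call; it keeps the gist and drops specifics.
    return f"{len(messages)} earlier messages about a failing deploy"


history = [
    "user: deploy fails on staging",
    "tool: migrate step exited with code E4012",
    "assistant: retried with a clean cache, same failure",
    "user: try the rollback script",
    "tool: rollback completed",
]
compacted = compact(history, keep_recent=2, summarize=naive_summary)
print(compacted)
print(any("E4012" in m for m in compacted))  # the error code did not survive
```

Nothing in the compacted window signals that a detail was dropped, which is why this failure is hard to diagnose after the fact.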

Retention (LangChain's Deep Agents approach) offloads original messages and tool outputs to the filesystem, preserving exact token sequences for later retrieval. Summarization happens only when storage thresholds force it. The advantage is exact state recovery. The failure mode is architectural complexity: the pipeline must manage filesystem I/O, retrieval routing, and synchronization between the active window and the persistent store.
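For contrast, a minimal retention sketch: the exact message is written to disk and only a short pointer stays in the window. This is not the Deep Agents API; `RetentionStore`, the pointer format, and the message contents are hypothetical, and the sketch assumes message ids are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path


class RetentionStore:
    """Keeps exact message text on disk and hands back a short pointer."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, message_id: str, content: str) -> str:
        path = self.root / f"{message_id}.json"
        path.write_text(json.dumps({"id": message_id, "content": content}),
                        encoding="utf-8")
        return f"[offloaded:{message_id}]"  # this pointer is all that stays in the window

    def recall(self, message_id: str) -> str:
        path = self.root / f"{message_id}.json"
        return json.loads(path.read_text(encoding="utf-8"))["content"]


with tempfile.TemporaryDirectory() as tmp:
    store = RetentionStore(Path(tmp))
    raw = "tool: migrate step exited with code E4012, libfoo==2.3.1"
    pointer = store.offload("msg-0002", raw)
    window = ["user: deploy fails on staging", pointer]
    recovered = store.recall("msg-0002")
    print(window)
    print(recovered == raw)
```

Even this toy version hints at the complexity cost: something has to decide when to call `recall`, and the store and the window must stay in sync.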

There's a useful analogy here. Compaction operates like momentum in gradient descent — it smooths the agent's trajectory across sequential interactions, retaining high-signal state while discarding stochastic noise. Well-tuned compaction stabilizes long-horizon behavior. Naive full-transcript logging, by contrast, is the equivalent of zero momentum: every noisy gradient gets full weight, and the optimization path oscillates.

I wouldn't pick a winner here because the right choice depends on your error tolerance. If your agent's failure mode is silent data loss (tasks that fail because a critical detail got compressed away), you're compacting too aggressively. If your failure mode is infrastructure complexity (debugging synchronization bugs between your context store and your inference engine), you're retaining more than you need.

What both camps are converging toward is a three-tier memory architecture: a lean short-term context for the active task, a temporary working memory for the current session, and agent-controlled persistent long-term storage. This compartmentalization lets you compact the short-term tier aggressively while retaining exact records in the persistent tier for retrieval when needed.
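The three tiers can be sketched as a small data structure. This is an illustrative shape under simple assumptions (in-memory lists and a dict standing in for real storage), with hypothetical names throughout.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    short_term: list[str] = field(default_factory=list)      # active task context
    working: list[str] = field(default_factory=list)         # current session
    long_term: dict[str, str] = field(default_factory=dict)  # persists across sessions

    def observe(self, message: str) -> None:
        self.short_term.append(message)
        self.working.append(message)

    def remember(self, key: str, value: str) -> None:
        """Agent-controlled write of an exact record to the persistent tier."""
        self.long_term[key] = value

    def compact_short_term(self, keep_recent: int) -> None:
        """Drop all but the most recent short-term messages."""
        if keep_recent < 0:
            raise ValueError("keep_recent must be non-negative")
        start = max(0, len(self.short_term) - keep_recent)
        self.short_term = self.short_term[start:]

    def end_session(self) -> None:
        self.short_term.clear()
        self.working.clear()


memory = TieredMemory()
for msg in ["user: deploy fails", "tool: exit code E4012", "assistant: retrying"]:
    memory.observe(msg)
memory.remember("staging.migrate.error_code", "E4012")  # exact detail, kept verbatim
memory.compact_short_term(keep_recent=1)
print(memory.short_term, len(memory.working), memory.long_term)
```

Aggressive compaction of the short-term tier is safe here only because the exact detail was written to the persistent tier first.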

Enterprise Context Engineering Is a Different Problem Entirely

In developer tools, context engineering means managing conversation history, tool schemas, and code. In enterprise environments, the context problem is fundamentally harder: raw telemetry from heterogeneous monitoring systems, configuration state across dozens of services, incident timelines that require cross-system alignment, and the institutional knowledge of how these systems actually behave in production.

Enterprise AI agents fail in production not because the model is too weak, but because it lacks contextual understanding. The model is the engine; context is the car. Most enterprise AI stacks are missing the architectural primitive that connects the two: a context engine that filters, aligns, and sequences operational data into task-relevant inputs.

Consider SRE diagnosis. Platforms that query live telemetry at investigation time produce fundamentally different results from those relying on pre-indexed snapshots. The live-query approach mirrors how expert SREs actually debug: they pull specific metrics, correlate traces, and reason causally across temporal sequences. A pre-indexed snapshot can't capture the temporal relationships that make root cause analysis possible.

Recent research formalizes this insight. The DT-MDP-CE framework treats context engineering as a reinforcement learning problem: learn a reward function from mixed-quality investigation trajectories, then use the resulting policy to steer what context the agent receives at each reasoning step. The most striking finding is cross-agent transfer: a context policy learned on one agent architecture improves a completely different agent on the same task domain. Context engineering, it turns out, is agent-agnostic. The information architecture generalizes even when the reasoning architecture doesn't.

The organizational implication: teams need people who understand both the domain data and the agent's information needs. In enterprise settings, the best context engineers tend to be systems thinkers with on-call experience, people who know what information matters because they've debugged these systems at 3 AM.

Four Patterns Already Proving Out in Production

Context engineering as a discipline is still formalizing, but several concrete patterns are already proving out in production:

Fresh subagent contexts per task. Instead of managing one long session, spawn specialized subagents with only the relevant context for each discrete unit of work. Each subagent operates in a clean context window and returns concise results to the orchestrator. This is the single most effective mitigation for context rot.
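A minimal sketch of the pattern, with the model call replaced by a plain function: each subagent receives only its task and the context selected for it, and the orchestrator keeps only a one-line result. The `worker` callable and all names here are hypothetical stand-ins.

```python
from typing import Callable

Worker = Callable[[list[str]], str]  # stand-in for a model call on a fresh context


def run_subagent(task: str, relevant_context: list[str], worker: Worker) -> str:
    # The context is built from scratch: nothing from the orchestrator's
    # history or from sibling tasks is carried over.
    context = [f"task: {task}", *relevant_context]
    return worker(context)


def orchestrate(tasks: dict[str, list[str]], worker: Worker) -> list[str]:
    summaries = []
    for task, relevant_context in tasks.items():
        result = run_subagent(task, relevant_context, worker)
        summaries.append(f"{task} -> {result}")  # only the concise result is kept
    return summaries


def fake_worker(context: list[str]) -> str:
    return f"done with {len(context)} context items"


summaries = orchestrate(
    {
        "fix login bug": ["auth.py excerpt", "failing test output"],
        "update changelog": ["CHANGELOG.md tail"],
    },
    fake_worker,
)
print(summaries)
```

The orchestrator's own context grows by one short line per task, regardless of how much material each subagent worked through.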

Hierarchical retrieval. Instead of retrieving full document chunks, search at the sentence level first (keyword and semantic), then selectively expand to full chunks only for high-confidence matches. The A-RAG framework demonstrates this: ~1,800 tokens per retrieval operation versus ~5,400 for naive chunk retrieval — a 3x reduction while maintaining or improving accuracy on multi-hop reasoning tasks.
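The two-stage idea can be sketched without any retrieval library. This is not the A-RAG implementation: it uses keyword overlap only (no semantic search), a naive sentence splitter, and an arbitrary 0.5 expansion threshold, all of which are assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(chunk: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]


def retrieve(query: str, chunks: list[str], expand_threshold: float = 0.5) -> list[str]:
    """Score sentences first; return a full chunk only for high-confidence matches."""
    query_terms = tokenize(query)
    if not query_terms:
        return []
    results = []
    for chunk in chunks:
        best_score, best_sentence = 0.0, ""
        for sentence in split_sentences(chunk):
            score = len(query_terms & tokenize(sentence)) / len(query_terms)
            if score > best_score:
                best_score, best_sentence = score, sentence
        if best_score >= expand_threshold:
            results.append(chunk)          # confident: expand to the whole chunk
        elif best_score > 0:
            results.append(best_sentence)  # weak match: return the sentence only
    return results


chunks = [
    "The cache uses LRU eviction. It holds 512 entries. Misses fall through to disk.",
    "Deploys run nightly. The cache is warmed after each deploy.",
]
results = retrieve("cache eviction policy", chunks)
print(results)
```

The first chunk matches two of the three query terms in one sentence and is returned whole; the second matches only one term, so a single sentence is returned instead of the chunk.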

Agent-constructed context. Instead of pre-building RAG indexes, give the agent search tools and let it construct its own context. The Claude Code team made this explicit design choice: a Grep tool instead of a pre-built index. As models get smarter, they increasingly excel at deciding what information they need and finding it themselves. Progressive disclosure, where each file points to deeper references, lets this scale without pre-indexing everything.
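As a sketch of what handing the agent a search tool can look like, here is a small regex search over a directory tree. It is a generic stand-in, not Claude Code's Grep tool; the `grep` function, its output format, and the `max_hits` cap are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path


def grep(root: Path, pattern: str, max_hits: int = 20) -> list[str]:
    """Return 'relative/path:line_number: text' for lines matching the pattern."""
    regex = re.compile(pattern)
    hits: list[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path.relative_to(root).as_posix()}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits  # cap the output so the tool cannot flood the context
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_text("def login(user):\n    return check(user)\n", encoding="utf-8")
    (root / "notes.txt").write_text("login flow is documented elsewhere\n", encoding="utf-8")
    hits = grep(root, r"def \w+\(")
    print(hits)
```

The cap on hits matters as much as the search itself: a tool that returns everything reintroduces the noise problem it was meant to avoid.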

Graph-based context injection. Pre-build a knowledge graph of the codebase or domain, then inject a compact graph report into the agent's context before tool use. One implementation achieves ~71x token compression compared to raw file crawling, and the agent starts every task with architectural awareness it would otherwise spend hundreds of tokens discovering.

Information Architecture Is the Durable Advantage

Models will commoditize. Context windows will grow. Inference will get cheaper. None of this changes the fundamental problem: attention is finite, noise accumulates, and the system that delivers the most relevant information to the model at decision time wins.

Recent work on learnable context policies suggests where this is heading. Agents trained with reinforcement learning have independently discovered strategies like proactive context summarization and selective deletion of stale records, behaviors the researchers didn't explicitly reward but that emerged from optimizing for task completion. Context management is a skill that can be learned, not just engineered.

The deeper insight is that context engineering means designing the system to act despite inevitable inaccuracies. The agent's world model is permanently misspecified; the environment is too complex to represent perfectly in any context window. The discipline isn't about perfecting the representation. It's about building systems robust enough to reason well with an imperfect one.

Context engineering is to prompt engineering what software engineering is to scripting: a discipline that scales. The teams building context infrastructure today — the retrieval pipelines, the memory tiers, the progressive disclosure architectures — are compounding an advantage that gets harder to replicate with each iteration. The window to start is while most teams are still optimizing their system prompts.