Context Engineering Deep Dive
Context engineering is the discipline of managing finite attention through four architectural strategies (Write, Select, Compress, Isolate) that prioritize quality over quantity and reversibility over convenience.
The promise was simple: give models longer context windows and they’ll handle more complex tasks. A million tokens should beat 128k, which should beat 32k. But practitioners discovered something counterintuitive—a focused 300-token context often outperforms an unfocused 113,000-token context. The problem isn’t capacity. It’s curation.
Context engineering is the art of managing a finite resource with diminishing returns. And the winning strategy isn’t bigger windows—it’s smarter architecture.
The Constraint: Why Context Engineering Exists
Andrej Karpathy framed it well: the context window is RAM, and context engineering is OS-level memory management. But here’s what makes this harder than traditional memory management—you’re not just fighting capacity limits, you’re fighting attention.
The quantitative reality is sobering. The NoLiMa benchmark found that 11 of 12 tested models dropped below 50% of their short-context performance at just 32,000 tokens. That’s not a soft degradation—it’s a cliff. At 128,000 tokens, most models hit what practitioners call the “rot threshold,” where summarization becomes necessary to maintain coherence. And despite vendors advertising million-token windows, the effective ceiling for most models sits around 256,000 tokens.
Stanford researchers quantified the “lost in the middle” phenomenon: 15-47% performance degradation as context length increases. Models attend to beginnings and endings, but the middle becomes a blur.
Why don’t bigger windows just solve this? Three reasons compound against you:
Quadratic attention cost. Transformer attention scales O(n²) with sequence length. Doubling your context quadruples compute cost. This isn’t just billing—it’s latency, and for agents making dozens of calls, latency compounds.
Attention dilution. Each token competes for the model’s limited attention budget. More context means attention spreads thinner. Critical information gets less focus when surrounded by marginally relevant content.
Distraction amplification. Research shows models with extended context “favor repeating actions from vast history rather than synthesizing novel plans.” Under limited capacity, more context equals more noise. The model over-indexes on what it’s seen before rather than reasoning about what it should do now.
The analogy to database optimization is instructive. Having all the data isn’t the same as having the right data indexed and ready. You wouldn’t query a billion-row table without indexes just because your server has enough RAM to hold it. The same principle applies here: the shift is from prompt engineering (“what to say”) to context engineering (“what the model sees, and in what order”).
The Four Strategies: A Pattern Language
A formal taxonomy emerged from recent research that structures the entire discipline. These four strategies—Write, Select, Compress, Isolate—form architectural primitives that compose together. Think of them as a vocabulary for treating context like a memory hierarchy.
Write Context persists information outside the window for later retrieval. This includes scratchpads (agents save notes within a session via tool calls) and memories that span sessions. The memory types mirror human cognition: episodic (few-shot examples), procedural (instructions and how-tos), and semantic (facts and knowledge). Implementation ranges from simple JSONL append-only logs to sophisticated structured state objects.
Select Context retrieves relevant information at the right moment. This is where RAG lives, but selection goes beyond document retrieval. Smart selection applies to tool descriptions, knowledge graphs, and any information that should appear dynamically rather than upfront. Windsurf’s insight is worth noting: “indexing code is not context retrieval.” Effective selection combines AST parsing, semantic chunking, grep/file search, knowledge graphs, and re-ranking. Each technique catches what others miss.
Compress Context reduces tokens while preserving signal. This splits into compaction (reversible operations like replacing file contents with file paths) and summarization (lossy LLM-generated condensation). The priority hierarchy matters: Raw > Compaction > Summarization. Only summarize when compaction yields insufficient space.
Isolate Context splits processing across separate spaces. Multi-agent architectures give each subagent a focused window. Sandboxing keeps token-heavy objects (images, audio, code execution results) outside LLM context. State objects selectively expose fields, letting different parts of your system see different views of the same data.
The computer architecture parallel makes these memorable: Write is disk persistence, Select is caching and fetching, Compress is compression algorithms, Isolate is process isolation. Just as operating systems optimize memory access patterns across registers, cache, RAM, and disk, context engineering optimizes information placement across these four strategies.
These aren’t independent choices—they compose. You might Write a scratchpad, Select relevant entries via semantic search, Compress old entries through summarization, and Isolate different agents’ access to different scratchpad sections. The art is knowing which combination fits your task.
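A tiny sketch of that composition, assuming a JSONL scratchpad on disk (Write) and naive keyword retrieval over it (Select); the file name and record fields are hypothetical, and Compress/Isolate would layer on top of this.

```python
import json
import time
from pathlib import Path

SCRATCHPAD = Path("scratchpad.jsonl")  # hypothetical location for the append-only log

def write_note(session_id: str, kind: str, text: str) -> None:
    """Write: append-only, reversible persistence outside the context window."""
    record = {"session": session_id, "kind": kind, "text": text, "ts": time.time()}
    with SCRATCHPAD.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def select_notes(session_id: str, query: str, limit: int = 5) -> list[dict]:
    """Select: pull only the entries relevant to the current step back into context."""
    if not SCRATCHPAD.exists():
        return []
    with SCRATCHPAD.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    hits = [r for r in records
            if r["session"] == session_id and query.lower() in r["text"].lower()]
    return hits[-limit:]  # newest matches; swap keyword match for embeddings in practice

write_note("s1", "semantic", "The billing service retries failed charges three times.")
print(select_notes("s1", "billing"))
```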
Compression: The Art of Reversibility
Compression is the most nuanced strategy because it’s where you lose information. The question isn’t whether to compress—long-running agents will exceed any context limit—but how to compress while preserving what matters.
The priority hierarchy deserves emphasis: Raw > Compaction > Summarization. This ordering reflects a preference for reversibility. Raw content can always be processed later. Compacted content (file paths instead of file contents) can be expanded if needed. Summarized content is permanently lossy—what the model decided to exclude is gone.
This maps cleanly to event sourcing principles. In event-driven systems, you maintain the ability to reconstruct state from the event log. You can always project a new view from events, but you can’t un-aggregate a snapshot. Summarization is aggregation. Use it when you must, not when it’s convenient.
Compaction patterns worth knowing:
The path-over-contents pattern: instead of including an entire file in context, include “Output saved to /src/main.py”. The file still exists. The model can request it if needed. You’ve traded tokens for an extra round trip.
Pruning old tool outputs: OpenCode protects the last 40k tokens while removing older tool results. The assumption is recent context matters more than historical context, and tool outputs are regenerable.
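A minimal sketch of both compaction moves, assuming messages are plain dicts and using a crude 4-characters-per-token estimate; the field names and the 40k protection budget mirror the description above but are otherwise hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token; use a real tokenizer in practice.
    return len(text) // 4

def compact_file_output(message: dict) -> dict:
    """Path-over-contents: keep a pointer instead of the file body (reversible)."""
    if message.get("type") == "tool_result" and "file_path" in message:
        return {**message, "content": f"Output saved to {message['file_path']}"}
    return message

def prune_old_tool_outputs(messages: list[dict], protect_tokens: int = 40_000) -> list[dict]:
    """Drop older tool results while keeping the most recent ~protect_tokens intact."""
    kept, budget = [], protect_tokens
    for msg in reversed(messages):  # walk newest to oldest
        if msg.get("type") == "tool_result" and budget <= 0:
            msg = {**msg, "content": "[older tool output pruned]"}
        budget -= estimate_tokens(msg.get("content", ""))
        kept.append(msg)
    return list(reversed(kept))
```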
Structured Plans over prose: Manus found that “moving from constantly-rewritten todo.md files to structured Plan objects eliminated ~30% of token consumption.” The todo.md pattern treats context as a document. The Plan object pattern treats it as structured data. Structured data compresses better because it has known fields and predictable format.
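The Manus write-up describes the shift but not a schema; a hypothetical Plan object might look like the sketch below, which serializes to a compact, fixed-shape payload instead of a constantly rewritten document.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class PlanStep:
    id: int
    description: str
    status: str = "pending"  # "pending" | "in_progress" | "done"

@dataclass
class Plan:
    goal: str
    steps: list[PlanStep] = field(default_factory=list)

    def render(self) -> str:
        # Known fields and compact JSON compress far better than free-form todo.md prose.
        return json.dumps(asdict(self), separators=(",", ":"))

plan = Plan(goal="Refactor auth module", steps=[PlanStep(1, "Map current call sites")])
print(plan.render())
```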
Compaction thresholds are actively debated:
Factory.ai’s two-threshold model uses Tmax (the “fill line” where compression begins) and Tretained (the “drain line” target after compression). Higher Tmax means richer context but 50% higher inference costs. Narrower gaps between thresholds mean more frequent compression overhead.
Claude Code’s auto-compact triggers at ~95% capacity, preserving recent user messages (~20k tokens) alongside the summary. There’s a known bug where it sometimes triggers at 8-12% instead, causing infinite compaction loops. User feedback suggests 85-90% is a better threshold—95% is “too late” because it leaves no buffer for the model’s response.
Some practitioners advocate for strategic manual checkpointing at 70% capacity, arguing it outperforms any automatic approach. Amp rejects automatic compaction entirely, offering “handoff” where users specify goals when context is cleared.
The right threshold depends on your workload. Predictable, repetitive tasks can run closer to limits because the summarization loss is less damaging. Exploratory, reasoning-heavy tasks need more headroom.
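To make the two-threshold idea concrete, here is a minimal sketch; the T_MAX and T_RETAINED constants follow Factory.ai's description, while the summarizer is a stand-in you would replace with an actual LLM call.

```python
T_MAX = 120_000       # "fill line": start compacting when total context crosses this
T_RETAINED = 60_000   # "drain line": how much raw history to keep after compaction

def total_tokens(messages: list[dict]) -> int:
    return sum(len(m.get("content", "")) // 4 for m in messages)  # rough estimate

def summarize(messages: list[dict]) -> dict:
    # Stand-in for a lossy LLM summarization call; runs only on the drained portion.
    joined = " ".join(m.get("content", "") for m in messages)
    return {"role": "system", "content": f"[summary of {len(messages)} messages] {joined[:500]}"}

def maybe_compact(messages: list[dict]) -> list[dict]:
    """Do nothing below the fill line; above it, drain old messages into a summary."""
    if total_tokens(messages) < T_MAX:
        return messages
    drained, retained = [], list(messages)
    # Move the oldest messages into the summary bucket until the rest fits the drain target.
    while retained and total_tokens(retained) > T_RETAINED:
        drained.append(retained.pop(0))
    return [summarize(drained)] + retained
```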
When summarization degrades performance:
The rule of thumb is “not exceeding 80% compression”—don’t try to compress a 10k token conversation into 2k tokens. At some ratio, you lose too much signal. But empirical data on optimal ratios for different task types remains thin. This is a research gap the field is still working through.
The garbage collection analogy is apt: you want incremental collection at idle moments, not stop-the-world GC when memory is exhausted. Proactive, continuous compaction at natural task breakpoints beats reactive compression at 95% capacity.
State Objects: Structure Creates Predictability
OpenAI’s Agents SDK team makes a strong claim: for continuity-dependent tasks, state-based memory beats retrieval-based memory. Their argument is that retrieval systems are “brittle to phrasing variations, prone to missing contextual overrides, and unable to reconcile conflicting updates.” Travel decisions depend on continuity, priorities, and evolving preferences—not ad-hoc search.
This is a position worth evaluating critically. Retrieval-based approaches scale well and adapt to unstructured information. State-based approaches provide consistency guarantees but require upfront schema design. The trade-off is setup complexity for runtime reliability.
The emerging pattern is a three-zone architecture:
Pinned State (Hot Memory) is always in context. These are structured fields with clear precedence: identity, core profile, current session goals. Updates happen via explicit tool calls. This zone is treated as authoritative—it represents what the agent definitively knows about the user and task.
Recent Window (Working Memory) preserves the last N messages or ~20k tokens of raw conversation. This maintains “rhythm” and formatting patterns that summarization destroys. During compaction, this zone is protected. Losing recent context disrupts conversational flow more than losing historical context.
Cold Storage (Long-term Memory) persists outside context and retrieves on demand. Cross-session notes with timestamps and keywords live here. Retrieval is semantic or keyword-based. This zone can scale indefinitely because it doesn’t consume context tokens until queried.
The DDD parallel is direct: hot memory is the aggregate root, cold storage is the repository. The aggregate root maintains invariants and handles commands. The repository retrieves aggregates when needed. You wouldn’t load your entire database into memory to process a single command—you load the relevant aggregate.
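A generic sketch of the three zones, with illustrative field names rather than any vendor's schema: only the pinned and recent zones are rendered into the prompt, while cold storage stays outside the window until queried.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    # Pinned state (hot): always rendered, updated only via explicit tool calls.
    profile: dict = field(default_factory=dict)
    session_goals: list[str] = field(default_factory=list)

    # Recent window (working): last N raw messages, protected during compaction.
    recent_messages: list[dict] = field(default_factory=list)
    recent_limit: int = 30

    # Cold storage (long-term): never rendered wholesale; retrieved on demand.
    long_term_notes: list[dict] = field(default_factory=list)

    def add_message(self, message: dict) -> None:
        self.recent_messages.append(message)
        self.recent_messages = self.recent_messages[-self.recent_limit:]

    def render_prompt_context(self) -> str:
        """Only hot + working memory consume context tokens."""
        pinned = f"profile: {self.profile}\ngoals: {self.session_goals}"
        recent = "\n".join(f"{m['role']}: {m['content']}" for m in self.recent_messages)
        return f"{pinned}\n---\n{recent}"

    def recall(self, keyword: str) -> list[dict]:
        """Naive keyword retrieval from cold storage; swap in embeddings as needed."""
        return [n for n in self.long_term_notes if keyword.lower() in n["text"].lower()]
```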
The memory lifecycle structures how information flows:
- Injection: Render state as YAML frontmatter plus Markdown notes into the system prompt. This is where the agent sees its current knowledge.
- Distillation: During execution, capture candidate memories via tool calls. The agent observes something worth remembering and calls a tool to save it.
- Trimming: When conversation history is trimmed, preserve short-term context. The trimming operation shouldn’t erase what the agent just learned.
- Consolidation: Merge session notes into global memory with deduplication. This happens between sessions or at explicit checkpoints.
The precedence hierarchy is critical for conflict resolution:
- Current user input (highest priority)
- Session-scoped overrides
- Global defaults (lowest priority)
This prevents stale memories from overriding explicit user intent. If a user said “I prefer window seats” six months ago but just said “give me an aisle,” the current input wins. Simple rule, but it requires structured state to enforce. Retrieval systems struggle with this because they return all matching memories without clear precedence.
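A minimal sketch of precedence-based resolution for a single preference key, assuming the three layers are plain dictionaries; the aisle-versus-window example above maps directly onto it.

```python
def resolve_preference(key: str,
                       current_input: dict,
                       session_overrides: dict,
                       global_defaults: dict):
    """Current input beats session overrides, which beat global defaults."""
    for layer in (current_input, session_overrides, global_defaults):
        if key in layer:
            return layer[key]
    return None

# "I prefer window seats" was saved months ago; "give me an aisle" is in the current turn.
seat = resolve_preference(
    "seat",
    current_input={"seat": "aisle"},
    session_overrides={},
    global_defaults={"seat": "window"},
)
print(seat)  # -> "aisle": stale memory does not override explicit intent
```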
Consolidation operations keep long-term memory clean:
- Deduplication: Merge semantically equivalent memories. “Likes Italian food” and “Prefers Italian restaurants” become one entry.
- Conflict resolution: Recency wins. The most recent last_update_date prevails.
- Forgetting: Prune explicitly session-scoped notes. Statements containing “this trip” or “this time” are temporary by intent.
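A rough sketch of these three consolidation passes, assuming notes are dicts with text and last_update_date fields; the word-overlap check is a deliberately loose stand-in for real embedding-based similarity.

```python
from datetime import date

def _similar(a: str, b: str, threshold: float = 0.2) -> bool:
    # Naive word-overlap stand-in for semantic similarity (use embeddings in practice);
    # the threshold is intentionally loose for illustration.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1) >= threshold

def consolidate(notes: list[dict]) -> list[dict]:
    # Forgetting: drop notes that are explicitly session-scoped.
    notes = [n for n in notes if "this trip" not in n["text"].lower()
             and "this time" not in n["text"].lower()]
    # Deduplication + conflict resolution: keep one note per cluster, most recent wins.
    merged: list[dict] = []
    for note in sorted(notes, key=lambda n: n["last_update_date"], reverse=True):
        if not any(_similar(note["text"], kept["text"]) for kept in merged):
            merged.append(note)
    return merged

notes = [
    {"text": "Likes Italian food", "last_update_date": date(2024, 1, 5)},
    {"text": "Prefers Italian restaurants", "last_update_date": date(2024, 6, 1)},
    {"text": "Wants an aisle seat this trip", "last_update_date": date(2024, 6, 2)},
]
print(consolidate(notes))  # only the newer Italian-food note survives; the session-scoped note is forgotten
```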
The OpenAI travel agent implementation demonstrates this concretely: a TravelState object with profile (pinned), global_memory (cold), session_memory (hot/working), and trip_history (cold). Each zone has different update patterns and retrieval semantics.
Multi-Agent Isolation: Context as Architecture
Multi-agent systems offer a different approach to context management: instead of cramming everything into one window, split the work across specialized agents with focused contexts. LangChain’s framing captures it: “At the center of multi-agent design is context engineering—deciding what information each agent sees.”
Three primary architectures emerge from the Strands Agents framework:
The Graph Pattern uses developer-defined flowcharts where the agent decides which path to follow. Context is shared via full transcript history in shared state—agents have complete context. This pattern excels at conditional logic, branching, and deterministic workflows. Cycles are allowed. Think orchestration: you’re the conductor, the agents follow the score.
The Swarm Pattern creates dynamic, collaborative teams with autonomous handoffs. Context is shared, containing the request, task history, and previous contributions. Agents select the next specialist via a handoff_to_agent tool. This pattern suits exploration, specialized perspectives, and collaborative problem-solving. Cycles are allowed. Think choreography: agents coordinate through shared conventions without central control.
The Workflow Pattern (DAG) executes a pre-defined task graph as a single tool. Context is task-specific—only relevant dependency outputs pass forward. This pattern handles repeatable, complex operations with fixed dependencies. No cycles allowed. Think pipelines: each stage receives what it needs from upstream stages.
The microservices parallel is instructive. Subagents are like isolated microservices with clean APIs—they don’t share state, they communicate through defined interfaces. Swarms are like event-driven choreography—agents react to events and produce events without centralized coordination. Graphs are like orchestration—a coordinator manages the flow.
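A framework-free sketch of the Workflow (DAG) pattern: each task receives only the outputs of its declared dependencies, nothing else. The task names and the run_agent stand-in are hypothetical, not any particular framework's API.

```python
from graphlib import TopologicalSorter

def run_agent(task_name: str, inputs: dict[str, str]) -> str:
    # Stand-in for an LLM call: each task sees only its dependencies' outputs.
    return f"{task_name} result (given: {sorted(inputs)})"

# Task -> set of upstream dependencies (no cycles allowed).
dag = {
    "gather_requirements": set(),
    "draft_design": {"gather_requirements"},
    "review_design": {"draft_design"},
    "write_summary": {"gather_requirements", "review_design"},
}

def run_workflow(dag: dict[str, set[str]]) -> dict[str, str]:
    results: dict[str, str] = {}
    for task in TopologicalSorter(dag).static_order():
        upstream = {dep: results[dep] for dep in dag[task]}  # context isolation per task
        results[task] = run_agent(task, upstream)
    return results

print(run_workflow(dag)["write_summary"])
```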
Context isolation mechanisms prevent context pollution:
- invocation_state: A dictionary passed to all agents without exposing data to the LLM. Configuration sharing while maintaining clean prompts.
- Private message histories: Custom state schemas let each agent maintain isolated conversation history. Agent A’s mistakes don’t pollute Agent B’s context.
- Selective field exposure: State objects expose only relevant fields to each agent. The billing agent sees payment info; the recommendation agent sees preferences.
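Selective field exposure can be as simple as a per-agent allow-list projected over a shared state dictionary; the field names and agent roles below are illustrative.

```python
SHARED_STATE = {
    "user_id": "u-123",
    "payment_method": "visa **** 4242",
    "preferences": {"cuisine": "italian", "budget": "medium"},
    "conversation_summary": "User is planning a weekend trip.",
}

# Each agent declares which fields it is allowed to see.
AGENT_VIEWS = {
    "billing_agent": {"user_id", "payment_method"},
    "recommendation_agent": {"user_id", "preferences", "conversation_summary"},
}

def view_for(agent: str, state: dict) -> dict:
    """Project the shared state down to the fields this agent needs."""
    allowed = AGENT_VIEWS.get(agent, set())
    return {k: v for k, v in state.items() if k in allowed}

print(view_for("recommendation_agent", SHARED_STATE))  # no payment info leaks into its context
```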
Handoff patterns determine what context transfers between agents:
LangGraph Swarm handoffs transfer control by issuing a ‘Command’ that updates shared state, switching ‘active_agent’ and passing context. The default hands off the full conversation with notification. Custom handoffs can filter context, add instructions, or rename actions to influence LLM behavior.
The filtering question is crucial: does the next agent need the full conversation, or just a summary of the task and relevant findings? Full transfer preserves context but carries any accumulated confusion. Filtered transfer is cleaner but loses nuance.
The token cost trade-off is significant:
Multi-agent architectures can cost up to 15x more tokens than single-agent chat (per Anthropic benchmarks). Each subagent call resets context, repeating system prompts and core instructions. Stateful patterns (handoffs, skills) save 40-50% of calls on repeat requests because context persists. Subagents maintain consistent cost per request—the isolation has a price.
The choice depends on task characteristics. Multi-agent shines for multi-domain problems requiring specialized knowledge, parallel execution of independent subtasks, and tasks where isolation prevents cross-contamination. Single-agent is cheaper and simpler when tasks are coherent and sequential.
Anthropic’s finding that single-agent approaches underperform multi-agent on complex tasks comes with a caveat: multi-agent requires sophisticated prompt engineering for sub-task coordination. You’re trading context engineering complexity for coordination engineering complexity.
Just-in-Time: Progressive Disclosure as Scalability Pattern
The problem with eager loading is quantifiable. CLAUDE.md files often grow to consume half the context budget before any work begins. MCP implementations can load tens of thousands of tokens of tool descriptions upfront. All else being equal, an LLM performs better when its context is full of focused, relevant content—not boilerplate it might never need.
The progressive disclosure pattern from Anthropic’s Agent Skills architecture creates a three-tier information hierarchy:
First Level (Metadata): ~100 tokens of name and description loaded at startup. The agent knows skills exist and what they’re for, but not how to use them.
Second Level (Core Content): Full SKILL.md loaded when the agent determines relevance. The skill description matches the user’s intent, so the agent pulls in detailed instructions.
Third Level+ (Granular Details): Additional bundled files loaded only when specific scenarios require them. A skill might reference MODULE.md files that load on demand.
This mirrors how humans consult documentation—summary to detail as needed. You don’t memorize the entire API reference; you know where to look and drill down when you need specifics.
The implementation works like OS page swapping:
- Initial state: Agent reads names and descriptions of all skills (~100 tokens each)
- Semantic routing: Agent matches the user prompt against skill descriptions
- On match: Loads detailed SKILL.md into context
- Progressive loading: SKILL.md can reference files that load on demand
The lazy evaluation parallel from functional programming is direct: defer computation (or loading) until actually needed. Eager evaluation is simpler but wasteful when most paths aren’t taken.
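A minimal sketch of metadata-first loading, assuming skills live in directories containing a SKILL.md whose first line serves as the description (a simplification of real skill metadata); the keyword match stands in for semantic routing.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # hypothetical layout: skills/<name>/SKILL.md

def load_skill_index() -> list[dict]:
    """Tier 1: only names and one-line descriptions enter context at startup."""
    index = []
    for skill_md in sorted(SKILLS_DIR.glob("*/SKILL.md")):
        lines = skill_md.read_text(encoding="utf-8").splitlines()
        index.append({"name": skill_md.parent.name,
                      "description": lines[0] if lines else ""})
    return index

def route(prompt: str, index: list[dict]) -> dict | None:
    """Tier 2: load the full SKILL.md only when the prompt matches a description."""
    for entry in index:
        terms = entry["description"].lower().split() + [entry["name"].lower()]
        if any(term in prompt.lower() for term in terms if len(term) > 3):
            body = (SKILLS_DIR / entry["name"] / "SKILL.md").read_text(encoding="utf-8")
            return {**entry, "content": body}
    return None  # nothing matched: no detailed instructions enter context
```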
CLAUDE.md best practices apply this pattern:
- Keep the base minimal (~50 lines, ~500 tokens): project overview, essential commands, stack information, pointers to documentation.
- Move task-specific instructions to a /docs folder: gotchas and non-obvious behaviors, testing strategies, architecture decisions. These load when the agent works on relevant tasks.
- Delegate enforcement to tools: “If a tool can enforce it, don’t write prose about it.” ESLint handles style, TypeScript enforces types, Prettier formats. This creates backpressure—automated feedback for self-correction without consuming context.
The token savings are dramatic. MCP’s Tool Search pattern reduces context from ~134k to ~5k tokens by loading tool descriptions on demand rather than all upfront.
The hybrid trade-off: eager loading is faster for known, stable workflows because there’s no discovery latency. JIT loading is more efficient but adds latency when the agent must identify and load the right skill. Claude Code takes a hybrid approach—drop CLAUDE.md files upfront for speed, use grep/glob for autonomous exploration. Eager loading works better for less dynamic content, where the upfront cost is worth the responsiveness.
The reliability question is whether semantic routing works. Skills depend on good naming and descriptions to trigger loading. Poor metadata means missed skills or wrong skills loaded. Your skill descriptions are effectively an API for the agent to use, and like any API, clarity matters.
When Context Fails: Defensive Engineering
Failure modes in context engineering are not edge cases. A study analyzing 1,642 multi-agent system execution traces found failure rates of 41-86.7% across 7 state-of-the-art systems. Nearly half the runs failed. This isn’t a reason to avoid the technology—it’s a reason to build defensively.
Drew Breunig’s taxonomy identifies four failure types:
Context Poisoning embeds a hallucination or error that gets repeatedly referenced. The initial hallucination isn’t the real problem—it’s the cascade it triggers.
The phantom SKU example illustrates cascade effects: a hallucinated SKU corrupts pricing at step 6, triggers inventory checks at step 9, generates shipping labels at step 12, sends confirmations at step 15. By detection, four systems are poisoned. Incident response costs 10x what it would have if caught early.
The Gemini Pokemon case is memorable: the model hallucinated game state and became “fixated on achieving impossible or irrelevant goals.” Once the false state entered context, subsequent reasoning built on the lie.
Context Distraction causes models to over-focus on extended history rather than synthesizing novel plans. The model repeats past actions instead of reasoning about new situations. This emerges around 32k tokens for some models—they’ve seen so much history that they default to pattern-matching rather than problem-solving.
Context Confusion arises from superfluous content causing incorporation of irrelevant information. Research shows models perform worse with multiple tools. The Berkeley Function-Calling Leaderboard demonstrates that performance declines as tool count increases. Llama 3.1 8B failed with 46 tools despite adequate context window; it succeeded with 19 tools. More options meant more confusion.
Context Clash emerges from conflicting information within accumulated context. Early incorrect attempts remain present and influence later reasoning. The model can’t distinguish “I tried this and it was wrong” from “this is a valid approach.” The o3 benchmark showed score drops from 98.1 to 64.1 when information was spread across conversation turns—the model couldn’t reconcile fragmented context.
Agent hallucinations escalate risk beyond chatbot hallucinations. They involve “physically consequential” errors where incorrect actions affect task execution, system devices, and user experiences in the real world. A chatbot hallucination wastes time. An agent hallucination might ship the wrong product, delete the wrong file, or send the wrong email.
Hallucinations “may arise during intermediate processes such as perception and reasoning, where they can propagate and accumulate over time.” Each step that builds on a hallucination makes the error harder to detect and correct.
Mitigation strategies the field has converged on:
- Ensemble verification: Run the same step through multiple models, require consensus before acting. Expensive, but effective for high-stakes decisions.
- Uncertainty estimation: Measure model confidence, pause execution below threshold. Let humans review when the model isn’t sure.
- LLM-as-a-Judge pipelines: Audit each intermediate result with a separate model call. The judge model isn’t perfect, but it catches errors the primary model misses.
- Dynamic tool loading: Load tools on demand, not all upfront. Fewer tools in context means less confusion.
- Context quarantine: Separate untrusted content from trusted state. User input goes through validation before reaching core state objects.
- Smaller, focused context windows: When possible, reduce context to reduce failure surface. A subagent with 8k tokens of focused context is less likely to hallucinate than a monolithic agent with 128k tokens of everything.
The mindset shift is from “prevent all failures” to “manage uncertainty.” The field has moved from “chasing impossible zero hallucinations” to building circuit breakers and verification layers. Assume poisoning will happen. Build systems that detect and recover.
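One of the cheaper circuit breakers, an uncertainty gate combined with a judge pass, can be sketched as a wrapper around whatever step function you already have; generate_step, judge, and the threshold are stand-ins, not a real library API.

```python
CONFIDENCE_THRESHOLD = 0.7

def generate_step(task: str) -> tuple[str, float]:
    # Stand-in for the primary model call, returning (proposed action, self-reported confidence).
    return f"proposed action for: {task}", 0.62

def judge(task: str, action: str) -> bool:
    # Stand-in for an LLM-as-a-Judge call auditing the intermediate result.
    return "proposed action" in action

def guarded_step(task: str) -> dict:
    action, confidence = generate_step(task)
    if confidence < CONFIDENCE_THRESHOLD:
        return {"status": "needs_human_review", "action": action, "confidence": confidence}
    if not judge(task, action):
        return {"status": "rejected_by_judge", "action": action}
    return {"status": "approved", "action": action}

print(guarded_step("cancel order #1234"))  # low confidence -> paused for human review
```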
Synthesis: Putting It Together
RAG versus Long Context is a false dichotomy. Most production systems use both because they address different needs:
- RAG offers 1250x lower cost per query at scale
- Long Context provides coherence over extended documents
- RAG excels at dynamic, user-specific information
- Long Context excels at full-document analysis and synthesis
Cost-sensitive applications favor RAG. Coherence-sensitive applications favor Long Context. Most interesting applications need both.
Chain-of-Agents represents a promising hybrid: multiple workers process different text segments sequentially, passing information through natural language. A manager agent synthesizes the final answer. The approach outperformed RAG across all 8 tested datasets, showed up to 10% improvement over long-context models, achieved ~100% improvement on BookSum when inputs exceeded 400k tokens, and reduces time complexity from n² to nk.
The pattern works because it isolates context while preserving information flow. Each worker sees a manageable chunk. The manager synthesizes without seeing everything.
A decision framework for choosing patterns:
- What information changes? Dynamic data favors RAG; static data tolerates Long Context.
- How often does it change? Frequent updates favor RAG; infrequent updates can use eager loading.
- What requires continuity? Continuity-dependent tasks favor state objects over retrieval.
- What can be isolated? Independent subtasks benefit from multi-agent isolation.
The principle that guides everything: quality trumps quantity. A focused 300-token context often outperforms an unfocused 113,000-token context. Most production use cases work well within 8k-32k tokens when context is curated properly. Larger windows are useful for full-document analysis, but not required for most retrieval or conversational tasks.
Where the field still debates:
- Optimal compression ratios for different task types
- The right auto-compact threshold (70%? 85%? 95%? Manual?)
- How findings generalize across models (most research is GPT-4 or Claude specific)
- Detection and recovery from context poisoning once it occurs
- Systematic guidance on tool set design (the “~20 core atomic tools” heuristic lacks rigor)
Context engineering isn’t prompt engineering with extra steps—it’s architecture. The patterns parallel memory hierarchies, microservices, and distributed systems. If you’ve built systems that manage state, handle partial failures, and coordinate components, you have transferable intuitions. Apply them.
References
Key Sources
- Anthropic Engineering - Effective Context Engineering: The canonical industry guide. Establishes the four strategies, explains compaction philosophy, covers sub-agent patterns. Start here.
- LangChain Blog - Context Engineering: Framework maintainer perspective with the Karpathy framing. Good on failure modes and memory architectures.
- OpenAI Cookbook - Context Personalization: Complete state object implementation with code. The travel agent example is directly implementable.
- Factory.ai - Compressing Context: The Tmax/Tretained two-threshold model. Practical implementation details and failure mode analysis.
- Drew Breunig - How Contexts Fail: Taxonomy of context failures with concrete examples. Essential for defensive engineering.
- Google Research - Chain of Agents: The hybrid approach showing 100% improvement on BookSum. Research-grade but points toward production patterns.
- Anthropic Engineering - Agent Skills: Progressive disclosure architecture from the source. The three-tier hierarchy explained.
Further Reading
- Phil Schmid - Context Engineering Part 2: The Manus learnings on structured Plans. Practical compaction hierarchy.
- GitHub Gist - Compaction Research: Comparative analysis of Claude Code, Codex CLI, OpenCode, and Amp compaction strategies.
- Strands Agents - Multi-Agent Patterns: The three architectures (Graph, Swarm, Workflow) explained with implementation details.
- Pinecone - RAG 2025: Why RAG isn’t dead. Cost analysis and complementary use patterns.
- alexop.dev - Stop Bloating Your CLAUDE.md: Practical restructuring guide for progressive disclosure.
- arXiv Survey 2507.13334: Formal taxonomy establishing context engineering as a discipline beyond prompt design.
- arXiv - Why Multi-Agent LLM Systems Fail: MAST taxonomy from the 1,642 execution trace study. Where multi-agent breaks.
- arXiv - LLM Agents Suffer from Hallucinations: Agent hallucination survey with mitigation strategies.
Actions You Can Take
- Audit your current context budget: Measure how many tokens your system prompt, tools, and CLAUDE.md consume before any task work begins. Target <20% for upfront context.
- Implement the compaction priority hierarchy: Replace raw file contents with file paths where possible. Only summarize when compaction is insufficient. Measure token savings.
- Design a state object schema: For any agent handling continuity-dependent tasks, define pinned (hot), recent (working), and cold zones. Implement precedence rules for conflict resolution.
- Restructure CLAUDE.md for progressive disclosure: Move task-specific instructions to a /docs folder. Keep the base under 50 lines. Let tools enforce what prose currently describes.
- Profile your tool count: If using more than 20 tools, measure whether performance degrades. Consider dynamic tool loading or domain-specific tool subsets.
- Set up context failure monitoring: Log when compaction triggers, what gets summarized, and whether agents repeat failed approaches. These are early indicators of context problems.
- Test your compression thresholds: If using auto-compact, experiment with triggering at 70%, 85%, and 95%. Measure task completion rates and output quality at each threshold.
Decision Point
Status: Briefing complete, ready for publication.
Summary: Context engineering is the discipline of managing finite attention through four architectural strategies (Write, Select, Compress, Isolate) that prioritize quality over quantity and reversibility over convenience. The winning approach treats the context window like a memory hierarchy, placing the right information at the right level at the right time.
To proceed: Approve to publish, or request revisions.