Context Management

The Fundamental Constraint

Every agent operates within a finite context window. This is not one limitation among many — it is the single most important technical constraint in agentic AI. Every design decision in a production agent system either respects this constraint or eventually fails because of it.

Context windows range from 128K to 1M+ tokens depending on the model, but raw capacity is misleading. What matters is effective context — the portion the model can reliably attend to and reason over. As context fills, effective context shrinks: recall of material buried in the middle of the prompt weakens, instruction-following degrades, and reasoning over widely separated facts becomes unreliable.

An agent that does not actively manage its context will degrade from capable to unreliable to broken — silently. There is no error message when an agent starts ignoring instructions buried 80K tokens back in the conversation.

Universal Thresholds

Regardless of platform, the same utilization bands apply. These reflect observed degradation patterns across models and agent frameworks.

| Context % | Action | Description |
|---|---|---|
| 0-60% | Work Freely | Agent has ample room. Full instructions, tool results, and conversation history fit comfortably. No management overhead needed. |
| 50-70% | Monitor | Begin tracking consumption. Log context size per turn. Identify which components are growing fastest. |
| 70-80% | Compact | Actively summarize conversation history. Preserve key decisions, file paths, and current task state. Drop verbose tool output. |
| 80%+ | Clear/Reset | Mandatory context reset. Serialize critical state to external storage, start a fresh context, and reload only what is needed for the next sub-task. |

The overlap between bands is intentional. The transition from “monitor” to “compact” depends on how much remaining work is expected and whether tool results are likely to be large.
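The banded policy above can be sketched as a small dispatcher. This is illustrative, not any framework's API; the `big_results_expected` flag stands in for whatever signal your agent has about upcoming tool output:

```python
def context_action(used: int, window: int, big_results_expected: bool = False) -> str:
    """Map context utilization to the management action for that band.

    The 50-70% overlap between "monitor" and "compact" is resolved by
    whether large tool results are expected on the next turn.
    """
    pct = 100 * used / window
    if pct >= 80:
        return "clear"      # mandatory reset: serialize state, start fresh
    if pct >= 70:
        return "compact"    # summarize history, drop verbose tool output
    if pct >= 50:
        # monitor band; compact early if big tool results are coming
        return "compact" if big_results_expected else "monitor"
    return "work"           # ample headroom, no management overhead
```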

Context Budget Per Component

A well-designed agent allocates its context window deliberately. Letting any single component grow unchecked starves the others. The following budget is a starting point for systems using 128K-200K token windows:

| Component | Budget | Notes |
|---|---|---|
| System instructions | ~10-15% | Stable across turns. Prime candidate for caching. |
| Active conversation | ~30-40% | The back-and-forth between user and agent. Grows fastest. |
| Tool results | ~20-30% | File contents, search results, command output. Often the largest single-turn contributor. |
| Memory / retrieved context | ~10-20% | RAG results, session memory, persistent facts. |
| Safety margin | ~10-15% | Reserved headroom for the model's output and unexpected spikes. |

```mermaid
graph LR
    subgraph ctx ["Context Window"]
        direction TB
        S["System Instructions<br/>10-15%"]
        C["Active Conversation<br/>30-40%"]
        T["Tool Results<br/>20-30%"]
        M["Memory / RAG<br/>10-20%"]
        R["Safety Margin<br/>10-15%"]
    end
    S --- C --- T --- M --- R

    subgraph actions ["At Threshold"]
        A1["0-60%: Work freely"]
        A2["70-80%: Compact"]
        A3["80%+: Clear / Reset"]
    end
    ctx -.-> actions
```
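A minimal sketch of carving up a window along these lines, taking the midpoint of each range from the table (the exact fractions are a starting point to tune per workload, as noted above):

```python
def allocate_budget(window: int) -> dict[str, int]:
    """Split a context window into per-component token budgets.

    Fractions are the midpoints of the ranges in the budget table.
    """
    fractions = {
        "system_instructions": 0.125,  # ~10-15%
        "conversation": 0.35,          # ~30-40%
        "tool_results": 0.25,          # ~20-30%
        "memory": 0.15,                # ~10-20%
        "safety_margin": 0.125,        # ~10-15%
    }
    return {name: round(window * f) for name, f in fractions.items()}
```

Because the fractions sum to 1.0, the allocations sum back to the full window; a real agent would enforce each budget when appending to that component.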

The Compaction vs. Caching Trade-off

Modern inference providers cache repeated prompt prefixes — if the first N tokens of a request match a previous request exactly, those tokens are served from cache at reduced cost and latency. This is valuable for agents because the system prompt is identical across every turn. However, compaction breaks the cache: when an agent compacts its conversation history, the summarized content differs from the original, and every token after the divergence point becomes a cache miss.

The solution is to place the system prompt first as a stable prefix (with explicit cache control breakpoints at its end), and treat everything after that boundary — memory, conversation, tool results — as dynamic content that can be compacted freely without cache impact. Cached input tokens are typically priced at 10% of uncached tokens, so preserving this boundary compounds to substantial savings over hundreds of turns.

Key insight: Separate stable content (system prompt) from dynamic content (conversation, tool results, memory). Only compact the dynamic portion.
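The boundary can be enforced mechanically: emit the system prompt as an untouched prefix and let compaction rewrite only what follows it. A sketch, with `summarize` as a hypothetical helper standing in for a call to a cheap summarization model:

```python
def summarize(turns: list[dict]) -> str:
    # Placeholder: a real agent would call a cheap model to condense these.
    return f"[summary of {len(turns)} earlier turns]"

def build_request(system_prompt: str, history: list[dict],
                  compact: bool = False, keep_last: int = 4) -> dict:
    """Assemble a request whose cacheable prefix never changes.

    The system prompt is emitted first and never rewritten, so the
    provider's prefix cache stays valid across turns; compaction only
    rewrites the dynamic tail (conversation, tool results).
    """
    if compact and len(history) > keep_last:
        older, recent = history[:-keep_last], history[-keep_last:]
        history = [{"role": "user", "content": summarize(older)}] + recent
    return {"system": system_prompt, "messages": history}
```

Note that compaction keeps the last `keep_last` turns verbatim — recent context is usually the most load-bearing — while everything older collapses into one summary message.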

Advanced Strategies

| Strategy | Mechanism | When to Use |
|---|---|---|
| Hierarchical Memory | L1: working memory (current context). L2: session memory (compacted summaries). L3: persistent DB (vector store, structured logs). Promote/demote between tiers based on recency and relevance. | Long-running agents with recurring tasks. |
| Artifact-Driven | Large data (file contents, search results, build output) stored externally. Only lightweight references kept in context: file paths, line ranges, content hashes. Agent re-fetches on demand. | Agents that process many files or large outputs. |
| Plan Caching (APC) | Reuse plan templates for recurring task patterns. Cache the decomposition, not the execution. Reported 50% cost reduction in benchmarks. | Agents handling repetitive task types (e.g., code review, migration). |
| AgentFold | Two-phase condensation: (1) granular condensation removes redundant/obsolete information per-turn; (2) deep consolidation periodically restructures the entire compressed history. | High-step-count agents exceeding 100+ turns. |
| Checkpoint Summarization | Completed sub-tasks are replaced with structured summaries: inputs, outputs, decisions made, artifacts produced. The full execution trace is discarded. | Multi-phase workflows with clear sub-task boundaries. |
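Checkpoint summarization, for instance, can be as simple as a structured record that replaces the trace. The field extraction below is illustrative — in production the model itself would write the summary rather than a mechanical filter:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """Structured summary that replaces a completed sub-task's full trace."""
    task: str
    inputs: list[str]
    outputs: list[str]
    decisions: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)  # file paths, hashes

def checkpoint_subtask(task: str, trace: list[dict]) -> Checkpoint:
    """Condense an execution trace; the full trace is then discarded."""
    return Checkpoint(
        task=task,
        inputs=[t["content"] for t in trace if t.get("role") == "user"],
        outputs=[t["content"] for t in trace if t.get("role") == "assistant"],
        artifacts=[t["path"] for t in trace if "path" in t],
    )
```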

Cross-Platform Implementation

| Platform | Mechanism |
|---|---|
| Claude Code | Manual /compact and /clear commands. Automated compaction via CLAUDE_AUTOCOMPACT_PCT_OVERRIDE environment variable (default ~80%). --resume flag reloads a previous session's compacted state for multi-session workflows. Context instructions can be placed in CLAUDE.md. |
| OpenAI Codex | Server-side compaction with configurable thresholds and strategy (summarize, truncate, sliding_window). Options to preserve system instructions and keep last N turns verbatim. Developer does not control exact summarization — simplifies implementation but reduces control over what survives compaction. |
| Gemini CLI | Up to 1M tokens shifts strategy toward monitoring rather than aggressive compaction. Cost still scales with context size (500K costs 5x of 100K per turn), so compaction remains valuable for cost control even when the window is not yet full. Token usage available via usage_metadata after each turn. |
| LangGraph | Explicit graph-based primitives: a context monitor node checks utilization and conditionally routes to a compaction node. Supports sliding-window + summary compaction, external artifact storage with in-context references, and checkpoint-based persistence via InMemorySaver or other backends. |
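The monitor-then-route pattern in the last row reduces to a conditional-edge function. A framework-agnostic sketch — the state keys here are assumptions for illustration, not LangGraph's actual state schema:

```python
def route_after_turn(state: dict, threshold: float = 0.8) -> str:
    """Conditional edge: send the graph to compaction when over threshold.

    Returns the name of the next node: "compact" to summarize history,
    or "agent" to continue working. In LangGraph this function would be
    registered via add_conditional_edges on the monitor node.
    """
    utilization = state["tokens_used"] / state["window"]
    return "compact" if utilization >= threshold else "agent"
```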

The Performance Cliff

Agent performance does not degrade gracefully — it falls off a cliff. After approximately 35 minutes of continuous operation, success rates on coding tasks drop sharply regardless of model. Extended sessions with many tool calls (100+) show similar degradation as context fills. The curve is non-linear: an agent at 60% context utilization may perform at 95% of peak, while at 85% it may perform at only 60% of peak.

The length of tasks agents can complete doubles roughly every 7 months as agents handle increasingly complex multi-file changes (METR research), so context demands grow on the same curve — longer tasks mean more accumulated tokens, independent of raw model capability.

Why Sub-Task Decomposition Is Mandatory

The performance cliff makes a strong architectural argument: no single agent invocation should run long enough to hit the cliff. Long-horizon work must be decomposed into sub-tasks, each completing well within the safe utilization zone. The orchestrator passes forward only checkpoint summaries from previous sub-tasks — not full execution traces — keeping each invocation in the 0-60% band where performance is highest.
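A sketch of that orchestration loop; `run_agent` is a hypothetical callable that executes one sub-task in a fresh context and returns its checkpoint summary:

```python
def run_decomposed(subtasks: list[str], run_agent) -> list[str]:
    """Execute each sub-task in a fresh context, forwarding only summaries.

    run_agent(task, prior_summaries) runs one sub-task with a fresh context
    window seeded only with the checkpoint summaries of earlier sub-tasks,
    and returns this sub-task's own summary. No invocation ever receives a
    previous sub-task's full execution trace, keeping each one in the
    low-utilization band where performance is highest.
    """
    summaries: list[str] = []
    for task in subtasks:
        summaries.append(run_agent(task, list(summaries)))
    return summaries
```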

The alternative — letting a single agent run until it either succeeds or exhausts its context — is the most common failure mode in production agent systems. Context management is not an optimization. It is a correctness requirement.