Context Management
The Fundamental Constraint
Every agent operates within a finite context window. This is not one limitation among many — it is the single most important technical constraint in agentic AI. Every design decision in a production agent system either respects this constraint or eventually fails because of it.
Context windows range from 128K to 1M+ tokens depending on the model, but raw capacity is misleading. What matters is effective context — the portion the model can reliably attend to and reason over. As context fills:
- Attention quality drops. Information in the middle of long contexts is recalled less reliably than information at the beginning or end.
- Latency increases superlinearly. Each additional token of context adds processing time.
- Cost scales linearly at best. Every API call re-processes the full context.
- Accuracy degrades measurably. Benchmark data shows success rates dropping after approximately 35 minutes of continuous operation, or during extended sessions exceeding 100 tool-use steps.
An agent that does not actively manage its context will degrade from capable to unreliable to broken — silently. There is no error message when an agent starts ignoring instructions buried 80K tokens back in the conversation.
Universal Thresholds
Regardless of platform, the same utilization bands apply. These reflect observed degradation patterns across models and agent frameworks.
| Context % | Action | Description |
|---|---|---|
| 0-60% | Work Freely | Agent has ample room. Full instructions, tool results, and conversation history fit comfortably. No management overhead needed. |
| 50-70% | Monitor | Begin tracking consumption. Log context size per turn. Identify which components are growing fastest. |
| 70-80% | Compact | Actively summarize conversation history. Preserve key decisions, file paths, and current task state. Drop verbose tool output. |
| 80%+ | Clear/Reset | Mandatory context reset. Serialize critical state to external storage, start a fresh context, and reload only what is needed for the next sub-task. |
The overlap between bands is intentional. The transition from “monitor” to “compact” depends on how much remaining work is expected and whether tool results are likely to be large.
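The bands above can be sketched as a simple utilization check. This is a minimal illustration, not any framework's API; it resolves the intentional band overlap in favor of earlier monitoring by starting the monitor band at 50%:

```python
def context_action(used_tokens: int, window_tokens: int) -> str:
    """Map context utilization to the management band from the table above."""
    pct = used_tokens / window_tokens
    if pct >= 0.80:
        return "clear"    # mandatory reset: serialize state, start fresh
    if pct >= 0.70:
        return "compact"  # summarize history, drop verbose tool output
    if pct >= 0.50:
        return "monitor"  # log per-turn growth, find the fastest-growing component
    return "work"         # ample room, no management overhead


# Example: 150K tokens used of a 200K window is 75% utilization
print(context_action(150_000, 200_000))  # -> compact
```

A real agent would run this check after every tool call, since tool results are the most common cause of sudden utilization spikes.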
Context Budget Per Component
A well-designed agent allocates its context window deliberately. Letting any single component grow unchecked starves the others. The following budget is a starting point for systems using 128K-200K token windows:
| Component | Budget | Notes |
|---|---|---|
| System instructions | ~10-15% | Stable across turns. Prime candidate for caching. |
| Active conversation | ~30-40% | The back-and-forth between user and agent. Grows fastest. |
| Tool results | ~20-30% | File contents, search results, command output. Often the largest single-turn contributor. |
| Memory / retrieved context | ~10-20% | RAG results, session memory, persistent facts. |
| Safety margin | ~10-15% | Reserved headroom for the model’s output and unexpected spikes. |
```mermaid
graph LR
    subgraph ctx ["Context Window"]
        direction TB
        S["System Instructions<br/>10-15%"]
        C["Active Conversation<br/>30-40%"]
        T["Tool Results<br/>20-30%"]
        M["Memory / RAG<br/>10-20%"]
        R["Safety Margin<br/>10-15%"]
    end
    S --- C --- T --- M --- R
    subgraph actions ["At Threshold"]
        A1["0-60%: Work freely"]
        A2["50-70%: Monitor"]
        A3["70-80%: Compact"]
        A4["80%+: Clear / Reset"]
    end
    ctx -.-> actions
```
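The budget table can be turned into concrete token allocations for a given window. The midpoint fractions below are one reasonable reading of the table's ranges, a sketch rather than a prescription:

```python
# Midpoints of the budget ranges from the table above (fractions of the window).
BUDGET = {
    "system_instructions": 0.125,  # 10-15%
    "active_conversation": 0.35,   # 30-40%
    "tool_results":        0.25,   # 20-30%
    "memory_rag":          0.15,   # 10-20%
    "safety_margin":       0.125,  # 10-15%
}

def allocate(window_tokens: int) -> dict[str, int]:
    """Turn fractional budgets into absolute per-component token allocations."""
    return {name: int(window_tokens * frac) for name, frac in BUDGET.items()}

def over_budget(usage: dict[str, int], window_tokens: int) -> list[str]:
    """Return components exceeding their allocation: candidates for compaction."""
    limits = allocate(window_tokens)
    return [name for name, used in usage.items() if used > limits.get(name, 0)]

limits = allocate(200_000)
print(limits["active_conversation"])  # -> 70000 for a 200K window
```

Tracking usage per component, rather than in aggregate, is what makes the "identify which components are growing fastest" step of the monitor band actionable.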
The Compaction vs. Caching Trade-off
Modern inference providers cache repeated prompt prefixes: if the first N tokens of a request exactly match a previous request, those tokens are served from cache at reduced cost and latency. This is valuable for agents because the system prompt is identical across every turn.

However, compaction breaks the cache. When an agent compacts its conversation history, the summarized content differs from the original, and every token after the divergence point becomes a cache miss.

The solution is to place the system prompt first as a stable prefix (with explicit cache control breakpoints at its end), and treat everything after that boundary (memory, conversation, tool results) as dynamic content that can be compacted freely without cache impact. Cached input tokens are typically priced at around 10% of uncached tokens, so preserving this boundary compounds to substantial savings over hundreds of turns.
Key insight: Separate stable content (system prompt) from dynamic content (conversation, tool results, memory). Only compact the dynamic portion.
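One way to honor that boundary is to assemble every request with the stable prefix first and all dynamic content after it. The payload shape below is illustrative, not any provider's exact schema; the `cache_control` marker mimics Anthropic-style prefix caching and the `[memory]`/`[tool results]` labels are hypothetical conventions:

```python
def build_request(system_prompt: str, memory: str,
                  history: list[dict], tool_results: str) -> dict:
    """Assemble a request so the cacheable prefix is byte-identical across turns.

    Everything before the cache breakpoint must not change between turns;
    everything after it may be compacted freely without invalidating the cache.
    """
    return {
        # Stable prefix: identical every turn, served from the provider's prompt cache.
        "system": [
            {"text": system_prompt,
             "cache_control": {"type": "ephemeral"}},  # breakpoint at end of prefix
        ],
        # Dynamic suffix: compaction rewrites these, causing misses only from here on.
        "messages": (
            [{"role": "user", "content": f"[memory]\n{memory}"}]
            + history
            + [{"role": "user", "content": f"[tool results]\n{tool_results}"}]
        ),
    }
```

The design choice that matters is ordering: memory and retrieved context go after the breakpoint, even though they feel "stable", because retrieval results change between turns and would otherwise poison the prefix.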
Advanced Strategies
| Strategy | Mechanism | When to Use |
|---|---|---|
| Hierarchical Memory | L1: working memory (current context). L2: session memory (compacted summaries). L3: persistent DB (vector store, structured logs). Promote/demote between tiers based on recency and relevance. | Long-running agents with recurring tasks. |
| Artifact-Driven | Large data (file contents, search results, build output) stored externally. Only lightweight references kept in context: file paths, line ranges, content hashes. Agent re-fetches on demand. | Agents that process many files or large outputs. |
| Plan Caching (APC) | Reuse plan templates for recurring task patterns. Cache the decomposition, not the execution. Reported 50% cost reduction in benchmarks. | Agents handling repetitive task types (e.g., code review, migration). |
| AgentFold | Two-phase condensation: (1) granular condensation removes redundant/obsolete information per-turn; (2) deep consolidation periodically restructures the entire compressed history. | High-step-count agents exceeding 100+ turns. |
| Checkpoint Summarization | Completed sub-tasks are replaced with structured summaries: inputs, outputs, decisions made, artifacts produced. The full execution trace is discarded. | Multi-phase workflows with clear sub-task boundaries. |
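The artifact-driven row can be sketched as a small store that swaps large tool output for a lightweight in-context reference. The storage location, reference format, and helper names here are hypothetical, chosen only to show the round trip:

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical external store; a real agent might use a workspace directory or DB.
ARTIFACT_DIR = Path(tempfile.gettempdir()) / "agent_artifacts"

def stash(content: str, label: str) -> str:
    """Store large content externally; return a short reference for the context."""
    ARTIFACT_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256(content.encode()).hexdigest()[:12]
    path = ARTIFACT_DIR / f"{label}-{digest}.txt"
    path.write_text(content)
    # The reference costs a few dozen tokens regardless of content size.
    return f"[artifact {label} at {path}, sha256:{digest}, {len(content)} chars]"

def fetch(reference: str) -> str:
    """Re-fetch the full content on demand from the path in the reference."""
    path = reference.split(" at ")[1].split(",")[0]
    return Path(path).read_text()

ref = stash("x" * 50_000, "build-log")  # 50K characters leave the context
assert fetch(ref) == "x" * 50_000       # and round-trip on demand
```

The hash in the reference lets the agent detect stale artifacts, and the character count lets it decide whether re-fetching the whole thing is worth the context cost.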
Cross-Platform Implementation
| Platform | Mechanism |
|---|---|
| Claude Code | Manual /compact and /clear commands. Automated compaction via CLAUDE_AUTOCOMPACT_PCT_OVERRIDE environment variable (default ~80%). --resume flag reloads a previous session’s compacted state for multi-session workflows. Context instructions can be placed in CLAUDE.md. |
| OpenAI Codex | Server-side compaction with configurable thresholds and strategy (summarize, truncate, sliding_window). Options to preserve system instructions and keep last N turns verbatim. Developer does not control exact summarization — simplifies implementation but reduces control over what survives compaction. |
| Gemini CLI | A window of up to 1M tokens shifts the strategy toward monitoring rather than aggressive compaction. Cost still scales with context size (a 500K-token context costs 5x a 100K one per turn), so compaction remains valuable for cost control even when the window is far from full. Token usage is available via usage_metadata after each turn. |
| LangGraph | Explicit graph-based primitives: a context monitor node checks utilization and conditionally routes to a compaction node. Supports sliding-window + summary compaction, external artifact storage with in-context references, and checkpoint-based persistence via InMemorySaver or other backends. |
The Performance Cliff
Agent performance does not degrade gracefully — it falls off a cliff. After approximately 35 minutes of continuous operation, success rates on coding tasks drop sharply regardless of model. Extended sessions with many tool calls (100+) show similar degradation as context fills. The curve is non-linear: an agent at 60% context utilization may perform at 95% of peak, while at 85% it may perform at only 60% of peak.
The duration of tasks agents can complete doubles roughly every 7 months as agents handle increasingly complex multi-file changes (METR research). This growth tracks rising context demands rather than raw model capability, so the constraint tightens even as models improve.
Why Sub-Task Decomposition Is Mandatory
The performance cliff makes a strong architectural argument: no single agent invocation should run long enough to hit the cliff. Long-horizon work must be decomposed into sub-tasks, each completing well within the safe utilization zone. The orchestrator passes forward only checkpoint summaries from previous sub-tasks — not full execution traces — keeping each invocation in the 0-60% band where performance is highest.
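The decomposition argument can be sketched as an orchestrator loop: each sub-task runs in a fresh context seeded only with prior checkpoint summaries. The `run_subtask` callable is a stand-in for an actual agent invocation, and the `Checkpoint` fields mirror the checkpoint-summarization strategy above:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Checkpoint:
    """Structured summary of a completed sub-task; the full trace is discarded."""
    task: str
    outcome: str
    artifacts: list[str] = field(default_factory=list)

def orchestrate(subtasks: list[str],
                run_subtask: Callable[[str, list[Checkpoint]], Checkpoint]) -> list[Checkpoint]:
    """Run each sub-task in a fresh invocation seeded with summaries only."""
    checkpoints: list[Checkpoint] = []
    for task in subtasks:
        # Each invocation starts near 0% utilization: it sees compact
        # summaries of prior work, never the full execution traces.
        checkpoints.append(run_subtask(task, checkpoints))
    return checkpoints

# Stub invocation showing the shape of the loop.
def fake_agent(task: str, prior: list[Checkpoint]) -> Checkpoint:
    return Checkpoint(task=task, outcome=f"done after {len(prior)} prior checkpoints")

result = orchestrate(["plan", "implement", "verify"], fake_agent)
```

The key property is that context cost per invocation stays roughly constant as the checkpoint list grows, rather than growing with the full history of the work.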
The alternative — letting a single agent run until it either succeeds or exhausts its context — is the most common failure mode in production agent systems. Context management is not an optimization. It is a correctness requirement.