- Context overflow causes silent quality degradation — agents forget instructions, repeat questions, and produce inconsistent outputs without any error message
- Tool call results are the largest unexpected context consumers — a single web search can inject 5,000+ tokens into the context
- Enable token usage logging immediately so you know your actual usage before hitting limits
- Configure an explicit truncation strategy — default FIFO truncation removes critical early messages; summarization preserves continuity better
- memory.md is loaded into context at session start — keep it under 2,000 tokens to preserve room for actual conversation
Context overflow is the problem most builders diagnose last, after blaming the model, the prompts, and the integration. Here's what actually happens: your agent runs a long session, tools inject large results, the conversation history grows, and eventually the model can no longer see the instructions you gave it 40 messages ago.
The agent doesn't throw an error. It just starts forgetting. And you start wondering why it worked yesterday but doesn't today.
What Context Overflow Actually Is
Every language model has a context window — a maximum number of tokens it can process in a single call. This window holds everything: your system prompt, the conversation history, all tool call inputs and outputs, injected memory, and the current message.
When the total exceeds the limit, something gets cut. The model can't tell you what it lost. It silently works with an incomplete picture. Results get weird.
| Model | Context Window | Effective Limit* |
|---|---|---|
| Claude 3.5 Sonnet | 200k tokens | ~150k before degradation |
| GPT-4o | 128k tokens | ~100k before degradation |
| Gemini 1.5 Flash | 1M tokens | Genuinely large |
| DeepSeek V3 | 64k tokens | ~50k before degradation |
| Llama 3.1 8B (Ollama) | 128k tokens | ~80k for reliable use |
*Effective limit is where quality starts degrading due to attention dilution, before the hard token limit is reached.
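To make the budget concrete, here is a minimal sketch that estimates how much of the effective limit each component consumes. The `estimate_tokens` heuristic (about 4 characters per token) and the `context_usage` helper are assumptions for illustration, not part of OpenClaw; use the model's real tokenizer when exact counts matter.

```python
# Rough heuristic (assumption): about 4 characters per token for English text.
def estimate_tokens(text: str) -> int:
    return len(text) // 4


def context_usage(parts: dict[str, str], effective_limit: int) -> dict:
    """Return per-component token estimates and the fraction of the limit used."""
    per_part = {name: estimate_tokens(text) for name, text in parts.items()}
    total = sum(per_part.values())
    return {
        "per_part": per_part,
        "total": total,
        "fraction_used": total / effective_limit,
    }


if __name__ == "__main__":
    # Placeholder strings standing in for real context components.
    parts = {
        "system_prompt": "You are a research assistant. " * 50,
        "memory": "User prefers concise answers. " * 100,
        "history": "user: question\nassistant: answer\n" * 400,
        "tool_results": "search result text " * 2000,
    }
    # 100_000 mirrors the ~100k effective limit listed for GPT-4o in the table.
    report = context_usage(parts, effective_limit=100_000)
    for name, tokens in sorted(report["per_part"].items(), key=lambda kv: -kv[1]):
        print(f"{name:>14}: {tokens:>7} tokens")
    print(f"total: {report['total']} tokens ({report['fraction_used']:.0%} of effective limit)")
```

Sorting by size makes it easy to see which component is eating the window, which is usually not the one you expect.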
Mistake 1: Not Monitoring Token Usage
You can't manage what you don't measure. The first mistake is running agents without token usage logging. By the time you notice quality degrading, you've already been in overflow for hours.
Enable usage logging in your OpenClaw config:
logging:
  token_usage: true
  token_alert_threshold: 0.80  # alert when 80% of context used
  log_level: info
With this enabled, you'll see token counts per call in your logs. Set an alert at 80% of the model's context window. That gives you room to respond before quality degrades.
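If you want the same check in your own tooling, the threshold logic might look like the sketch below. It uses only Python's standard `logging` module; `check_token_usage` and its parameters are hypothetical names, and how OpenClaw implements its alert internally is not something this example claims to show.

```python
import logging

logger = logging.getLogger("token_usage")


def check_token_usage(used_tokens: int, context_window: int,
                      alert_threshold: float = 0.80) -> bool:
    """Log usage for one call and return True once the alert threshold is reached."""
    fraction = used_tokens / context_window
    logger.info("tokens used: %d / %d (%.0f%%)",
                used_tokens, context_window, fraction * 100)
    if fraction >= alert_threshold:
        logger.warning("context at %.0f%% of window, threshold is %.0f%%",
                       fraction * 100, alert_threshold * 100)
        return True
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    check_token_usage(52_000, 128_000)   # logs usage only
    check_token_usage(110_000, 128_000)  # logs usage and a warning
```

Measuring against the effective limit rather than the advertised window would make the alert fire earlier, which is probably what you want for models that degrade well before the hard limit.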
Mistake 2: Untruncated Tool Results
Tool results are the biggest surprise context consumers. When your agent calls a web search tool and gets back a full page of results, that's often 3,000–8,000 tokens injected directly into the context. Do that three times in a session and you've consumed up to 24,000 tokens on tool results alone.
Configure tool result truncation for any tool that returns variable-length content:
tools:
  web_search:
    max_result_tokens: 1500
    truncation: summary  # summarize before injecting
  file_reader:
    max_result_tokens: 2000
    truncation: head     # take only the first N tokens
The right truncation strategy depends on the tool. For search results, summary truncation (compress the results into key points) preserves more information per token. For structured data files, head truncation (take the first chunk) works when the important data appears early.
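The two strategies can be sketched in a few lines. This is an illustration under stated assumptions: tokens are approximated by whitespace-separated words, `truncate_result` is a hypothetical helper, and the `summarize` callable stands in for whatever model call produces the summary.

```python
from typing import Callable, Optional


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is roughly one token.
    return len(text.split())


def truncate_result(text: str, max_tokens: int, strategy: str = "head",
                    summarize: Optional[Callable[[str], str]] = None) -> str:
    """Cap a tool result at max_tokens using head or summary truncation."""
    if count_tokens(text) <= max_tokens:
        return text
    if strategy == "summary":
        if summarize is None:
            raise ValueError("summary strategy needs a summarize callable")
        text = summarize(text)
        if count_tokens(text) <= max_tokens:
            return text
        # The summary came back too long: fall through to head truncation.
    elif strategy != "head":
        raise ValueError(f"unknown strategy: {strategy}")
    return " ".join(text.split()[:max_tokens])


if __name__ == "__main__":
    page = "result " * 5000
    print(count_tokens(truncate_result(page, 1500)))
    print(truncate_result(page, 1500, "summary",
                          summarize=lambda t: "three key points about the query"))
```

Falling back to head truncation when the summary overshoots keeps the cap a hard guarantee even if the summarizer misbehaves.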
Mistake 3: Bloated Memory Files
Every byte of memory.md gets injected into the context at session start. A memory file that's grown unchecked over weeks of agent operation can consume 10,000–30,000 tokens before the conversation even begins. At the high end, that's nearly a quarter of a 128k context window gone before you've sent a single message.
Here's where most people stop. They don't actually fix the memory file — they just complain that the agent is slow.
memory:
  persist: true
  max_size_tokens: 2000      # hard limit on memory.md injection
  pruning_strategy: recency  # keep most recently accessed entries
Set a hard size limit on memory injection. 2,000 tokens for a personal agent. 5,000 for a complex business agent with lots of persistent context. Move verbose information — documents, full logs, long histories — to the long-term store and query it on demand rather than injecting it wholesale.
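Recency-based pruning under a token cap can be sketched as follows. `MemoryEntry`, `prune_by_recency`, and the 4-characters-per-token estimate are all assumptions made for illustration; the actual pruning behavior depends on your OpenClaw version and config.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    text: str
    last_accessed: float  # Unix timestamp of the most recent access


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return len(text) // 4


def prune_by_recency(entries: list[MemoryEntry], max_tokens: int) -> list[MemoryEntry]:
    """Keep the most recently accessed entries that fit within max_tokens."""
    kept: list[MemoryEntry] = []
    used = 0
    for entry in sorted(entries, key=lambda e: e.last_accessed, reverse=True):
        cost = estimate_tokens(entry.text)
        if used + cost > max_tokens:
            # Stop at the first entry that does not fit, so the kept set is
            # strictly the most recent entries rather than a mix of ages.
            break
        kept.append(entry)
        used += cost
    return kept


if __name__ == "__main__":
    entries = [
        MemoryEntry("User's timezone is UTC+2. " * 40, last_accessed=1_700_000_000),
        MemoryEntry("Project uses PostgreSQL 15. " * 40, last_accessed=1_700_500_000),
        MemoryEntry("Old meeting notes. " * 400, last_accessed=1_600_000_000),
    ]
    for entry in prune_by_recency(entries, max_tokens=2000):
        print(estimate_tokens(entry.text), entry.text[:30])
```

Entries that fall out of the budget are the ones to move to the long-term store instead of deleting outright.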
Mistake 4: No Explicit Truncation Strategy
OpenClaw's default truncation removes the oldest messages first (FIFO). This is the worst possible strategy for most agents. The oldest messages are usually the most important: initial instructions, user preferences established early, task context defined at session start.
context:
  truncation_strategy: summarize  # compress old messages before removal
  pinned_messages:
    - system_prompt               # never remove these
    - first_user_message          # preserve initial task context
  summary_interval: 20            # summarize every 20 messages
Summarization-based truncation compresses old conversation turns into a concise summary block before removing the originals. The agent retains the gist of earlier exchanges without consuming the full token count. This keeps continuity while managing context size.
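As a rough illustration of how pinning and summarization interact, the sketch below folds the oldest unpinned messages into a single summary message until the history fits. The `Message` type, `truncate_with_pins`, the `summary_budget` reserve, and the token estimate are hypothetical; the `summarize` callable stands in for a model call and is expected to stay within the reserve.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return len(text) // 4


def total_tokens(messages: list[Message]) -> int:
    return sum(estimate_tokens(m.text) for m in messages)


def truncate_with_pins(messages: list[Message], max_tokens: int,
                       summarize: Callable[[list[Message]], str],
                       summary_budget: int = 200) -> list[Message]:
    """Replace the oldest unpinned messages with one summary message."""
    if total_tokens(messages) <= max_tokens:
        return list(messages)
    pinned = [m for m in messages if m.pinned]
    unpinned = [m for m in messages if not m.pinned]
    # Reserve room for the summary so the final history still fits.
    target = max_tokens - summary_budget
    cut = 0
    while cut < len(unpinned) and total_tokens(pinned) + total_tokens(unpinned[cut:]) > target:
        cut += 1
    summary = Message("system", summarize(unpinned[:cut]))
    # Pinned messages go first; this assumes they come from the start of the
    # session (system prompt, first user message).
    return pinned + [summary] + unpinned[cut:]
```

Plain FIFO is the same loop without the `pinned` split and without the summary message, which is exactly why it loses the initial instructions first.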
Mistake 5: Using a Short-Context Model for Long Sessions
DeepSeek V3 at 64k tokens is excellent for cost, but its smaller context window means you hit limits faster. For agents that run multi-hour research sessions or process long documents, 64k evaporates quickly.
Match your model to your session length requirements:
- Short sessions (<10k tokens): Any model works. Use the cheapest.
- Medium sessions (10k–50k tokens): GPT-4o mini or DeepSeek V3 work well.
- Long sessions (50k–100k tokens): GPT-4o or Claude 3.5 Haiku.
- Very long sessions or large documents: Gemini 1.5 Flash (1M context) is the clear choice.
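The tiers above can be encoded as a small lookup, which is handy if you route sessions to models programmatically. `suggest_models` is a hypothetical helper; the thresholds and model names are taken directly from the list, and where a boundary value falls (for example exactly 50k) is an arbitrary choice here.

```python
def suggest_models(session_tokens: int) -> list[str]:
    """Map an expected session size in tokens to the tiers listed above."""
    if session_tokens < 10_000:
        return ["any (use the cheapest)"]
    if session_tokens <= 50_000:
        return ["GPT-4o mini", "DeepSeek V3"]
    if session_tokens <= 100_000:
        return ["GPT-4o", "Claude 3.5 Haiku"]
    return ["Gemini 1.5 Flash"]


if __name__ == "__main__":
    for size in (4_000, 30_000, 80_000, 400_000):
        print(f"{size:>7} tokens -> {', '.join(suggest_models(size))}")
```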
The Summary Agent Pattern
The most reliable solution for long-running agents is a dedicated summary agent. A lightweight, cheap model (GPT-4o mini works well) runs every N turns and compresses the conversation history. The main agent's context is then reset with only the summary plus the most recent messages.
agents:
  - name: main-agent
    model: deepseek-chat
    context_manager:
      type: summary_agent
      summary_agent: context-summarizer
      summarize_every: 15  # compress after every 15 turns
      keep_recent: 5       # always keep last 5 turns verbatim
  - name: context-summarizer
    model: gpt-4o-mini     # cheap model for summarization
    role: "Compress conversation history into a concise summary
      preserving all decisions, user preferences, and task context."
We've seen this pattern reduce effective token usage by 60–70% in long research agent sessions while maintaining output quality. The summary agent costs almost nothing to run — and it pays for itself in avoided quality degradation and reduced total API costs from shorter contexts.
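The control flow behind this pattern is simple enough to sketch. The class below is an illustration only: `SummarizingContext` is a hypothetical name, the `summarize` callable stands in for the call to the cheap summarizer model, and the defaults mirror `summarize_every: 15` and `keep_recent: 5` from the config.

```python
from typing import Callable

# summarize(previous_summary, old_turns) -> new_summary
Summarizer = Callable[[str, list[str]], str]


class SummarizingContext:
    """Compress history every N turns and keep the last K turns verbatim."""

    def __init__(self, summarize: Summarizer,
                 summarize_every: int = 15, keep_recent: int = 5):
        self.summarize = summarize
        self.summarize_every = summarize_every
        self.keep_recent = keep_recent
        self.summary = ""
        self.turns: list[str] = []
        self._since_summary = 0

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        self._since_summary += 1
        if self._since_summary >= self.summarize_every:
            self._compress()

    def _compress(self) -> None:
        split = max(0, len(self.turns) - self.keep_recent)
        old, recent = self.turns[:split], self.turns[split:]
        if old:
            # Fold the previous summary in so earlier decisions are not lost.
            self.summary = self.summarize(self.summary, old)
        self.turns = recent
        self._since_summary = 0

    def build_context(self) -> list[str]:
        """Return what the main agent sees: the summary, then recent turns."""
        prefix = [f"[summary] {self.summary}"] if self.summary else []
        return prefix + self.turns
```

Passing the previous summary back into the summarizer matters: without it, each compression would only cover the latest batch and the oldest context would silently disappear, which is the problem the pattern is meant to solve.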
Frequently Asked Questions
What is context overflow in OpenClaw?
Context overflow occurs when the total tokens in an agent's conversation history — messages, tool call results, memory injections, and system prompts — exceed the model's context window. When this happens, the model can no longer see earlier parts of the conversation. Agents start forgetting earlier instructions, making repeated mistakes, and producing inconsistent responses.
How do I know if my OpenClaw agent is hitting context overflow?
Signs include: the agent repeating questions it already asked, ignoring instructions given earlier in the session, and degraded response quality over long sessions. Overflow is usually silent, but you may also see API errors mentioning token limits when a request exceeds the hard limit. Enable token usage logging in your OpenClaw config to record usage per call and set an alert threshold.
What is the context window limit for common models used with OpenClaw?
GPT-4o supports 128k tokens. Claude 3.5 Sonnet supports 200k tokens. Gemini 1.5 Flash supports 1 million tokens. DeepSeek V3 supports 64k tokens. Local models via Ollama vary by model: Llama 3.1 8B supports up to 128k tokens, but Ollama runs models with a much smaller default context length unless you raise the num_ctx parameter. Always verify the effective context limit, not just the advertised one, as performance degrades before the hard limit.
How does OpenClaw handle context truncation?
OpenClaw's default truncation strategy removes the oldest messages first when approaching the context limit. You can configure alternative strategies: sliding window (keep the last N messages), summarization (compress old messages into a summary before removing them), or pinned messages (preserve specific messages like the system prompt regardless of position).
Can I use a summary agent to prevent context overflow?
Yes. A summary agent pattern runs a second lightweight agent that periodically compresses the conversation history into a concise summary. The main agent's context is then reset with only the summary plus recent messages. This approach keeps context usage bounded while preserving continuity across long sessions.
Does loading memory.md count against the context limit?
Yes. Everything injected into the conversation — including memory.md contents, system prompts, and tool definitions — counts against the context window. A large memory.md file can consume thousands of tokens before the conversation even starts. Keep memory.md concise and structured; store verbose information in the long-term store instead.