Managing long conversations
Keep multi-turn agent conversations inside the context window with sliding-window, truncation, and summarization history policies.
Every turn you send to a model includes the full prior transcript. Left unchecked, a long chat keeps growing until it overflows the model's context window — at which point requests start failing or silently dropping the oldest content. A history policy compacts the message list before each call so the conversation stays bounded.
This guide covers when transcripts grow too large, the three built-in policies, and how to pair a policy with persistent memory on one agent.
When transcripts grow too large
An Agent replays the entire message list on every turn of the run loop. That's what lets the model reason over earlier context, but it means token usage climbs with each exchange. You'll feel it in three ways:
- Cost. You pay for input tokens on every call, so a 40-turn chat re-sends turns 1–39 forty times over.
- Latency. Larger prompts take longer to process.
- Hard limits. Once the transcript exceeds the model's window, the request errors out.
A history policy fixes this by trimming or compressing the transcript before it's sent. You attach one via Agent(history=...), and the run loop calls its compact() method automatically — you don't invoke it yourself.
History policies operate on the in-flight message list for a single run. To carry state across separate runs or restarts, you want memory — see Combining memory and history below.
Choosing a history policy
AgentRoute ships three policies, all implementing the History protocol (async compact(messages, token_budget=None) -> list[Message]). All three preserve leading system messages and keep complete user turns intact — an assistant tool call and its tool results never get split from the user message that triggered them.
Sliding window
HistorySlidingWindow keeps the most recent N user turns and drops everything older. It's the cheapest option — no model call, no token counting — and a good default for chat-style agents where only recent context matters.
from agentroute import Agent, HistorySlidingWindow
agent = Agent(
name="support-chat",
model="claude-sonnet-4",
history=HistorySlidingWindow(window_size=10),
)
result = agent.run("What did I ask you three messages ago?")
print(result)window_size counts user turns, not individual messages, so one turn can include the user message, an assistant reply, and any tool-call/tool-result pairs in between. Turns beyond the window are discarded outright.
Truncate
HistoryTruncate drops the oldest turns until the transcript fits a token budget. It uses a dependency-free character heuristic (roughly 4 characters per token) rather than a real tokenizer, so treat max_tokens as an approximate ceiling and leave headroom.
from agentroute import Agent, HistoryTruncate
agent = Agent(
name="research-assistant",
model="claude-sonnet-4",
history=HistoryTruncate(max_tokens=80_000),
)System messages are counted first and always preserved; the remaining budget is filled with the most recent turns. At least one turn is always kept, even if a single turn exceeds the budget.
Summarize
HistorySummarize keeps the last few turns verbatim and replaces everything older with an LLM-generated summary injected as a system message. This is the only policy that preserves the gist of dropped turns instead of discarding them — useful for long-running agents that need to remember earlier decisions.
Because summarizing requires an actual model call, this policy must be constructed with a concrete Model. Passing model=None raises RuntimeError at compact time.
from agentroute import Agent, HistorySummarize, resolve_model
summarizer = resolve_model("claude-sonnet-4")
agent = Agent(
name="long-running-planner",
model="claude-sonnet-4",
history=HistorySummarize(model=summarizer, keep_recent=4),
)keep_recent controls how many recent turns stay untouched; everything before them is compressed into 3–5 bullet points that preserve names, numbers, and decisions. Use resolve_model to build the Model instance — see the models concept for resolution rules.
Each compaction triggers an extra completion against the summarizer model, which adds latency and cost. Reach for it only when losing older context outright (sliding window or truncate) would hurt the agent's task.
Decision guide
Recent context is all that matters and you want zero overhead. Fast, predictable, no model call.
You think in token budgets and want a hard-ish cap on prompt size. Approximate (char heuristic), still no model call.
Older turns carry decisions the agent must remember. Highest fidelity, but costs an extra completion per compaction and needs a concrete Model.
Short, bounded conversations. Nothing to configure — leave history unset.
| Policy | Keeps | Cost of compaction | Needs a Model |
|---|---|---|---|
HistorySlidingWindow(window_size) | last N turns | none | no |
HistoryTruncate(max_tokens) | recent turns under a token budget | none | no |
HistorySummarize(model, keep_recent) | last N turns + a summary of the rest | one extra completion | yes |
Combining memory and history
History and memory solve different problems and compose cleanly on one agent:
- History compacts the transcript within a run so it fits the context window. It's ephemeral.
- Memory persists facts and messages across runs and restarts. Attach Memory (in-RAM) or MemorySQLite (persistent, with full-text recall) and your tools read and write it through
ctx.memory.
Set both at construction. The history policy keeps each run's prompt lean while memory carries durable state forward:
from agentroute import (
Agent,
Context,
HistorySlidingWindow,
MemorySQLite,
)
agent = Agent(
name="account-concierge",
model="claude-sonnet-4",
instructions="Help the user manage their account. Save anything worth remembering.",
memory=MemorySQLite("concierge.db"),
history=HistorySlidingWindow(window_size=12),
)
@agent.tool
async def remember_preference(ctx: Context, key: str, value: str) -> str:
"""Persist a user preference for future sessions."""
await ctx.memory.remember(key, value)
return f"Saved {key}."
@agent.tool
async def lookup(ctx: Context, query: str) -> str:
"""Recall previously saved facts relevant to the query."""
hits = await ctx.memory.recall(query, limit=5)
return "\n".join(hits) if hits else "Nothing saved yet."
result = agent.run("My timezone is CET. Remember that for next time.")
print(result)Here a sliding window keeps each prompt within the context window, while MemorySQLite survives across process restarts so the next session can recall("timezone"). See the memory guide for the full MemoryProto surface.
The two are independent. You can run history with no memory (stateless chat, bounded prompts), memory with no history (durable state, short conversations), both, or neither.
Next steps
How compaction fits into the run loop and the turn-grouping rules.
Full API for the History protocol and the three built-in policies.
Persist messages and facts across runs with Memory and MemorySQLite.
Track tokens and cost so you know when a conversation is getting expensive.