fix(compression): include system prompt + tool schemas in token estimates (#18265)
The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; #6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (#14695) and @Jackten (#6217); user report @codecovenant on X (2026-04-30). Closes #14695 Closes #6217
This commit is contained in:
@@ -1144,7 +1144,7 @@ def _compress_session_history(
|
||||
before_messages: list | None = None,
|
||||
history_version: int | None = None,
|
||||
) -> tuple[int, dict]:
|
||||
from agent.model_metadata import estimate_messages_tokens_rough
|
||||
from agent.model_metadata import estimate_request_tokens_rough
|
||||
|
||||
agent = session["agent"]
|
||||
# Snapshot history under the lock so the LLM-bound compression call
|
||||
@@ -1160,7 +1160,13 @@ def _compress_session_history(
|
||||
usage = _get_usage(agent)
|
||||
return 0, usage
|
||||
if approx_tokens is None:
|
||||
approx_tokens = estimate_messages_tokens_rough(history)
|
||||
# Include system prompt + tool schemas so the figure reflects real
|
||||
# request pressure, not a transcript-only underestimate (#6217).
|
||||
_sys_prompt = getattr(agent, "_cached_system_prompt", "") or ""
|
||||
_tools = getattr(agent, "tools", None) or None
|
||||
approx_tokens = estimate_request_tokens_rough(
|
||||
history, system_prompt=_sys_prompt, tools=_tools
|
||||
)
|
||||
# Pass system_message=None so AIAgent._compress_context rebuilds the
|
||||
# system prompt cleanly via _build_system_prompt(None). Passing the
|
||||
# cached prompt (which already contains the agent identity block)
|
||||
@@ -2328,14 +2334,21 @@ def _(rid, params: dict) -> dict:
|
||||
focus_topic = str(params.get("focus_topic", "") or "").strip()
|
||||
try:
|
||||
from agent.manual_compression_feedback import summarize_manual_compression
|
||||
from agent.model_metadata import estimate_messages_tokens_rough
|
||||
from agent.model_metadata import estimate_request_tokens_rough
|
||||
|
||||
with session["history_lock"]:
|
||||
before_messages = list(session.get("history", []))
|
||||
history_version = int(session.get("history_version", 0))
|
||||
before_count = len(before_messages)
|
||||
_agent = session["agent"]
|
||||
_sys_prompt = getattr(_agent, "_cached_system_prompt", "") or ""
|
||||
_tools = getattr(_agent, "tools", None) or None
|
||||
before_tokens = (
|
||||
estimate_messages_tokens_rough(before_messages) if before_count else 0
|
||||
estimate_request_tokens_rough(
|
||||
before_messages, system_prompt=_sys_prompt, tools=_tools
|
||||
)
|
||||
if before_count
|
||||
else 0
|
||||
)
|
||||
|
||||
if before_count >= 4:
|
||||
@@ -2358,8 +2371,18 @@ def _(rid, params: dict) -> dict:
|
||||
with session["history_lock"]:
|
||||
messages = list(session.get("history", []))
|
||||
after_count = len(messages)
|
||||
# Re-read system prompt + tools after compression — _compress_context
|
||||
# may have rebuilt the system prompt (_cached_system_prompt=None).
|
||||
_sys_prompt_after = getattr(_agent, "_cached_system_prompt", "") or _sys_prompt
|
||||
_tools_after = getattr(_agent, "tools", None) or _tools
|
||||
after_tokens = (
|
||||
estimate_messages_tokens_rough(messages) if after_count else 0
|
||||
estimate_request_tokens_rough(
|
||||
messages,
|
||||
system_prompt=_sys_prompt_after,
|
||||
tools=_tools_after,
|
||||
)
|
||||
if after_count
|
||||
else 0
|
||||
)
|
||||
agent = session["agent"]
|
||||
_sync_session_key_after_compress(sid, session)
|
||||
|
||||
Reference in New Issue
Block a user