feat(image-input): native multimodal routing based on model vision capability (#16506)

* feat(image-input): native multimodal routing based on model vision capability

Attach user-sent images as OpenAI-style content parts on the user turn when
the active model supports native vision, so vision-capable models see real
pixels instead of a lossy text description from vision_analyze.

Routing decision (agent/image_routing.py::decide_image_input_mode):

  agent.image_input_mode = auto | native | text  (default: auto)

In auto mode:
  - If auxiliary.vision.provider/model is explicitly configured, keep the
    text pipeline (user paid for a dedicated vision backend).
  - Else if models.dev reports supports_vision=True for the active
    provider/model, attach natively.
  - Else fall back to text (current behaviour).
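
The auto-mode rules above can be restated as a small truth table. This is a
simplified sketch, not the code in agent/image_routing.py: the name
decide_mode_sketch and its supports_vision argument are hypothetical, and the
models.dev lookup is replaced by a plain optional boolean.

```python
from typing import Any, Dict, Optional


def decide_mode_sketch(
    cfg: Optional[Dict[str, Any]],
    supports_vision: Optional[bool],
) -> str:
    """Hypothetical restatement of the routing rules; returns "native" or "text"."""
    cfg = cfg or {}
    agent_cfg = cfg.get("agent") or {}
    mode = str(agent_cfg.get("image_input_mode") or "auto").strip().lower()
    if mode in ("native", "text"):
        return mode  # an explicit setting wins over capability lookup
    # auto: an explicitly configured auxiliary vision backend keeps the text pipeline
    vision = (cfg.get("auxiliary") or {}).get("vision") or {}
    provider = str(vision.get("provider") or "").strip().lower()
    if provider not in ("", "auto") or vision.get("model") or vision.get("base_url"):
        return "text"
    # unknown capability (None) is treated like "no vision"
    return "native" if supports_vision is True else "text"


no_config = decide_mode_sketch(None, True)
unknown_caps = decide_mode_sketch(None, None)
aux_override = decide_mode_sketch(
    {"auxiliary": {"vision": {"provider": "openrouter"}}}, True
)
forced_native = decide_mode_sketch({"agent": {"image_input_mode": "native"}}, False)
```

Under these assumptions only the first and last calls resolve to native; an
unknown capability or an explicit auxiliary backend both keep the text path.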

Call sites updated: gateway/run.py (all messaging platforms), tui_gateway
(dashboard/Ink), cli.py (interactive /attach + drag-drop).

run_agent.py changes:
  - _prepare_anthropic_messages_for_api now passes image parts through
    unchanged when the model supports vision — the Anthropic adapter
    translates them to native image blocks. Previous behaviour
    (vision_analyze → text) only runs for non-vision Anthropic models.
  - New _prepare_messages_for_non_vision_model mirrors the same contract
    for chat.completions and codex_responses paths, so non-vision models
    on any provider get text-fallback instead of failing at the provider.
  - New _model_supports_vision() helper reads models.dev caps.

vision_analyze description rewritten: positions it as a tool for images
NOT already visible in the conversation (URLs, tool output, deeper
inspection). Prevents the model from redundantly calling it on images
already attached natively.

Config default: agent.image_input_mode = auto.

Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py),
all existing tests that reference _prepare_anthropic_messages_for_api
still pass (198 targeted + new tests green).

* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor

Two follow-ups that make the native image routing safer for long / heavy
sessions:

1) Oversize handling in build_native_content_parts:
   - 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES,
     the most restrictive provider — Gemini inline data).
   - Delegates to vision_tools._resize_image_for_vision (Pillow-based,
     already battle-tested) to downscale to 5 MB first-try.
   - If Pillow is missing or resize still overshoots, the image is
     dropped and reported back in skipped[]; caller falls back to text
     enrichment for that image.

2) Image-token accounting in context_compressor:
   - New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant;
     within the realistic range for Anthropic/GPT-4o/Gemini billing).
   - _content_length_for_budget() helper: sums text-part lengths and
     charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/
     input_image part.  Base64 payload inside image_url is NOT counted
     as chars — dimensions don't matter, only image-presence.
   - Both tail-cut sites (_prune_old_tool_results L527 and
     _find_tail_cut_by_tokens L1126) now call the helper so multi-image
     conversations don't slip past compression budget.
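
Worked arithmetic for the flat image charge, as a self-contained sketch. The
function name content_length_sketch and the sample turn are illustrative; the
constants mirror the values named above (1600 tokens, 4 chars per token).

```python
from typing import Any

CHARS_PER_TOKEN = 4
IMAGE_TOKEN_ESTIMATE = 1600
IMAGE_CHAR_EQUIVALENT = IMAGE_TOKEN_ESTIMATE * CHARS_PER_TOKEN  # 6400 chars per image
IMAGE_PART_TYPES = {"image_url", "input_image", "image"}


def content_length_sketch(content: Any) -> int:
    """Char-length for budgeting: text lengths plus a flat charge per image part."""
    if isinstance(content, str):
        return len(content)
    total = 0
    for part in content:
        if isinstance(part, dict) and part.get("type") in IMAGE_PART_TYPES:
            total += IMAGE_CHAR_EQUIVALENT  # base64 payload size is ignored
        elif isinstance(part, dict):
            total += len(part.get("text") or "")
        else:
            total += len(str(part))
    return total


turn = [
    {"type": "text", "text": "compare these"},  # 13 chars
    {"type": "image_url", "image_url": {"url": "data:image/png;base64," + "A" * 1_000_000}},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64," + "A" * 5_000_000}},
]
chars = content_length_sketch(turn)      # 13 + 2 * 6400 = 12813
tokens = chars // CHARS_PER_TOKEN + 10   # 3203 + 10 = 3213 (+10 for role/metadata)
```

Both images cost the same 6400 chars even though their payloads differ by 4 MB
of base64, which is the "image-presence only" behaviour described above.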

Tests: 9 new in test_image_routing.py (oversize triggers resize,
resize-fails-returns-None, oversize-skipped-reported), 11 new in
test_compressor_image_tokens.py (flat charge per image, multiple images,
Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on
raw base64, bounds-check on the constant, integration test that an
image-heavy tail actually gets trimmed).

* fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits

The previous commit imposed a hardcoded 20 MB base64 ceiling on all
providers, triggering auto-resize on anything larger. This was wrong in
both directions:

  * Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400
    'image exceeds 5 MB maximum' above that).
  * Too strict for OpenAI / Codex / OpenRouter — accept 49 MB+ without
    complaint (empirically verified April 2026 with progressive PNG
    sizes).

New behaviour:

  * _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a
    ceiling (5 MB, since Claude on Bedrock shares Anthropic's decoder).
  * Providers NOT in the table get no ceiling — images attach at native
    size and we trust the provider to return its own error if it
    disagrees. A provider-specific 400 message is clearer than us
    guessing wrong and silently degrading image quality.
  * build_native_content_parts() gains a keyword-only provider arg;
    gateway/CLI/TUI pass the active provider so Anthropic users get
    auto-resize protection while OpenAI users don't pay it.
  * Resize target dropped from 5 MB to 4 MB to slide safely under
    Anthropic's boundary with header overhead.

Empirical measurements (direct API, no Hermes in the loop):

    image b64     anthropic   openrouter/gpt5.5   codex-oauth/gpt5.5
    0.19 MB       ✓           ✓                   ✓
    12.37 MB      ✗ 400 5MB   ✓                   ✓
    23.85 MB      ✗ 400 5MB   ✓                   ✓
    49.46 MB      ✗ 413       ✓                   ✓

Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through,
Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts
routes ceiling by provider, unknown provider gets no ceiling. All 52
targeted tests pass.

* refactor(image-input): attempt native, shrink-and-retry on provider reject

Replace proactive per-provider size ceilings with a reactive shrink path
on the provider's actual rejection. All providers now attempt native
full-size attachment first; if the provider returns an image-too-large
error, the agent silently shrinks and retries once.

Why the previous design was wrong: hardcoded provider ceilings
(anthropic=5MB, others=unlimited) meant OpenAI users sending a 10MB
image kept full quality, while Anthropic users were downscaled up front
on anything >5MB, even though the recovery available once a provider
rejects is the same for everyone (shrink + retry). Baking the table into
the routing layer also requires updating Hermes every time a provider's
limit changes.

Reactive design:
  - image_routing.py: _file_to_data_url encodes native size, no ceiling.
    build_native_content_parts drops its provider kwarg.
  - error_classifier.py: new FailoverReason.image_too_large + pattern
    match ("image exceeds", "image too large", etc.) checked BEFORE
    context_overflow so Anthropic's 5MB rejection lands in the right
    bucket.
  - run_agent.py: new _try_shrink_image_parts_in_messages walks api
    messages in-place, re-encodes oversized data: URL image parts
    through vision_tools._resize_image_for_vision to fit under 4MB,
    handles both chat.completions (dict image_url) and Responses
    (string image_url) shapes, ignores http URLs (provider-fetched).
    New image_shrink_retry_attempted flag in the retry loop fires the
    shrink exactly once per turn after credential-pool recovery but
    before auth retries.
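
The shrink-once control flow can be sketched in isolation. Everything here is
hypothetical scaffolding (ImageTooLarge, call_with_shrink_retry, the fake
provider and shrinker); the real loop in run_agent.py also handles credential
and auth recovery, which this sketch omits.

```python
from typing import Callable, List


class ImageTooLarge(Exception):
    """Hypothetical stand-in for an error classified as image_too_large."""


def call_with_shrink_retry(
    send: Callable[[List[dict]], str],
    shrink: Callable[[List[dict]], bool],
    messages: List[dict],
) -> str:
    shrink_attempted = False
    while True:
        try:
            return send(messages)
        except ImageTooLarge:
            # Shrink at most once per turn; re-raise if it was already
            # tried or if the shrinker reports that nothing changed.
            if shrink_attempted or not shrink(messages):
                raise
            shrink_attempted = True


def fake_send(msgs: List[dict]) -> str:
    # Pretend provider with a 5 MB per-image limit.
    if any(m["size_mb"] > 5 for m in msgs):
        raise ImageTooLarge("image exceeds 5 MB maximum")
    return "ok"


def fake_shrink(msgs: List[dict]) -> bool:
    # Mutates in place and returns True only if something was resized.
    changed = False
    for m in msgs:
        if m["size_mb"] > 4:
            m["size_mb"] = 4
            changed = True
    return changed


msgs = [{"size_mb": 24}]
result = call_with_shrink_retry(fake_send, fake_shrink, msgs)
```

Returning a bool from the shrinker lets the caller surface the original
provider error when nothing could be resized, instead of retrying an identical
request.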

E2E verified live against Anthropic claude-sonnet-4-6:
  - 17.9MB PNG (23.9MB b64) attached at native size
  - Anthropic returns 400 "image exceeds 5 MB maximum"
  - Agent logs '📐 Image(s) exceeded provider size limit — shrank and
    retrying...'
  - Retry succeeds, correct response delivered in 6.8s total.
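
The 17.9MB to 23.9MB jump above is plain base64 overhead (4 output chars per 3
input bytes). A quick check, assuming decimal megabytes and a round 17.9 MB
file size (the exact byte count of the test image is not stated here):

```python
import base64


def base64_len(raw_bytes: int) -> int:
    """Length of standard padded base64 for raw_bytes input bytes."""
    return 4 * ((raw_bytes + 2) // 3)


# Sanity-check the formula against the stdlib encoder on a small payload.
sample = b"x" * 1000
matches_stdlib = len(base64.b64encode(sample)) == base64_len(len(sample))

raw = 17_900_000  # assumed file size in bytes
encoded_mb = base64_len(raw) / 1_000_000  # roughly 23.9
```

The same ratio applies in reverse: a 4 MB base64 budget corresponds to about
3 MB of raw image bytes.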

Tests: 12 new (8 shrink-helper shapes + 4 classifier signals),
replaces 5 proactive-ceiling tests with 3 simpler 'native attach works'
tests. 181 targeted tests pass. test_enum_members_exist in
test_error_classifier.py updated for the new enum value.
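
Why the classifier order matters, as a minimal sketch. The image patterns are
the ones listed in this commit; the "exceeds" entry in the context-overflow
list is a hypothetical stand-in for whatever broader pattern overlaps in the
real table, and classify_400_sketch is not the real _classify_400 signature.

```python
IMAGE_TOO_LARGE_PATTERNS = [
    "image exceeds",
    "image too large",
    "image_too_large",
    "image size exceeds",
]
# "context length" appears in the real list; "exceeds" is an assumed overlap.
CONTEXT_OVERFLOW_PATTERNS = ["context length", "exceeds"]


def classify_400_sketch(error_msg: str) -> str:
    msg = error_msg.lower()
    # Image check first: shrinking one image is cheaper than compressing context.
    if any(p in msg for p in IMAGE_TOO_LARGE_PATTERNS):
        return "image_too_large"
    if any(p in msg for p in CONTEXT_OVERFLOW_PATTERNS):
        return "context_overflow"
    return "unknown"


anthropic_msg = "messages.0.content.1.image.source.base64: image exceeds 5 MB maximum"
image_case = classify_400_sketch(anthropic_msg)
context_case = classify_400_sketch("prompt exceeds context length")
```

With the two checks swapped, the Anthropic message would match the assumed
"exceeds" pattern and trigger compression instead of a shrink.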
This commit is contained in:
Teknium
2026-04-27 06:27:59 -07:00
committed by GitHub
parent df3c9593f8
commit ec671c4154
14 changed files with 1539 additions and 42 deletions


@@ -61,9 +61,52 @@ _PRUNED_TOOL_PLACEHOLDER = "[Old tool output cleared to save context space]"
# Chars per token rough estimate
_CHARS_PER_TOKEN = 4
# Flat token cost per attached image part. Real cost varies by provider and
# dimensions (Anthropic ≈ width×height/750, GPT-4o up to ~1700 for
# high-detail 2048×2048, Gemini 258/tile), but 1600 is a realistic ceiling
# that keeps compression budgeting honest for multi-image conversations.
# Matches Claude Code's IMAGE_TOKEN_ESTIMATE constant.
_IMAGE_TOKEN_ESTIMATE = 1600
# Same figure expressed in the char-budget currency the rest of the
# compressor speaks in. Used when accumulating message "content length"
# for tail-cut decisions.
_IMAGE_CHAR_EQUIVALENT = _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
_SUMMARY_FAILURE_COOLDOWN_SECONDS = 600


def _content_length_for_budget(raw_content: Any) -> int:
    """Return the effective char-length of a message's content for token budgeting.

    Plain strings: ``len(content)``. Multimodal lists: sum of text-part
    ``len(text)`` plus a flat ``_IMAGE_CHAR_EQUIVALENT`` per image part
    (``image_url`` / ``input_image`` / Anthropic-style ``image``). This
    keeps the compressor from treating a turn with 5 attached images as
    near-zero tokens just because the text part is empty.
    """
    if isinstance(raw_content, str):
        return len(raw_content)
    if not isinstance(raw_content, list):
        return len(str(raw_content or ""))
    total = 0
    for p in raw_content:
        if isinstance(p, str):
            total += len(p)
            continue
        if not isinstance(p, dict):
            total += len(str(p))
            continue
        ptype = p.get("type")
        if ptype in {"image_url", "input_image", "image"}:
            total += _IMAGE_CHAR_EQUIVALENT
        else:
            # text / input_text / tool_result-with-text / anything else with
            # a text field. The raw base64 payload inside image_url dicts is
            # never counted: dimensions don't matter, only whether it's an image.
            total += len(p.get("text", "") or "")
    return total


def _content_text_for_contains(content: Any) -> str:
"""Return a best-effort text view of message content.
@@ -484,18 +527,7 @@ class ContextCompressor(ContextEngine):
for i in range(len(result) - 1, -1, -1):
msg = result[i]
raw_content = msg.get("content") or ""
content_len = (
sum(
len(p.get("text", ""))
if isinstance(p, dict)
else len(p)
if isinstance(p, str)
else len(str(p))
for p in raw_content
)
if isinstance(raw_content, list)
else len(raw_content)
)
content_len = _content_length_for_budget(raw_content)
msg_tokens = content_len // _CHARS_PER_TOKEN + 10
for tc in msg.get("tool_calls") or []:
if isinstance(tc, dict):
@@ -1094,18 +1126,7 @@ The user has requested that this compaction PRIORITISE preserving all informatio
for i in range(n - 1, head_end - 1, -1):
msg = messages[i]
raw_content = msg.get("content") or ""
content_len = (
sum(
len(p.get("text", ""))
if isinstance(p, dict)
else len(p)
if isinstance(p, str)
else len(str(p))
for p in raw_content
)
if isinstance(raw_content, list)
else len(raw_content)
)
content_len = _content_length_for_budget(raw_content)
msg_tokens = content_len // _CHARS_PER_TOKEN + 10 # +10 for role/metadata
# Include tool call arguments in estimate
for tc in msg.get("tool_calls") or []:


@@ -42,6 +42,7 @@ class FailoverReason(enum.Enum):
# Context / payload
context_overflow = "context_overflow" # Context too large — compress, not failover
payload_too_large = "payload_too_large" # 413 — compress payload
image_too_large = "image_too_large" # Native image part exceeds provider's per-image limit — shrink and retry
# Model
model_not_found = "model_not_found" # 404 or invalid model — fallback to different model
@@ -147,6 +148,20 @@ _PAYLOAD_TOO_LARGE_PATTERNS = [
"error code: 413",
]
# Image-size patterns. Matched against 400 bodies (not 413) because most
# providers return a 400 with a specific image-too-big message before the
# whole request hits the 413 size limit. Anthropic's wording is the most
# important here (hard 5 MB per image, returned as
# "messages.N.content.K.image.source.base64: image exceeds 5 MB maximum").
_IMAGE_TOO_LARGE_PATTERNS = [
"image exceeds", # Anthropic: "image exceeds 5 MB maximum"
"image too large", # generic
"image_too_large", # error_code variant
"image size exceeds", # variant
# "request_too_large" on a request known to contain an image → image is
# the likely culprit; we still try the shrink path before giving up.
]
# Context overflow patterns
_CONTEXT_OVERFLOW_PATTERNS = [
"context length",
@@ -671,6 +686,15 @@ def _classify_400(
) -> ClassifiedError:
"""Classify 400 Bad Request — context overflow, format error, or generic."""
# Image-too-large from 400 (Anthropic's 5 MB per-image check fires this way).
# Must be checked BEFORE context_overflow because messages can trip both
# patterns ("exceeds" + "image") and image-shrink is a cheaper recovery.
if any(p in error_msg for p in _IMAGE_TOO_LARGE_PATTERNS):
return result_fn(
FailoverReason.image_too_large,
retryable=True,
)
# Context overflow from 400
if any(p in error_msg for p in _CONTEXT_OVERFLOW_PATTERNS):
return result_fn(
@@ -798,6 +822,13 @@ def _classify_by_message(
should_compress=True,
)
# Image-too-large patterns (from message text when no status_code)
if any(p in error_msg for p in _IMAGE_TOO_LARGE_PATTERNS):
return result_fn(
FailoverReason.image_too_large,
retryable=True,
)
# Usage-limit patterns need the same disambiguation as 402: some providers
# surface "usage limit" errors without an HTTP status code. A transient
# signal ("try again", "resets at", …) means it's a periodic quota, not

agent/image_routing.py (new file, 236 lines)

@@ -0,0 +1,236 @@
"""Routing helpers for inbound user-attached images.

Two modes:

  native: attach images as OpenAI-style ``image_url`` content parts on the
          user turn. Provider adapters (Anthropic, Gemini, Bedrock, Codex,
          OpenAI chat.completions) already translate these into their
          vendor-specific multimodal formats.

  text:   run ``vision_analyze`` on each image up-front and prepend the
          description to the user's text. The model never sees the pixels;
          it only sees a lossy text summary. This is the pre-existing
          behaviour and still the right choice for non-vision models.

The decision is made once per message turn by :func:`decide_image_input_mode`.
It reads ``agent.image_input_mode`` from config.yaml (``auto`` | ``native``
| ``text``, default ``auto``) and the active model's capability metadata.

In ``auto`` mode:

  - If the user has explicitly configured an ``auxiliary.vision`` backend
    (a ``provider`` other than ``auto``/empty, or any ``model`` or
    ``base_url``), we assume they want the text pipeline regardless of the
    main model: they've opted in to a specific vision backend for a reason
    (cost, quality, local-only, etc.).
  - Otherwise, if the active model reports ``supports_vision=True`` in its
    models.dev metadata, we attach natively.
  - Otherwise (non-vision model or unknown capabilities, no explicit
    override), we fall back to text.

This keeps ``vision_analyze`` surfaced as a tool in every session, so skills
and agent flows that chain it (browser screenshots, deeper inspection of
URL-referenced images, style-gating loops) keep working. The routing only
affects *how user-attached images on the current turn* are presented to the
main model.
"""
from __future__ import annotations
import base64
import logging
import mimetypes
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
logger = logging.getLogger(__name__)
_VALID_MODES = frozenset({"auto", "native", "text"})


def _coerce_mode(raw: Any) -> str:
    """Normalize a config value into one of the valid modes."""
    if not isinstance(raw, str):
        return "auto"
    val = raw.strip().lower()
    if val in _VALID_MODES:
        return val
    return "auto"


def _explicit_aux_vision_override(cfg: Optional[Dict[str, Any]]) -> bool:
    """True when the user configured a specific auxiliary vision backend.

    An explicit override means the user *wants* the text pipeline (they're
    paying for a dedicated vision model), so we don't silently bypass it.
    """
    if not isinstance(cfg, dict):
        return False
    aux = cfg.get("auxiliary") or {}
    if not isinstance(aux, dict):
        return False
    vision = aux.get("vision") or {}
    if not isinstance(vision, dict):
        return False
    provider = str(vision.get("provider") or "").strip().lower()
    model = str(vision.get("model") or "").strip()
    base_url = str(vision.get("base_url") or "").strip()
    # "auto" / "" / blank = not explicit
    if provider in ("", "auto") and not model and not base_url:
        return False
    return True


def _lookup_supports_vision(provider: str, model: str) -> Optional[bool]:
    """Return True/False if we can resolve caps, None if unknown."""
    if not provider or not model:
        return None
    try:
        from agent.models_dev import get_model_capabilities

        caps = get_model_capabilities(provider, model)
    except Exception as exc:  # pragma: no cover - defensive
        # Separate the model slug from the exception text so the log line is readable.
        logger.debug("image_routing: caps lookup failed for %s:%s (%s)", provider, model, exc)
        return None
    if caps is None:
        return None
    return bool(caps.supports_vision)


def decide_image_input_mode(
    provider: str,
    model: str,
    cfg: Optional[Dict[str, Any]],
) -> str:
    """Return ``"native"`` or ``"text"`` for the given turn.

    Args:
        provider: active inference provider ID (e.g. ``"anthropic"``, ``"openrouter"``).
        model: active model slug as it would be sent to the provider.
        cfg: loaded config.yaml dict, or None. When None, behaves as auto.
    """
    mode_cfg = "auto"
    if isinstance(cfg, dict):
        agent_cfg = cfg.get("agent") or {}
        if isinstance(agent_cfg, dict):
            mode_cfg = _coerce_mode(agent_cfg.get("image_input_mode"))
    if mode_cfg == "native":
        return "native"
    if mode_cfg == "text":
        return "text"
    # auto
    if _explicit_aux_vision_override(cfg):
        return "text"
    supports = _lookup_supports_vision(provider, model)
    if supports is True:
        return "native"
    return "text"


# Image size handling is REACTIVE rather than proactive: we attempt native
# attachment at full size regardless of provider, and rely on
# ``run_agent._try_shrink_image_parts_in_messages`` to shrink + retry if
# the provider rejects the request (e.g. Anthropic's hard 5 MB per-image
# ceiling returned as HTTP 400 "image exceeds 5 MB maximum").
#
# Why reactive: our knowledge of provider ceilings is partial and evolving
# (OpenAI accepts 49 MB+, Anthropic 5 MB, Gemini 100 MB, others unknown).
# A proactive per-provider table would be stale the moment a provider raises
# or lowers its limit, and silently degrading quality for users on providers
# that would have accepted the full image is the worse failure mode.
# The shrink-on-reject path loses 1 API call + maybe 1s of Pillow work when
# it fires, which is cheaper than permanent quality loss.


def _guess_mime(path: Path) -> str:
    mime, _ = mimetypes.guess_type(str(path))
    if mime and mime.startswith("image/"):
        return mime
    # mimetypes on some Linux distros mis-maps .jpg; default to jpeg when
    # the suffix looks imagey.
    suffix = path.suffix.lower()
    return {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp",
        ".bmp": "image/bmp",
    }.get(suffix, "image/jpeg")


def _file_to_data_url(path: Path) -> Optional[str]:
    """Encode a local image as a base64 data URL at its native size.

    Size limits are NOT enforced here: the agent retry loop
    (``run_agent._try_shrink_image_parts_in_messages``) shrinks on the
    provider's first rejection. Keeping this simple means providers that
    accept large images (OpenAI 49 MB+, Gemini 100 MB) don't pay a silent
    quality tax just because one other provider is stricter.

    Returns None only if the file can't be read (missing, permission
    denied, etc.); the caller reports those paths in ``skipped``.
    """
    try:
        raw = path.read_bytes()
    except Exception as exc:
        # Separate the path from the exception text so the log line is readable.
        logger.warning("image_routing: failed to read %s: %s", path, exc)
        return None
    mime = _guess_mime(path)
    b64 = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_native_content_parts(
    user_text: str,
    image_paths: List[str],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Build an OpenAI-style ``content`` list for a user turn.

    Shape:
        [{"type": "text", "text": "..."},
         {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
         ...]

    Images are attached at their native size. If a provider rejects the
    request because an image is too large (e.g. Anthropic's 5 MB per-image
    ceiling), the agent's retry loop transparently shrinks and retries
    once; see ``run_agent._try_shrink_image_parts_in_messages``.

    Returns (content_parts, skipped_paths). Skipped paths are files that
    couldn't be read from disk.
    """
    parts: List[Dict[str, Any]] = []
    skipped: List[str] = []
    text = (user_text or "").strip()
    if text:
        parts.append({"type": "text", "text": text})
    for raw_path in image_paths:
        p = Path(raw_path)
        if not p.exists() or not p.is_file():
            skipped.append(str(raw_path))
            continue
        data_url = _file_to_data_url(p)
        if not data_url:
            skipped.append(str(raw_path))
            continue
        parts.append({
            "type": "image_url",
            "image_url": {"url": data_url},
        })
    # If the text was empty, add a neutral prompt so the turn isn't just images.
    if not text and any(p.get("type") == "image_url" for p in parts):
        parts.insert(0, {"type": "text", "text": "What do you see in this image?"})
    return parts, skipped


__all__ = [
"decide_image_input_mode",
"build_native_content_parts",
]

cli.py (61 lines changed)

@@ -8433,13 +8433,62 @@ class HermesCLI:
):
return None
# Pre-process images through the vision tool (Gemini Flash) so the
# main model receives text descriptions instead of raw base64 image
# content — works with any model, not just vision-capable ones.
# Route image attachments based on the active model's vision capability.
# "native" → pass pixels as OpenAI-style content parts (adapters
# translate for Anthropic/Gemini/Bedrock).
# "text" → pre-analyze each image with vision_analyze and prepend the
# description as text — works with non-vision models.
# See agent/image_routing.py for the decision table.
if images:
message = self._preprocess_images_with_vision(
message if isinstance(message, str) else "", images
)
try:
from agent.image_routing import (
build_native_content_parts,
decide_image_input_mode,
)
from hermes_cli.config import load_config
_img_mode = decide_image_input_mode(
(self.provider or "").strip(),
(self.model or "").strip(),
load_config(),
)
except Exception as _img_exc:
logging.debug("image_routing decision failed, defaulting to text: %s", _img_exc)
_img_mode = "text"
if _img_mode == "native":
try:
_text_for_parts = message if isinstance(message, str) else ""
_img_str_paths = [str(p) for p in images]
_parts, _skipped = build_native_content_parts(
_text_for_parts,
_img_str_paths,
)
if _skipped:
_cprint(
f" {_DIM}⚠ skipped {len(_skipped)} unreadable image path(s){_RST}"
)
if any(p.get("type") == "image_url" for p in _parts):
_img_names = ", ".join(Path(p).name for p in _img_str_paths)
_cprint(
f" {_DIM}📎 attaching {len(images)} image(s) natively "
f"(model supports vision): {_img_names}{_RST}"
)
message = _parts
else:
# All images unreadable — fall back to text enrichment.
message = self._preprocess_images_with_vision(
message if isinstance(message, str) else "", images
)
except Exception as _img_exc:
logging.warning("native image attach failed, falling back to text: %s", _img_exc)
message = self._preprocess_images_with_vision(
message if isinstance(message, str) else "", images
)
else:
message = self._preprocess_images_with_vision(
message if isinstance(message, str) else "", images
)
# Expand @ context references (e.g. @file:main.py, @diff, @folder:src/)
if isinstance(message, str) and "@" in message:


@@ -4199,9 +4199,18 @@ class GatewayRunner:
Keep the normal inbound path and the queued follow-up path on the same
preprocessing pipeline so sender attribution, image enrichment, STT,
document notes, reply context, and @ references all behave the same.
Side effect: writes ``self._pending_native_image_paths`` to a list of
local image paths when the active model supports native vision AND
the user has images attached. The caller consumes and clears this
attribute at the ``run_conversation`` site to build a multimodal user
turn. When the list is empty, the ``_enrich_message_with_vision``
text path has already run and images are represented in-text.
"""
history = history or []
message_text = event.text or ""
# Reset per-call buffer; set only when native routing is chosen.
self._pending_native_image_paths = []
_is_shared_multi_user = is_shared_multi_user_session(
source,
@@ -4222,10 +4231,25 @@ class GatewayRunner:
audio_paths.append(path)
if image_paths:
message_text = await self._enrich_message_with_vision(
message_text,
image_paths,
)
# Decide routing: native (attach pixels) vs text (vision_analyze
# pre-run + prepend description). See agent/image_routing.py.
_img_mode = self._decide_image_input_mode()
if _img_mode == "native":
# Defer attachment to the run_conversation call site.
self._pending_native_image_paths = list(image_paths)
logger.info(
"Image routing: native (model supports vision). %d image(s) will be attached inline.",
len(image_paths),
)
else:
logger.info(
"Image routing: text (mode=%s). Pre-analyzing %d image(s) via vision_analyze.",
_img_mode, len(image_paths),
)
message_text = await self._enrich_message_with_vision(
message_text,
image_paths,
)
if audio_paths:
message_text = await self._enrich_message_with_transcription(
@@ -8378,6 +8402,29 @@ class GatewayRunner:
ctx = copy_context()
return await loop.run_in_executor(None, ctx.run, func, *args)
def _decide_image_input_mode(self) -> str:
"""Resolve the image-input routing for the currently active model.
Returns ``"native"`` (attach pixels on the user turn) or ``"text"``
(pre-analyze with vision_analyze and prepend the description). See
agent/image_routing.py for the full decision table.
The active provider/model are read from config.yaml so the decision
tracks ``/model`` switches automatically on the next message.
"""
try:
from agent.image_routing import decide_image_input_mode
from agent.auxiliary_client import _read_main_model, _read_main_provider
from hermes_cli.config import load_config
cfg = load_config()
provider = _read_main_provider()
model = _read_main_model()
return decide_image_input_mode(provider, model, cfg)
except Exception as exc:
logger.debug("image_routing: decision failed, falling back to text — %s", exc)
return "text"
async def _enrich_message_with_vision(
self,
user_text: str,
@@ -10394,7 +10441,39 @@ class GatewayRunner:
_approval_session_token = set_current_session_key(_approval_session_key)
register_gateway_notify(_approval_session_key, _approval_notify_sync)
try:
result = agent.run_conversation(message, conversation_history=agent_history, task_id=session_id)
# If _prepare_inbound_message_text buffered image paths for native
# attachment, wrap the user turn as an OpenAI-style multimodal
# content list. Consume-and-clear so subsequent turns on the same
# runner instance don't re-attach stale images.
_native_imgs = list(getattr(self, "_pending_native_image_paths", []) or [])
self._pending_native_image_paths = []
if _native_imgs:
try:
from agent.image_routing import build_native_content_parts
_parts, _skipped = build_native_content_parts(
message,
_native_imgs,
)
if _skipped:
logger.warning(
"Native image attachment: skipped %d unreadable path(s): %s",
len(_skipped), _skipped,
)
if any(p.get("type") == "image_url" for p in _parts):
_run_message: Any = _parts
else:
# All images failed to read — fall back to plain text.
_run_message = message
except Exception as _img_exc:
logger.warning(
"Native image attachment failed, falling back to text: %s",
_img_exc,
)
_run_message = message
else:
_run_message = message
result = agent.run_conversation(_run_message, conversation_history=agent_history, task_id=session_id)
finally:
unregister_gateway_notify(_approval_session_key)
reset_current_session_key(_approval_session_token)


@@ -389,6 +389,20 @@ DEFAULT_CONFIG = {
# (60+ tool iterations with tiny output) before users assume the
# bot is dead and /restart.
"gateway_notify_interval": 180,
# How user-attached images are presented to the main model on each turn.
# "auto" — attach natively when the active model reports
# supports_vision=True AND the user hasn't explicitly
# configured an auxiliary.vision backend (provider, model,
# or base_url). Otherwise fall back to text (vision_analyze
# pre-analysis).
# "native" — always attach natively; non-vision models will either
# error at the provider or get a last-chance text fallback
# (see run_agent._prepare_messages_for_non_vision_model).
# "text" — always pre-analyze with vision_analyze and prepend the
# description as text; the main model never sees pixels.
# Affects gateway platforms, the TUI, and CLI /attach. vision_analyze
# remains available as a tool regardless of this setting — the routing
# only controls how inbound user images are presented.
"image_input_mode": "auto",
},
"terminal": {


@@ -7287,6 +7287,26 @@ class AIAgent:
self._anthropic_image_fallback_cache[cache_key] = note
return note
def _model_supports_vision(self) -> bool:
"""Return True if the active provider+model reports native vision.
Used to decide whether to strip image content parts from API-bound
messages (for non-vision models) or let the provider adapter handle
them natively (for vision-capable models).
"""
try:
from agent.models_dev import get_model_capabilities
provider = (getattr(self, "provider", "") or "").strip()
model = (getattr(self, "model", "") or "").strip()
if not provider or not model:
return False
caps = get_model_capabilities(provider, model)
if caps is None:
return False
return bool(caps.supports_vision)
except Exception:
return False
def _preprocess_anthropic_content(self, content: Any, role: str) -> Any:
if not self._content_has_image_parts(content):
return content
@@ -7350,12 +7370,23 @@ class AIAgent:
return t
def _prepare_anthropic_messages_for_api(self, api_messages: list) -> list:
# Fast exit when no message carries image content at all.
if not any(
isinstance(msg, dict) and self._content_has_image_parts(msg.get("content"))
for msg in api_messages
):
return api_messages
# The Anthropic adapter (agent/anthropic_adapter.py:_convert_content_part_to_anthropic)
# already translates OpenAI-style image_url/input_image parts into
# native Anthropic ``{"type": "image", "source": ...}`` blocks. When
# the active model supports vision we let the adapter do its job and
# skip this legacy text-fallback preprocessor entirely.
if self._model_supports_vision():
return api_messages
# Non-vision Anthropic model (rare today, but keep the fallback for
# compat): replace each image part with a vision_analyze text note.
transformed = copy.deepcopy(api_messages)
for msg in transformed:
if not isinstance(msg, dict):
@@ -7366,6 +7397,150 @@ class AIAgent:
)
return transformed
def _prepare_messages_for_non_vision_model(self, api_messages: list) -> list:
"""Strip native image parts when the active model lacks vision.
Runs on the chat.completions / codex_responses paths. Vision-capable
models pass through unchanged (provider and any downstream translator
handle the image parts natively). Non-vision models get each image
replaced by a cached vision_analyze text description so the turn
doesn't fail with "model does not support image input".
"""
if not any(
isinstance(msg, dict) and self._content_has_image_parts(msg.get("content"))
for msg in api_messages
):
return api_messages
if self._model_supports_vision():
return api_messages
transformed = copy.deepcopy(api_messages)
for msg in transformed:
if not isinstance(msg, dict):
continue
# Reuse the Anthropic text-fallback preprocessor — the behaviour is
# identical (walk content parts, replace images with cached
# descriptions, merge back into a single text or structured
# content). Naming is historical.
msg["content"] = self._preprocess_anthropic_content(
msg.get("content"),
str(msg.get("role", "user") or "user"),
)
return transformed
def _try_shrink_image_parts_in_messages(self, api_messages: list) -> bool:
"""Re-encode oversized native image parts at a smaller size to recover from
image-too-large errors (Anthropic's 5 MB limit; other providers' limits unknown).
Mutates ``api_messages`` in place. Returns True if any image part was
actually replaced, False if there were no image parts to shrink or
Pillow couldn't help (caller should surface the original error).
Strategy: look for ``image_url`` / ``input_image`` parts carrying a
``data:image/...;base64,...`` payload. For each one whose encoded
size exceeds 4 MB (a safe target that slides under Anthropic's 5 MB
ceiling with header overhead), write the base64 to a tempfile, call
``vision_tools._resize_image_for_vision`` to produce a smaller data
URL, and substitute it in place.
Non-data-URL images (http/https URLs) are not touched — the provider
fetches those itself and the size limit is different.
"""
if not api_messages:
return False
try:
from tools.vision_tools import _resize_image_for_vision
except Exception as exc:
logger.warning("image-shrink recovery: vision_tools unavailable — %s", exc)
return False
# 4 MB target leaves comfortable headroom under Anthropic's 5 MB.
# Other providers have not been observed rejecting images and appear to
# accept much larger payloads. Shrinking to 4 MB loses quality, but this
# path only runs after a confirmed provider rejection, so the alternative
# is a failed request.
target_bytes = 4 * 1024 * 1024
changed_count = 0
def _shrink_data_url(url: str) -> Optional[str]:
"""Return a smaller data URL, or None if shrink can't help."""
if not isinstance(url, str) or not url.startswith("data:"):
return None
if len(url) <= target_bytes:
# This specific image wasn't the oversized one.
return None
try:
header, _, data = url.partition(",")
mime = "image/jpeg"
if header.startswith("data:"):
mime_part = header[len("data:"):].split(";", 1)[0].strip()
if mime_part.startswith("image/"):
mime = mime_part
import base64 as _b64
raw = _b64.b64decode(data)
suffix = {
"image/png": ".png", "image/gif": ".gif", "image/webp": ".webp",
"image/jpeg": ".jpg", "image/jpg": ".jpg", "image/bmp": ".bmp",
}.get(mime, ".jpg")
tmp = tempfile.NamedTemporaryFile(
prefix="hermes_shrink_", suffix=suffix, delete=False,
)
try:
tmp.write(raw)
tmp.close()
resized = _resize_image_for_vision(
Path(tmp.name),
mime_type=mime,
max_base64_bytes=target_bytes,
)
finally:
try:
Path(tmp.name).unlink(missing_ok=True)
except Exception:
pass
if not resized or len(resized) >= len(url):
# Shrink didn't help (or made it bigger — corrupt input?).
return None
return resized
except Exception as exc:
logger.warning("image-shrink recovery: re-encode failed — %s", exc)
return None
for msg in api_messages:
if not isinstance(msg, dict):
continue
content = msg.get("content")
if not isinstance(content, list):
continue
for part in content:
if not isinstance(part, dict):
continue
ptype = part.get("type")
if ptype not in {"image_url", "input_image"}:
continue
image_value = part.get("image_url")
# OpenAI chat.completions: {"image_url": {"url": "data:..."}}
# OpenAI Responses: {"image_url": "data:..."}
if isinstance(image_value, dict):
url = image_value.get("url", "")
resized = _shrink_data_url(url)
if resized:
image_value["url"] = resized
changed_count += 1
elif isinstance(image_value, str):
resized = _shrink_data_url(image_value)
if resized:
part["image_url"] = resized
changed_count += 1
if changed_count:
logger.info(
"image-shrink recovery: re-encoded %d image part(s) to fit under %.0f MB",
changed_count, target_bytes / (1024 * 1024),
)
return changed_count > 0
def _anthropic_preserve_dots(self) -> bool:
"""True when using an anthropic-compatible endpoint that preserves dots in model names.
Alibaba/DashScope keeps dots (e.g. qwen3.5-plus).
@@ -7514,9 +7689,10 @@ class AIAgent:
)
)
is_xai_responses = self.provider == "xai" or self._base_url_hostname == "api.x.ai"
_msgs_for_codex = self._prepare_messages_for_non_vision_model(api_messages)
return _ct.build_kwargs(
model=self.model,
-messages=api_messages,
+messages=_msgs_for_codex,
tools=self.tools,
reasoning_config=self.reasoning_config,
session_id=getattr(self, "session_id", None),
@@ -7595,9 +7771,12 @@ class AIAgent:
if _ephemeral_out is not None:
self._ephemeral_max_output_tokens = None
# Strip image parts for non-vision models (no-op when vision-capable).
_msgs_for_chat = self._prepare_messages_for_non_vision_model(api_messages)
return _ct.build_kwargs(
model=self.model,
-messages=api_messages,
+messages=_msgs_for_chat,
tools=self.tools,
timeout=self._resolved_api_call_timeout(),
max_tokens=self.max_tokens,
@@ -9891,6 +10070,7 @@ class AIAgent:
nous_auth_retry_attempted=False
copilot_auth_retry_attempted=False
thinking_sig_retry_attempted = False
image_shrink_retry_attempted = False
has_retried_429 = False
restart_with_compressed_messages = False
restart_with_length_continuation = False
@@ -10812,6 +10992,31 @@ class AIAgent:
)
if recovered_with_pool:
continue
# Image-too-large recovery: shrink oversized native image
# parts in-place and retry once. Triggered by Anthropic's
# per-image 5 MB ceiling (400 with "image exceeds 5 MB
# maximum") or any other provider that complains about
# image size. If shrink fails or a second attempt still
# fails, fall through to normal error handling.
if (
classified.reason == FailoverReason.image_too_large
and not image_shrink_retry_attempted
):
image_shrink_retry_attempted = True
if self._try_shrink_image_parts_in_messages(api_messages):
self._vprint(
f"{self.log_prefix}📐 Image(s) exceeded provider size limit — "
f"shrank and retrying...",
force=True,
)
continue
else:
logger.info(
"image-shrink recovery: no data-URL image parts found "
"or shrink didn't reduce size; surfacing original error."
)
if (
self.api_mode == "codex_responses"
and self.provider == "openai-codex"
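For review purposes, the part-walking logic of `_try_shrink_image_parts_in_messages` can be sketched in isolation. This is a minimal sketch under stated assumptions, not the code in the diff: `shrink_oversized_parts` is a hypothetical name, and the injected `shrink` callable stands in for the tempfile round trip through `_resize_image_for_vision`.

```python
from typing import Callable, Optional

TARGET_BYTES = 4 * 1024 * 1024  # same 4 MB target the diff uses


def shrink_oversized_parts(
    messages: Optional[list],
    shrink: Callable[[str], Optional[str]],
    target: int = TARGET_BYTES,
) -> int:
    """Replace oversized data-URL image parts in place; return how many changed.

    `shrink` is a hypothetical stand-in for the real resizer: it takes a data
    URL and returns a smaller one, or None when it cannot help.
    """
    changed = 0
    for msg in messages or []:
        content = msg.get("content") if isinstance(msg, dict) else None
        if not isinstance(content, list):
            continue
        for part in content:
            if not isinstance(part, dict):
                continue
            if part.get("type") not in {"image_url", "input_image"}:
                continue
            value = part.get("image_url")
            # chat.completions nests the URL in a dict; Responses uses a bare string.
            url = value.get("url", "") if isinstance(value, dict) else value
            if not isinstance(url, str) or not url.startswith("data:"):
                continue  # http(s) URLs are fetched by the provider itself
            if len(url) <= target:
                continue  # this image was not the oversized one
            smaller = shrink(url)
            if not smaller or len(smaller) >= len(url):
                continue  # shrink did not help; keep the original URL
            if isinstance(value, dict):
                value["url"] = smaller
            else:
                part["image_url"] = smaller
            changed += 1
    return changed
```

Returning a count rather than a bool keeps the sketch easy to assert on; the method in the diff reports `changed_count > 0` to the retry loop instead.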

@@ -0,0 +1,141 @@
"""Tests for image-token accounting in the context compressor.
Covers the native-image-routing PR's companion change: the compressor's
multimodal message length counter now charges ~1600 tokens per attached
image part instead of 0, so tail-cut / prune decisions are accurate for
creative workflows that iterate on images across many turns.
"""
from __future__ import annotations
import pytest
from agent.context_compressor import (
_CHARS_PER_TOKEN,
_IMAGE_CHAR_EQUIVALENT,
_IMAGE_TOKEN_ESTIMATE,
_content_length_for_budget,
)
class TestContentLengthForBudget:
def test_plain_string(self):
assert _content_length_for_budget("hello world") == 11
def test_empty_string(self):
assert _content_length_for_budget("") == 0
def test_none_coerces_to_zero(self):
assert _content_length_for_budget(None) == 0
def test_text_only_list(self):
content = [
{"type": "text", "text": "first"},
{"type": "text", "text": "second"},
]
assert _content_length_for_budget(content) == 5 + 6
def test_single_image_part_charges_fixed_budget(self):
content = [
{"type": "text", "text": "look"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,XXXX"}},
]
# 4 chars of text + 1 image at fixed char-equivalent
assert _content_length_for_budget(content) == 4 + _IMAGE_CHAR_EQUIVALENT
def test_image_url_raw_base64_is_not_counted_as_chars(self):
"""A 1MB base64 blob inside an image_url must NOT inflate token count.
The flat image estimate is what the provider actually bills; the raw
base64 is transport payload, not context tokens.
"""
huge_url = "data:image/png;base64," + ("A" * 1_000_000)
content = [
{"type": "image_url", "image_url": {"url": huge_url}},
]
# Exactly one image's worth, not 1M + something.
assert _content_length_for_budget(content) == _IMAGE_CHAR_EQUIVALENT
def test_multiple_image_parts(self):
content = [
{"type": "text", "text": "compare"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,BBB"}},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,CCC"}},
]
assert _content_length_for_budget(content) == 7 + 3 * _IMAGE_CHAR_EQUIVALENT
def test_openai_responses_input_image_shape(self):
"""Responses API uses type=input_image with top-level image_url string."""
content = [
{"type": "input_text", "text": "hey"},
{"type": "input_image", "image_url": "data:image/png;base64,XX"},
]
# input_text has .text "hey" (3 chars) + 1 image
assert _content_length_for_budget(content) == 3 + _IMAGE_CHAR_EQUIVALENT
def test_anthropic_native_image_shape(self):
"""Anthropic native shape: {type: image, source: {...}}."""
content = [
{"type": "text", "text": "hi"},
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "XX"}},
]
assert _content_length_for_budget(content) == 2 + _IMAGE_CHAR_EQUIVALENT
def test_bare_string_part_in_list(self):
"""Older code paths sometimes produce mixed list-of-strings content."""
content = ["hello", {"type": "text", "text": "world"}]
assert _content_length_for_budget(content) == 5 + 5
def test_image_estimate_constant_is_reasonable(self):
"""Sanity-check the estimate aligns with real provider billing.
Anthropic ≈ width*height/750 → ~1600 for 1000×1200.
OpenAI GPT-4o high-detail 2048×2048 ≈ 1445.
Gemini 258/tile × 6 tiles for a 2048×2048 ≈ 1548.
Anything in the 800-2000 range is defensible; the upper bound asserted
below is deliberately looser (2500). Enforce bounds so an accidental
edit doesn't drop it to e.g. 16.
"""
assert 800 <= _IMAGE_TOKEN_ESTIMATE <= 2500
assert _IMAGE_CHAR_EQUIVALENT == _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
class TestTokenBudgetWithImages:
"""Integration: the compressor's tail-cut decision now respects image cost."""
def test_image_heavy_turns_count_toward_budget(self):
"""A tail with 5 image-bearing turns should blow past a 5K token budget."""
from agent.context_compressor import ContextCompressor
# Minimal compressor fixture — just enough to call _find_tail_cut_by_tokens
cc = object.__new__(ContextCompressor)
cc.tail_token_budget = 5000
# Build 10 messages: 5 with images, 5 with short text. Without the
# image-tokens fix, the compressor would think all 10 fit in 5K and
# protect them all. With the fix, images alone cost 5 × 1600 = 8K,
# so the tail should be trimmed.
messages = [{"role": "system", "content": "sys"}]
for i in range(5):
messages.append({
"role": "user",
"content": [
{"type": "text", "text": f"turn {i}"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
],
})
messages.append({
"role": "assistant",
"content": f"response {i}",
})
cut = cc._find_tail_cut_by_tokens(messages, head_end=0, token_budget=5000)
# Budget is 5K, soft ceiling 7.5K. 5 images alone = 8000 image-tokens.
# Walking backward, the compressor should stop before including all 5.
# Exact cut depends on text lengths and min_tail, but it MUST be > 1
# (at least some head-side messages should be compressible).
assert cut > 1, (
f"Expected image-heavy tail to be trimmed; compressor placed cut at "
f"{cut} out of {len(messages)} (image tokens were likely ignored)."
)
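The behaviour these tests pin down can be summarised in a short sketch. It is reconstructed from the assertions above, not copied from `agent/context_compressor.py`; the constant values (4 chars per token, 1600 tokens per image) are assumptions that merely satisfy the bounds test, and the handling of non-list, non-string content is a guess.

```python
from typing import Any

# Assumed values: the tests only require 800 <= estimate <= 2500 and
# char_equivalent == estimate * chars_per_token.
CHARS_PER_TOKEN = 4
IMAGE_TOKEN_ESTIMATE = 1600
IMAGE_CHAR_EQUIVALENT = IMAGE_TOKEN_ESTIMATE * CHARS_PER_TOKEN

# OpenAI chat.completions, OpenAI Responses, and Anthropic native shapes.
IMAGE_PART_TYPES = {"image_url", "input_image", "image"}


def content_length_for_budget(content: Any) -> int:
    """Character-equivalent length of a message's content for budgeting."""
    if content is None:
        return 0
    if isinstance(content, str):
        return len(content)
    if not isinstance(content, list):
        return len(str(content))  # assumption: unknown shapes are stringified
    total = 0
    for part in content:
        if isinstance(part, str):
            total += len(part)  # older code paths may emit bare strings
        elif isinstance(part, dict):
            if part.get("type") in IMAGE_PART_TYPES:
                # Flat charge per image; the base64 payload is never measured.
                total += IMAGE_CHAR_EQUIVALENT
            else:
                text = part.get("text")
                total += len(text) if isinstance(text, str) else 0
    return total
```

Charging a flat amount per image keeps a 1 MB base64 payload from being counted as roughly a million characters of context.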

@@ -54,7 +54,7 @@ class TestFailoverReason:
expected = {
"auth", "auth_permanent", "billing", "rate_limit",
"overloaded", "server_error", "timeout",
-"context_overflow", "payload_too_large",
+"context_overflow", "payload_too_large", "image_too_large",
"model_not_found", "format_error",
"provider_policy_blocked",
"thinking_signature", "long_context_tier", "unknown",

@@ -0,0 +1,213 @@
"""Tests for agent/image_routing.py — the per-turn image input mode decision."""
from __future__ import annotations
import base64
from pathlib import Path
from unittest.mock import patch
import pytest
from agent.image_routing import (
_coerce_mode,
_explicit_aux_vision_override,
build_native_content_parts,
decide_image_input_mode,
)
# ─── _coerce_mode ────────────────────────────────────────────────────────────
class TestCoerceMode:
def test_valid_modes_pass_through(self):
assert _coerce_mode("auto") == "auto"
assert _coerce_mode("native") == "native"
assert _coerce_mode("text") == "text"
def test_case_insensitive(self):
assert _coerce_mode("NATIVE") == "native"
assert _coerce_mode("Auto") == "auto"
def test_invalid_falls_back_to_auto(self):
assert _coerce_mode("nonsense") == "auto"
assert _coerce_mode("") == "auto"
assert _coerce_mode(None) == "auto"
assert _coerce_mode(42) == "auto"
def test_strips_whitespace(self):
assert _coerce_mode(" native ") == "native"
# ─── _explicit_aux_vision_override ───────────────────────────────────────────
class TestExplicitAuxVisionOverride:
def test_none_config(self):
assert _explicit_aux_vision_override(None) is False
def test_empty_config(self):
assert _explicit_aux_vision_override({}) is False
def test_default_auto_is_not_explicit(self):
cfg = {"auxiliary": {"vision": {"provider": "auto", "model": "", "base_url": ""}}}
assert _explicit_aux_vision_override(cfg) is False
def test_provider_set_is_explicit(self):
cfg = {"auxiliary": {"vision": {"provider": "openrouter", "model": ""}}}
assert _explicit_aux_vision_override(cfg) is True
def test_model_set_is_explicit(self):
cfg = {"auxiliary": {"vision": {"provider": "auto", "model": "google/gemini-2.5-flash"}}}
assert _explicit_aux_vision_override(cfg) is True
def test_base_url_set_is_explicit(self):
cfg = {"auxiliary": {"vision": {"provider": "auto", "base_url": "http://localhost:11434"}}}
assert _explicit_aux_vision_override(cfg) is True
# ─── decide_image_input_mode ─────────────────────────────────────────────────
class TestDecideImageInputMode:
def test_explicit_native_overrides_everything(self):
cfg = {"agent": {"image_input_mode": "native"}}
# Non-vision model, aux-vision explicitly configured: native still wins.
cfg["auxiliary"] = {"vision": {"provider": "openrouter", "model": "foo"}}
with patch("agent.image_routing._lookup_supports_vision", return_value=False):
assert decide_image_input_mode("openrouter", "some-non-vision-model", cfg) == "native"
def test_explicit_text_overrides_everything(self):
cfg = {"agent": {"image_input_mode": "text"}}
with patch("agent.image_routing._lookup_supports_vision", return_value=True):
assert decide_image_input_mode("anthropic", "claude-sonnet-4", cfg) == "text"
def test_auto_with_vision_capable_model(self):
with patch("agent.image_routing._lookup_supports_vision", return_value=True):
assert decide_image_input_mode("anthropic", "claude-sonnet-4", {}) == "native"
def test_auto_with_non_vision_model(self):
with patch("agent.image_routing._lookup_supports_vision", return_value=False):
assert decide_image_input_mode("openrouter", "qwen/qwen3-235b", {}) == "text"
def test_auto_with_unknown_model(self):
with patch("agent.image_routing._lookup_supports_vision", return_value=None):
assert decide_image_input_mode("openrouter", "brand-new-slug", {}) == "text"
def test_auto_respects_aux_vision_override_even_for_vision_model(self):
"""If the user configured a dedicated vision backend, don't bypass it."""
cfg = {"auxiliary": {"vision": {"provider": "openrouter", "model": "google/gemini-2.5-flash"}}}
with patch("agent.image_routing._lookup_supports_vision", return_value=True):
assert decide_image_input_mode("anthropic", "claude-sonnet-4", cfg) == "text"
def test_none_config_is_auto(self):
with patch("agent.image_routing._lookup_supports_vision", return_value=True):
assert decide_image_input_mode("anthropic", "claude-sonnet-4", None) == "native"
def test_invalid_mode_coerces_to_auto(self):
cfg = {"agent": {"image_input_mode": "weird-value"}}
with patch("agent.image_routing._lookup_supports_vision", return_value=True):
assert decide_image_input_mode("anthropic", "claude-sonnet-4", cfg) == "native"
# ─── build_native_content_parts ──────────────────────────────────────────────
def _png_bytes() -> bytes:
"""Return a tiny valid 1x1 transparent PNG."""
return base64.b64decode(
"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR4nGNgYGBgAAAABQABpfZFQAAAAABJRU5ErkJggg=="
)
class TestBuildNativeContentParts:
def test_text_then_image(self, tmp_path: Path):
img = tmp_path / "cat.png"
img.write_bytes(_png_bytes())
parts, skipped = build_native_content_parts("hello", [str(img)])
assert skipped == []
assert len(parts) == 2
assert parts[0] == {"type": "text", "text": "hello"}
assert parts[1]["type"] == "image_url"
assert parts[1]["image_url"]["url"].startswith("data:image/png;base64,")
def test_empty_text_inserts_default_prompt(self, tmp_path: Path):
img = tmp_path / "cat.jpg"
img.write_bytes(_png_bytes())
parts, skipped = build_native_content_parts("", [str(img)])
assert skipped == []
# Even with empty user text, we insert a neutral prompt so the turn
# isn't just pixels.
assert parts[0]["type"] == "text"
assert parts[0]["text"] == "What do you see in this image?"
assert parts[1]["type"] == "image_url"
def test_missing_file_is_skipped(self, tmp_path: Path):
parts, skipped = build_native_content_parts("hi", [str(tmp_path / "missing.png")])
assert skipped == [str(tmp_path / "missing.png")]
# Only text remains.
assert parts == [{"type": "text", "text": "hi"}]
def test_multiple_images(self, tmp_path: Path):
img1 = tmp_path / "a.png"
img2 = tmp_path / "b.png"
img1.write_bytes(_png_bytes())
img2.write_bytes(_png_bytes())
parts, skipped = build_native_content_parts("compare these", [str(img1), str(img2)])
assert skipped == []
image_parts = [p for p in parts if p.get("type") == "image_url"]
assert len(image_parts) == 2
def test_mime_inference_jpg(self, tmp_path: Path):
img = tmp_path / "photo.jpg"
img.write_bytes(_png_bytes()) # bytes are PNG but extension is jpg
parts, _ = build_native_content_parts("x", [str(img)])
url = parts[1]["image_url"]["url"]
assert url.startswith("data:image/jpeg;base64,")
def test_mime_inference_webp(self, tmp_path: Path):
img = tmp_path / "pic.webp"
img.write_bytes(_png_bytes())
parts, _ = build_native_content_parts("", [str(img)])
url = parts[1]["image_url"]["url"]
assert url.startswith("data:image/webp;base64,")
# ─── Oversize handling ───────────────────────────────────────────────────────
class TestLargeImageHandling:
"""Large images attach at native size; shrink is handled reactively at
retry time in ``run_agent._try_shrink_image_parts_in_messages`` rather
than proactively here.
"""
def test_large_image_passes_through_unchanged(self, tmp_path: Path):
"""A multi-MB image is attached as-is — no resize, no skip."""
from agent import image_routing as _ir
img = tmp_path / "medium.png"
# 200 KB of real bytes; not huge but enough to verify no size gate fires.
img.write_bytes(b"\x89PNG\r\n\x1a\n" + b"X" * 200_000)
url = _ir._file_to_data_url(img)
assert url is not None
assert url.startswith("data:image/png;base64,")
# Base64 expansion means output is ~4/3 of input, plus header.
assert len(url) > 200_000
def test_missing_file_returns_none(self, tmp_path: Path):
from agent import image_routing as _ir
missing = tmp_path / "does_not_exist.png"
assert _ir._file_to_data_url(missing) is None
def test_build_native_parts_no_provider_kwarg(self, tmp_path: Path):
"""build_native_content_parts takes text + paths, no provider kwarg."""
from agent import image_routing as _ir
img = tmp_path / "cat.png"
img.write_bytes(_png_bytes())
parts, skipped = _ir.build_native_content_parts("hi", [str(img)])
assert skipped == []
assert len(parts) == 2
assert parts[0]["type"] == "text"
assert parts[1]["type"] == "image_url"
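The routing rules exercised by `TestCoerceMode`, `TestExplicitAuxVisionOverride`, and `TestDecideImageInputMode` can be condensed into the sketch below. It is inferred from the tests and the commit message rather than taken from `agent/image_routing.py`; the function names drop the leading underscore, and the `supports_vision` callable is a hypothetical injection point replacing the models.dev lookup.

```python
from typing import Callable, Optional

VALID_MODES = {"auto", "native", "text"}


def coerce_mode(raw: object) -> str:
    """Normalise a config value; anything unrecognised becomes 'auto'."""
    if isinstance(raw, str) and raw.strip().lower() in VALID_MODES:
        return raw.strip().lower()
    return "auto"


def explicit_aux_vision_override(cfg: Optional[dict]) -> bool:
    """True when the user configured a dedicated vision backend."""
    vision = ((cfg or {}).get("auxiliary") or {}).get("vision") or {}
    provider = str(vision.get("provider") or "").strip().lower()
    if provider not in ("", "auto"):
        return True
    return bool(vision.get("model") or vision.get("base_url"))


def decide_mode(
    provider: str,
    model: str,
    cfg: Optional[dict],
    supports_vision: Callable[[str, str], Optional[bool]],
) -> str:
    """Return 'native' or 'text' for the current turn."""
    mode = coerce_mode(((cfg or {}).get("agent") or {}).get("image_input_mode"))
    if mode != "auto":
        return mode  # explicit native/text wins over everything else
    if explicit_aux_vision_override(cfg):
        return "text"  # keep the dedicated vision backend in the loop
    # Unknown capability (None) is treated like "no vision".
    return "native" if supports_vision(provider, model) is True else "text"
```

Treating an unknown model as text-only matches `test_auto_with_unknown_model`: the text pipeline is the fallback that cannot fail at the provider.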

@@ -0,0 +1,277 @@
"""Tests for reactive image-shrink recovery.
Covers the full chain for Anthropic's 5 MB per-image ceiling (and any
future provider that returns an image-too-large error):
1. agent/error_classifier.py: 400 with "image exceeds 5 MB maximum"
gets FailoverReason.image_too_large, not context_overflow.
2. run_agent._try_shrink_image_parts_in_messages mutates the API
payload in-place, re-encoding native data: URL image parts to fit
under 4 MB using vision_tools._resize_image_for_vision.
The end-to-end wiring in the retry loop is not unit-tested here — it's
covered by the live E2E in the PR description. These tests lock in the
two pieces that matter independently: the classifier signal and the
payload rewriter.
"""
from __future__ import annotations
import base64
from pathlib import Path
import pytest
from agent.error_classifier import FailoverReason, classify_api_error
class _FakeApiError(Exception):
"""Stand-in for an openai.BadRequestError with status_code + body."""
def __init__(self, status_code: int, message: str, body: dict | None = None):
super().__init__(message)
self.status_code = status_code
self.body = body or {"error": {"message": message}}
self.response = None # required by some code paths
# ─── Classifier ──────────────────────────────────────────────────────────────
class TestImageTooLargeClassification:
def test_anthropic_400_image_exceeds_message(self):
"""Anthropic's exact wording must classify as image_too_large, not context."""
err = _FakeApiError(
status_code=400,
message=(
"messages.0.content.1.image.source.base64: image exceeds 5 MB "
"maximum: 12966600 bytes > 5242880 bytes"
),
)
result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
assert result.reason == FailoverReason.image_too_large
assert result.retryable is True
def test_generic_image_too_large_no_status(self):
"""No status_code path: message text alone triggers classification."""
err = Exception("image too large for this endpoint")
result = classify_api_error(err, provider="some-provider", model="some-model")
assert result.reason == FailoverReason.image_too_large
assert result.retryable is True
def test_image_too_large_not_confused_with_context_overflow(self):
"""'image exceeds' must NOT be mis-classified as context_overflow.
The context_overflow patterns include 'exceeds the limit', which is a
substring of this message, so verify the image-too-large check fires first.
"""
err = _FakeApiError(
status_code=400,
message="image exceeds the limit for this model",
)
result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
assert result.reason == FailoverReason.image_too_large
def test_regular_context_overflow_unaffected(self):
"""Context-overflow errors without image keywords still classify correctly."""
err = _FakeApiError(
status_code=400,
message="prompt is too long: context length 300000 exceeds max of 200000",
)
result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
assert result.reason == FailoverReason.context_overflow
# ─── Shrink helper ───────────────────────────────────────────────────────────
def _big_png_data_url(size_kb: int) -> str:
"""Build a data URL with a plausible large base64 payload."""
# Use real PNG header so MIME detection works; fill to target size.
raw = b"\x89PNG\r\n\x1a\n" + b"X" * (size_kb * 1024)
return "data:image/png;base64," + base64.b64encode(raw).decode("ascii")
def _make_agent():
"""Build a bare AIAgent for method-level testing, no provider setup."""
from run_agent import AIAgent
agent = object.__new__(AIAgent)
agent.provider = "anthropic"
agent.model = "claude-sonnet-4-6"
return agent
class TestShrinkImagePartsHelper:
def test_no_messages_returns_false(self):
agent = _make_agent()
assert agent._try_shrink_image_parts_in_messages([]) is False
assert agent._try_shrink_image_parts_in_messages(None) is False
def test_no_image_parts_returns_false(self):
agent = _make_agent()
msgs = [
{"role": "user", "content": "plain text"},
{"role": "assistant", "content": "ack"},
]
assert agent._try_shrink_image_parts_in_messages(msgs) is False
def test_small_image_part_not_shrunk(self, monkeypatch):
"""An image under 4 MB is left alone — shrink helper only touches oversized ones."""
agent = _make_agent()
small_url = _big_png_data_url(100) # ~100 KB + b64 overhead
resize_hits = {"count": 0}
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
lambda *a, **kw: resize_hits.__setitem__("count", resize_hits["count"] + 1) or small_url,
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "text", "text": "hi"},
{"type": "image_url", "image_url": {"url": small_url}},
],
}]
assert agent._try_shrink_image_parts_in_messages(msgs) is False
assert resize_hits["count"] == 0
# URL unchanged.
assert msgs[0]["content"][1]["image_url"]["url"] == small_url
def test_oversized_image_url_dict_shape_rewritten(self, monkeypatch):
"""OpenAI chat.completions shape: {image_url: {url: data:...}}."""
agent = _make_agent()
oversized_url = _big_png_data_url(5000) # ~5 MB raw → ~6.7 MB b64
shrunk = "data:image/jpeg;base64," + "A" * 1000 # small
def _fake_resize(path, mime_type=None, max_base64_bytes=None):
return shrunk
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
_fake_resize,
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "text", "text": "look"},
{"type": "image_url", "image_url": {"url": oversized_url}},
],
}]
changed = agent._try_shrink_image_parts_in_messages(msgs)
assert changed is True
assert msgs[0]["content"][1]["image_url"]["url"] == shrunk
def test_oversized_input_image_string_shape_rewritten(self, monkeypatch):
"""OpenAI Responses shape: {type: input_image, image_url: "data:..."}."""
agent = _make_agent()
oversized_url = _big_png_data_url(5000)
shrunk = "data:image/jpeg;base64," + "B" * 1000
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
lambda *a, **kw: shrunk,
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "input_text", "text": "look"},
{"type": "input_image", "image_url": oversized_url},
],
}]
changed = agent._try_shrink_image_parts_in_messages(msgs)
assert changed is True
assert msgs[0]["content"][1]["image_url"] == shrunk
def test_multiple_images_all_shrunk(self, monkeypatch):
agent = _make_agent()
big1 = _big_png_data_url(5000)
big2 = _big_png_data_url(6000)
shrunk = "data:image/jpeg;base64," + "C" * 500
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
lambda *a, **kw: shrunk,
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "text", "text": "compare"},
{"type": "image_url", "image_url": {"url": big1}},
{"type": "image_url", "image_url": {"url": big2}},
],
}]
changed = agent._try_shrink_image_parts_in_messages(msgs)
assert changed is True
assert msgs[0]["content"][1]["image_url"]["url"] == shrunk
assert msgs[0]["content"][2]["image_url"]["url"] == shrunk
def test_http_url_images_not_touched(self, monkeypatch):
"""Only data: URLs are candidates — http URLs are server-fetched."""
agent = _make_agent()
resize_hits = {"count": 0}
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
lambda *a, **kw: resize_hits.__setitem__("count", resize_hits["count"] + 1) or "shrunk",
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "text", "text": "at this url"},
{"type": "image_url", "image_url": {"url": "https://example.com/big.png"}},
],
}]
assert agent._try_shrink_image_parts_in_messages(msgs) is False
assert resize_hits["count"] == 0
def test_shrink_failure_returns_false_and_leaves_url_intact(self, monkeypatch):
"""If re-encode fails, leave the URL alone so the caller surfaces the original error."""
agent = _make_agent()
oversized_url = _big_png_data_url(5000)
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
lambda *a, **kw: None, # resize returned nothing usable
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": oversized_url}},
],
}]
assert agent._try_shrink_image_parts_in_messages(msgs) is False
assert msgs[0]["content"][0]["image_url"]["url"] == oversized_url
def test_shrink_that_makes_it_bigger_rejected(self, monkeypatch):
"""If the 'shrink' somehow produces a larger payload, skip it."""
agent = _make_agent()
oversized_url = _big_png_data_url(5000)
even_bigger = "data:image/png;base64," + "Z" * (10 * 1024 * 1024)
monkeypatch.setattr(
"tools.vision_tools._resize_image_for_vision",
lambda *a, **kw: even_bigger,
raising=False,
)
msgs = [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": oversized_url}},
],
}]
assert agent._try_shrink_image_parts_in_messages(msgs) is False
# Original URL still in place, not replaced by the bigger one.
assert msgs[0]["content"][0]["image_url"]["url"] == oversized_url
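The ordering concern behind `test_image_too_large_not_confused_with_context_overflow` can be illustrated with a small sketch. It is not the logic in `agent/error_classifier.py`: `classify_message` is a hypothetical function, and apart from `'exceeds the limit'`, which the test docstring mentions, the pattern lists are illustrative assumptions.

```python
# Hypothetical pattern lists, for illustration only.
IMAGE_TOO_LARGE_PATTERNS = ("image exceeds", "image too large")
CONTEXT_OVERFLOW_PATTERNS = ("exceeds the limit", "context length", "prompt is too long")


def classify_message(message: str) -> str:
    """Classify an error message by substring match, most specific first."""
    text = message.lower()
    # Image patterns are checked first: "image exceeds the limit" also
    # contains the context-overflow pattern "exceeds the limit".
    if any(p in text for p in IMAGE_TOO_LARGE_PATTERNS):
        return "image_too_large"
    if any(p in text for p in CONTEXT_OVERFLOW_PATTERNS):
        return "context_overflow"
    return "unknown"
```

Swapping the two checks would send the image error down the context-overflow path, which is the regression the classifier tests above guard against.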

@@ -0,0 +1,170 @@
"""Tests for the vision-aware image preprocessing in run_agent.py.
Covers:
* ``_prepare_anthropic_messages_for_api`` — passes image parts through
unchanged when the active model reports ``supports_vision=True`` (the
adapter handles them natively), and falls back to text-description
replacement when the model lacks vision.
* ``_prepare_messages_for_non_vision_model`` — the mirror method for the
chat.completions / codex_responses paths. Same contract.
"""
from __future__ import annotations
from unittest.mock import MagicMock, patch
import pytest
from run_agent import AIAgent
def _make_agent() -> AIAgent:
"""Build a bare-bones AIAgent instance without running __init__.
Avoids the heavy provider/credential setup for these pure-method tests.
"""
agent = object.__new__(AIAgent)
agent.provider = "anthropic"
agent.model = "claude-sonnet-4"
agent._anthropic_image_fallback_cache = {}
return agent
IMG_PARTS_USER_MSG = {
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
],
}
PLAIN_USER_MSG = {"role": "user", "content": "hello, no images here"}
# ─── _prepare_anthropic_messages_for_api ─────────────────────────────────────
class TestPrepareAnthropicMessages:
def test_no_images_passes_through(self):
agent = _make_agent()
msgs = [PLAIN_USER_MSG]
out = agent._prepare_anthropic_messages_for_api(msgs)
assert out is msgs # unchanged reference
def test_vision_capable_passes_images_through(self):
"""The Anthropic adapter handles image_url/input_image natively."""
agent = _make_agent()
with patch.object(agent, "_model_supports_vision", return_value=True):
out = agent._prepare_anthropic_messages_for_api([IMG_PARTS_USER_MSG])
# Passes through unchanged — image_url parts still present.
assert out[0]["content"][1]["type"] == "image_url"
def test_non_vision_replaces_images_with_text(self):
agent = _make_agent()
with patch.object(agent, "_model_supports_vision", return_value=False), \
patch.object(
agent,
"_describe_image_for_anthropic_fallback",
return_value="[Image description: a cat]",
):
out = agent._prepare_anthropic_messages_for_api([IMG_PARTS_USER_MSG])
# Content collapsed to a string containing the description + user text.
content = out[0]["content"]
assert isinstance(content, str)
assert "[Image description: a cat]" in content
assert "What's in this image?" in content
# No more image parts.
assert "image_url" not in content
# ─── _prepare_messages_for_non_vision_model ──────────────────────────────────
class TestPrepareMessagesForNonVision:
def test_no_images_passes_through(self):
agent = _make_agent()
msgs = [PLAIN_USER_MSG]
out = agent._prepare_messages_for_non_vision_model(msgs)
assert out is msgs
def test_vision_capable_passes_through(self):
"""For vision-capable models on chat.completions path, provider handles pixels."""
agent = _make_agent()
agent.provider = "openrouter"
agent.model = "anthropic/claude-sonnet-4"
with patch.object(agent, "_model_supports_vision", return_value=True):
out = agent._prepare_messages_for_non_vision_model([IMG_PARTS_USER_MSG])
assert out[0]["content"][1]["type"] == "image_url"
def test_non_vision_strips_images(self):
agent = _make_agent()
agent.provider = "openrouter"
agent.model = "qwen/qwen3-235b-a22b"
with patch.object(agent, "_model_supports_vision", return_value=False), \
patch.object(
agent,
"_describe_image_for_anthropic_fallback",
return_value="[Image description: a dog]",
):
out = agent._prepare_messages_for_non_vision_model([IMG_PARTS_USER_MSG])
content = out[0]["content"]
assert isinstance(content, str)
assert "[Image description: a dog]" in content
assert "image_url" not in content
def test_multiple_messages_with_mixed_content(self):
agent = _make_agent()
agent.model = "qwen/qwen3-235b"
msgs = [
{"role": "user", "content": "first turn"},
{"role": "assistant", "content": "ack"},
IMG_PARTS_USER_MSG,
]
with patch.object(agent, "_model_supports_vision", return_value=False), \
patch.object(
agent,
"_describe_image_for_anthropic_fallback",
return_value="[Image: thing]",
):
out = agent._prepare_messages_for_non_vision_model(msgs)
# First two messages unchanged (no images), third stripped.
assert out[0]["content"] == "first turn"
assert out[1]["content"] == "ack"
assert isinstance(out[2]["content"], str)
assert "[Image: thing]" in out[2]["content"]
# ─── _model_supports_vision ──────────────────────────────────────────────────
class TestModelSupportsVision:
    def test_missing_provider_or_model_returns_false(self):
        agent = _make_agent()
        agent.provider = ""
        agent.model = "claude-sonnet-4"
        assert agent._model_supports_vision() is False
        agent.provider = "anthropic"
        agent.model = ""
        assert agent._model_supports_vision() is False

    def test_uses_get_model_capabilities(self):
        agent = _make_agent()
        fake_caps = MagicMock()
        fake_caps.supports_vision = True
        with patch("agent.models_dev.get_model_capabilities", return_value=fake_caps):
            assert agent._model_supports_vision() is True
        fake_caps.supports_vision = False
        with patch("agent.models_dev.get_model_capabilities", return_value=fake_caps):
            assert agent._model_supports_vision() is False

    def test_none_caps_returns_false(self):
        agent = _make_agent()
        with patch("agent.models_dev.get_model_capabilities", return_value=None):
            assert agent._model_supports_vision() is False

    def test_exception_returns_false(self):
        agent = _make_agent()
        with patch("agent.models_dev.get_model_capabilities", side_effect=RuntimeError("boom")):
            assert agent._model_supports_vision() is False
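
Taken together, these four tests describe a lookup that fails closed: any missing input, missing record, or lookup error means "no vision". A minimal sketch of that behaviour follows, assuming a hypothetical `Caps` record and an injected `lookup` callable in place of `agent.models_dev.get_model_capabilities`; the real helper's signature may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Caps:
    """Hypothetical stand-in for the capability record read from models.dev."""
    supports_vision: bool = False


def model_supports_vision(
    provider: str,
    model: str,
    lookup: Callable[[str, str], Optional[Caps]],
) -> bool:
    """Return True only when the lookup positively reports vision support."""
    if not provider or not model:
        return False  # nothing to look up
    try:
        caps = lookup(provider, model)
    except Exception:
        return False  # a failing lookup must never break the turn
    return bool(caps is not None and caps.supports_vision)


def failing_lookup(provider: str, model: str) -> Optional[Caps]:
    """Illustrative lookup that always raises, mimicking a metadata outage."""
    raise RuntimeError("boom")
```

Failing closed keeps the text pipeline as the default whenever capability data is unavailable, which matches the "else fall back to text" rule in the commit message.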


@@ -754,7 +754,15 @@ from tools.registry import registry, tool_error
 VISION_ANALYZE_SCHEMA = {
     "name": "vision_analyze",
-    "description": "Analyze images using AI vision. Provides a comprehensive description and answers a specific question about the image content.",
+    "description": (
+        "Inspect an image from a URL, file path, or tool output when you need "
+        "closer detail than what's visible in the conversation. If the user's "
+        "image is already attached to the conversation and you can see it, "
+        "just answer directly — only call this tool for images referenced by "
+        "URL/path, images returned inside other tool results (browser "
+        "screenshots, search thumbnails), or when you need a deeper look at "
+        "a specific region the main model's vision may have missed."
+    ),
     "parameters": {
         "type": "object",
         "properties": {


@@ -13,7 +13,7 @@ import time
 import uuid
 from datetime import datetime
 from pathlib import Path
-from typing import Optional
+from typing import Any, Optional
 from hermes_constants import get_hermes_home
 from hermes_cli.env_loader import load_hermes_dotenv
@@ -2274,7 +2274,60 @@ def _(rid, params: dict) -> dict:
             return
         prompt = ctx.message
-        prompt = _enrich_with_attached_images(prompt, images) if images else prompt
+        # Decide image routing per-turn based on active provider/model.
+        #   "native" → pass pixels to the main model as OpenAI-style content
+        #              parts (adapters translate for Anthropic/Gemini/Bedrock/etc.).
+        #   "text"   → pre-analyze with vision_analyze and prepend the text.
+        # See agent/image_routing.py for the full decision table.
+        run_message: Any = prompt
+        if images:
+            try:
+                from agent.image_routing import (
+                    decide_image_input_mode,
+                    build_native_content_parts,
+                )
+                from agent.auxiliary_client import (
+                    _read_main_model,
+                    _read_main_provider,
+                )
+                from hermes_cli.config import load_config as _tui_load_config
+                _cfg = _tui_load_config()
+                _mode = decide_image_input_mode(
+                    _read_main_provider(),
+                    _read_main_model(),
+                    _cfg,
+                )
+            except Exception as _img_exc:
+                print(
+                    f"[tui_gateway] image_routing decision failed, defaulting to text: {_img_exc}",
+                    file=sys.stderr,
+                )
+                _mode = "text"
+            if _mode == "native":
+                try:
+                    _parts, _skipped = build_native_content_parts(
+                        prompt,
+                        images,
+                    )
+                    if _skipped:
+                        print(
+                            f"[tui_gateway] native image attachment skipped {len(_skipped)} unreadable path(s)",
+                            file=sys.stderr,
+                        )
+                    if any(p.get("type") == "image_url" for p in _parts):
+                        run_message = _parts
+                    else:
+                        run_message = _enrich_with_attached_images(prompt, images)
+                except Exception as _img_exc:
+                    print(
+                        f"[tui_gateway] native attach failed, falling back to text: {_img_exc}",
+                        file=sys.stderr,
+                    )
+                    run_message = _enrich_with_attached_images(prompt, images)
+            else:
+                run_message = _enrich_with_attached_images(prompt, images)
         def _stream(delta):
             payload = {"text": delta}
@@ -2283,7 +2336,7 @@ def _(rid, params: dict) -> dict:
             _emit("message.delta", sid, payload)
         result = agent.run_conversation(
-            prompt,
+            run_message,
             conversation_history=list(history),
             stream_callback=_stream,
         )
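
The fallback ladder in the tui_gateway hunk can be read as a small pure function. The sketch below is illustrative only: `choose_run_message` is a hypothetical name, and the three callables are injected stand-ins for `decide_image_input_mode`, `build_native_content_parts`, and `_enrich_with_attached_images`, whose real signatures are assumed from the call sites in the diff.

```python
from typing import Any, Callable, Sequence

Part = dict[str, Any]


def choose_run_message(
    prompt: str,
    images: Sequence[str],
    decide_mode: Callable[[], str],
    build_parts: Callable[[str, Sequence[str]], tuple[list[Part], list[str]]],
    enrich_text: Callable[[str, Sequence[str]], str],
) -> Any:
    """Hypothetical sketch of per-turn image routing with text fallback."""
    if not images:
        return prompt  # no attachments: the prompt is sent as-is

    try:
        mode = decide_mode()
    except Exception:
        mode = "text"  # a failed routing decision defaults to the text pipeline

    if mode == "native":
        try:
            parts, _skipped = build_parts(prompt, images)
            # Only go native if at least one image actually made it into the parts.
            if any(p.get("type") == "image_url" for p in parts):
                return parts
        except Exception:
            pass  # fall through to the text pipeline below

    return enrich_text(prompt, images)
```

Every failure path ends in the text pipeline, so under these assumptions a turn with attachments is never sent without some representation of the images.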