feat(image-input): native multimodal routing based on model vision capability (#16506)

* feat(image-input): native multimodal routing based on model vision capability Attach user-sent images as OpenAI-style content parts on the user turn when the active model supports native vision, so vision-capable models see real pixels instead of a lossy text description from vision_analyze. Routing decision (agent/image_routing.py::decide_image_input_mode): agent.image_input_mode = auto | native | text (default: auto) In auto mode: - If auxiliary.vision.provider/model is explicitly configured, keep the text pipeline (user paid for a dedicated vision backend). - Else if models.dev reports supports_vision=True for the active provider/model, attach natively. - Else fall back to text (current behaviour). Call sites updated: gateway/run.py (all messaging platforms), tui_gateway (dashboard/Ink), cli.py (interactive /attach + drag-drop). run_agent.py changes: - _prepare_anthropic_messages_for_api now passes image parts through unchanged when the model supports vision — the Anthropic adapter translates them to native image blocks. Previous behaviour (vision_analyze → text) only runs for non-vision Anthropic models. - New _prepare_messages_for_non_vision_model mirrors the same contract for chat.completions and codex_responses paths, so non-vision models on any provider get text-fallback instead of failing at the provider. - New _model_supports_vision() helper reads models.dev caps. vision_analyze description rewritten: positions it as a tool for images NOT already visible in the conversation (URLs, tool output, deeper inspection). Prevents the model from redundantly calling it on images already attached natively. Config default: agent.image_input_mode = auto. Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py), all existing tests that reference _prepare_anthropic_messages_for_api still pass (198 targeted + new tests green). * feat(image-input): size-cap + resize oversized images, charge image tokens in compressor Two follow-ups that make the native image routing safer for long / heavy sessions: 1) Oversize handling in build_native_content_parts: - 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES, the most restrictive provider — Gemini inline data). - Delegates to vision_tools._resize_image_for_vision (Pillow-based, already battle-tested) to downscale to 5 MB first-try. - If Pillow is missing or resize still overshoots, the image is dropped and reported back in skipped[]; caller falls back to text enrichment for that image. 2) Image-token accounting in context_compressor: - New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant; within the realistic range for Anthropic/GPT-4o/Gemini billing). - _content_length_for_budget() helper: sums text-part lengths and charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/ input_image part. Base64 payload inside image_url is NOT counted as chars — dimensions don't matter, only image-presence. - Both tail-cut sites (_prune_old_tool_results L527 and _find_tail_cut_by_tokens L1126) now call the helper so multi-image conversations don't slip past compression budget. Tests: 9 new in test_image_routing.py (oversize triggers resize, resize-fails-returns-None, oversize-skipped-reported), 11 new in test_compressor_image_tokens.py (flat charge per image, multiple images, Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on raw base64, bounds-check on the constant, integration test that an image-heavy tail actually gets trimmed). * fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits The previous commit imposed a hardcoded 20 MB base64 ceiling on all providers, triggering auto-resize on anything larger. This was wrong in both directions: * Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400 'image exceeds 5 MB maximum' above that). * Too strict for OpenAI / Codex / OpenRouter — accept 49 MB+ without complaint (empirically verified April 2026 with progressive PNG sizes). New behaviour: * _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a ceiling (5 MB, since bedrock-on-Claude shares Anthropic's decoder). * Providers NOT in the table get no ceiling — images attach at native size and we trust the provider to return its own error if it disagrees. A provider-specific 400 message is clearer than us guessing wrong and silently degrading image quality. * build_native_content_parts() gains a keyword-only provider arg; gateway/CLI/TUI pass the active provider so Anthropic users get auto-resize protection while OpenAI users don't pay it. * Resize target dropped from 5 MB to 4 MB to slide safely under Anthropic's boundary with header overhead. Empirical measurements (direct API, no Hermes in the loop): image b64 anthropic openrouter/gpt5.5 codex-oauth/gpt5.5 0.19 MB ✓ ✓ ✓ 12.37 MB ✗ 400 5MB ✓ ✓ 23.85 MB ✗ 400 5MB ✓ ✓ 49.46 MB ✗ 413 ✓ ✓ Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through, Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts routes ceiling by provider, unknown provider gets no ceiling. All 52 targeted tests pass. * refactor(image-input): attempt native, shrink-and-retry on provider reject Replace proactive per-provider size ceilings with a reactive shrink path on the provider's actual rejection. All providers now attempt native full-size attachment first; if the provider returns an image-too-large error, the agent silently shrinks and retries once. Why the previous design was wrong: hardcoding provider ceilings (anthropic=5MB, others=unlimited) meant OpenAI users on a 10MB image paid no tax, but Anthropic users lost quality on anything >5MB even though the empirical behaviour at provider-reject time is the same (shrink + retry). Baking the table into the routing layer also requires updating Hermes every time a provider's limit changes. Reactive design: - image_routing.py: _file_to_data_url encodes native size, no ceiling. build_native_content_parts drops its provider kwarg. - error_classifier.py: new FailoverReason.image_too_large + pattern match ("image exceeds", "image too large", etc.) checked BEFORE context_overflow so Anthropic's 5MB rejection lands in the right bucket. - run_agent.py: new _try_shrink_image_parts_in_messages walks api messages in-place, re-encodes oversized data: URL image parts through vision_tools._resize_image_for_vision to fit under 4MB, handles both chat.completions (dict image_url) and Responses (string image_url) shapes, ignores http URLs (provider-fetched). New image_shrink_retry_attempted flag in the retry loop fires the shrink exactly once per turn after credential-pool recovery but before auth retries. E2E verified live against Anthropic claude-sonnet-4-6: - 17.9MB PNG (23.9MB b64) attached at native size - Anthropic returns 400 "image exceeds 5 MB maximum" - Agent logs '📐 Image(s) exceeded provider size limit — shrank and retrying...' - Retry succeeds, correct response delivered in 6.8s total. Tests: 12 new (8 shrink-helper shapes + 4 classifier signals), replaces 5 proactive-ceiling tests with 3 simpler 'native attach works' tests. 181 targeted tests pass. test_enum_members_exist in test_error_classifier.py updated for the new enum value.
2026-04-27 06:27:59 -07:00
parent df3c9593f8
commit ec671c4154
14 changed files with 1539 additions and 42 deletions
--- a/agent/context_compressor.py
+++ b/agent/context_compressor.py
@@ -61,9 +61,52 @@ _PRUNED_TOOL_PLACEHOLDER = "[Old tool output cleared to save context space]"

 # Chars per token rough estimate
 _CHARS_PER_TOKEN = 4
+# Flat token cost per attached image part.  Real cost varies by provider and
+# dimensions (Anthropic ≈ width×height/750, GPT-4o up to ~1700 for
+# high-detail 2048×2048, Gemini 258/tile), but 1600 is a realistic ceiling
+# that keeps compression budgeting honest for multi-image conversations.
+# Matches Claude Code's IMAGE_TOKEN_ESTIMATE constant.
+_IMAGE_TOKEN_ESTIMATE = 1600
+# Same figure expressed in the char-budget currency the rest of the
+# compressor speaks in.  Used when accumulating message "content length"
+# for tail-cut decisions.
+_IMAGE_CHAR_EQUIVALENT = _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
 _SUMMARY_FAILURE_COOLDOWN_SECONDS = 600


+def _content_length_for_budget(raw_content: Any) -> int:
+    """Return the effective char-length of a message's content for token budgeting.
+
+    Plain strings: ``len(content)``. Multimodal lists: sum of text-part
+    ``len(text)`` plus a flat ``_IMAGE_CHAR_EQUIVALENT`` per image part
+    (``image_url`` / ``input_image`` / Anthropic-style ``image``). This
+    keeps the compressor from treating a turn with 5 attached images as
+    near-zero tokens just because the text part is empty.
+    """
+    if isinstance(raw_content, str):
+        return len(raw_content)
+    if not isinstance(raw_content, list):
+        return len(str(raw_content or ""))
+
+    total = 0
+    for p in raw_content:
+        if isinstance(p, str):
+            total += len(p)
+            continue
+        if not isinstance(p, dict):
+            total += len(str(p))
+            continue
+        ptype = p.get("type")
+        if ptype in {"image_url", "input_image", "image"}:
+            total += _IMAGE_CHAR_EQUIVALENT
+        else:
+            # text / input_text / tool_result-with-text / anything else with
+            # a text field.  Ignore the raw base64 payload inside image_url
+            # dicts — dimensions don't matter, only whether it's an image.
+            total += len(p.get("text", "") or "")
+    return total
+
+
 def _content_text_for_contains(content: Any) -> str:
    """Return a best-effort text view of message content.

@@ -484,18 +527,7 @@ class ContextCompressor(ContextEngine):
            for i in range(len(result) - 1, -1, -1):
                msg = result[i]
                raw_content = msg.get("content") or ""
-                content_len = (
-                    sum(
-                        len(p.get("text", ""))
-                        if isinstance(p, dict)
-                        else len(p)
-                        if isinstance(p, str)
-                        else len(str(p))
-                        for p in raw_content
-                    )
-                    if isinstance(raw_content, list)
-                    else len(raw_content)
-                )
+                content_len = _content_length_for_budget(raw_content)
                msg_tokens = content_len // _CHARS_PER_TOKEN + 10
                for tc in msg.get("tool_calls") or []:
                    if isinstance(tc, dict):
@@ -1094,18 +1126,7 @@ The user has requested that this compaction PRIORITISE preserving all informatio
        for i in range(n - 1, head_end - 1, -1):
            msg = messages[i]
            raw_content = msg.get("content") or ""
-            content_len = (
-                sum(
-                    len(p.get("text", ""))
-                    if isinstance(p, dict)
-                    else len(p)
-                    if isinstance(p, str)
-                    else len(str(p))
-                    for p in raw_content
-                )
-                if isinstance(raw_content, list)
-                else len(raw_content)
-            )
+            content_len = _content_length_for_budget(raw_content)
            msg_tokens = content_len // _CHARS_PER_TOKEN + 10  # +10 for role/metadata
            # Include tool call arguments in estimate
            for tc in msg.get("tool_calls") or []:
--- a/agent/error_classifier.py
+++ b/agent/error_classifier.py
@@ -42,6 +42,7 @@ class FailoverReason(enum.Enum):
    # Context / payload
    context_overflow = "context_overflow"  # Context too large — compress, not failover
    payload_too_large = "payload_too_large"  # 413 — compress payload
+    image_too_large = "image_too_large"   # Native image part exceeds provider's per-image limit — shrink and retry

    # Model
    model_not_found = "model_not_found"  # 404 or invalid model — fallback to different model
@@ -147,6 +148,20 @@ _PAYLOAD_TOO_LARGE_PATTERNS = [
    "error code: 413",
 ]

+# Image-size patterns.  Matched against 400 bodies (not 413) because most
+# providers return a 400 with a specific image-too-big message before the
+# whole request hits the 413 size limit.  Anthropic's wording is the most
+# important here (hard 5 MB per image, returned as
+# "messages.N.content.K.image.source.base64: image exceeds 5 MB maximum").
+_IMAGE_TOO_LARGE_PATTERNS = [
+    "image exceeds",        # Anthropic: "image exceeds 5 MB maximum"
+    "image too large",      # generic
+    "image_too_large",      # error_code variant
+    "image size exceeds",   # variant
+    # "request_too_large" on a request known to contain an image → image is
+    # the likely culprit; we still try the shrink path before giving up.
+]
+
 # Context overflow patterns
 _CONTEXT_OVERFLOW_PATTERNS = [
    "context length",
@@ -671,6 +686,15 @@ def _classify_400(
 ) -> ClassifiedError:
    """Classify 400 Bad Request — context overflow, format error, or generic."""

+    # Image-too-large from 400 (Anthropic's 5 MB per-image check fires this way).
+    # Must be checked BEFORE context_overflow because messages can trip both
+    # patterns ("exceeds" + "image") and image-shrink is a cheaper recovery.
+    if any(p in error_msg for p in _IMAGE_TOO_LARGE_PATTERNS):
+        return result_fn(
+            FailoverReason.image_too_large,
+            retryable=True,
+        )
+
    # Context overflow from 400
    if any(p in error_msg for p in _CONTEXT_OVERFLOW_PATTERNS):
        return result_fn(
@@ -798,6 +822,13 @@ def _classify_by_message(
            should_compress=True,
        )

+    # Image-too-large patterns (from message text when no status_code)
+    if any(p in error_msg for p in _IMAGE_TOO_LARGE_PATTERNS):
+        return result_fn(
+            FailoverReason.image_too_large,
+            retryable=True,
+        )
+
    # Usage-limit patterns need the same disambiguation as 402: some providers
    # surface "usage limit" errors without an HTTP status code.  A transient
    # signal ("try again", "resets at", …) means it's a periodic quota, not
--- a/agent/image_routing.py
+++ b/agent/image_routing.py
@@ -0,0 +1,236 @@
+"""Routing helpers for inbound user-attached images.
+
+Two modes:
+
+  native  — attach images as OpenAI-style ``image_url`` content parts on the
+            user turn. Provider adapters (Anthropic, Gemini, Bedrock, Codex,
+            OpenAI chat.completions) already translate these into their
+            vendor-specific multimodal formats.
+
+  text    — run ``vision_analyze`` on each image up-front and prepend the
+            description to the user's text. The model never sees the pixels;
+            it only sees a lossy text summary. This is the pre-existing
+            behaviour and still the right choice for non-vision models.
+
+The decision is made once per message turn by :func:`decide_image_input_mode`.
+It reads ``agent.image_input_mode`` from config.yaml (``auto`` | ``native``
+| ``text``, default ``auto``) and the active model's capability metadata.
+
+In ``auto`` mode:
+  - If the user has explicitly configured ``auxiliary.vision.provider``
+    (i.e. not ``auto`` and not empty), we assume they want the text pipeline
+    regardless of the main model — they've opted in to a specific vision
+    backend for a reason (cost, quality, local-only, etc.).
+  - Otherwise, if the active model reports ``supports_vision=True`` in its
+    models.dev metadata, we attach natively.
+  - Otherwise (non-vision model, no explicit override), we fall back to text.
+
+This keeps ``vision_analyze`` surfaced as a tool in every session — skills
+and agent flows that chain it (browser screenshots, deeper inspection of
+URL-referenced images, style-gating loops) keep working. The routing only
+affects *how user-attached images on the current turn* are presented to the
+main model.
+"""
+
+from __future__ import annotations
+
+import base64
+import logging
+import mimetypes
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+logger = logging.getLogger(__name__)
+
+
+_VALID_MODES = frozenset({"auto", "native", "text"})
+
+
+def _coerce_mode(raw: Any) -> str:
+    """Normalize a config value into one of the valid modes."""
+    if not isinstance(raw, str):
+        return "auto"
+    val = raw.strip().lower()
+    if val in _VALID_MODES:
+        return val
+    return "auto"
+
+
+def _explicit_aux_vision_override(cfg: Optional[Dict[str, Any]]) -> bool:
+    """True when the user configured a specific auxiliary vision backend.
+
+    An explicit override means the user *wants* the text pipeline (they're
+    paying for a dedicated vision model), so we don't silently bypass it.
+    """
+    if not isinstance(cfg, dict):
+        return False
+    aux = cfg.get("auxiliary") or {}
+    if not isinstance(aux, dict):
+        return False
+    vision = aux.get("vision") or {}
+    if not isinstance(vision, dict):
+        return False
+
+    provider = str(vision.get("provider") or "").strip().lower()
+    model = str(vision.get("model") or "").strip()
+    base_url = str(vision.get("base_url") or "").strip()
+
+    # "auto" / "" / blank = not explicit
+    if provider in ("", "auto") and not model and not base_url:
+        return False
+    return True
+
+
+def _lookup_supports_vision(provider: str, model: str) -> Optional[bool]:
+    """Return True/False if we can resolve caps, None if unknown."""
+    if not provider or not model:
+        return None
+    try:
+        from agent.models_dev import get_model_capabilities
+        caps = get_model_capabilities(provider, model)
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.debug("image_routing: caps lookup failed for %s:%s — %s", provider, model, exc)
+        return None
+    if caps is None:
+        return None
+    return bool(caps.supports_vision)
+
+
+def decide_image_input_mode(
+    provider: str,
+    model: str,
+    cfg: Optional[Dict[str, Any]],
+) -> str:
+    """Return ``"native"`` or ``"text"`` for the given turn.
+
+    Args:
+      provider: active inference provider ID (e.g. ``"anthropic"``, ``"openrouter"``).
+      model:    active model slug as it would be sent to the provider.
+      cfg:      loaded config.yaml dict, or None. When None, behaves as auto.
+    """
+    mode_cfg = "auto"
+    if isinstance(cfg, dict):
+        agent_cfg = cfg.get("agent") or {}
+        if isinstance(agent_cfg, dict):
+            mode_cfg = _coerce_mode(agent_cfg.get("image_input_mode"))
+
+    if mode_cfg == "native":
+        return "native"
+    if mode_cfg == "text":
+        return "text"
+
+    # auto
+    if _explicit_aux_vision_override(cfg):
+        return "text"
+
+    supports = _lookup_supports_vision(provider, model)
+    if supports is True:
+        return "native"
+    return "text"
+
+
+# Image size handling is REACTIVE rather than proactive: we attempt native
+# attachment at full size regardless of provider, and rely on
+# ``run_agent._try_shrink_image_parts_in_messages`` to shrink + retry if
+# the provider rejects the request (e.g. Anthropic's hard 5 MB per-image
+# ceiling returned as HTTP 400 "image exceeds 5 MB maximum").
+#
+# Why reactive: our knowledge of provider ceilings is partial and evolving
+# (OpenAI accepts 49 MB+, Anthropic 5 MB, Gemini 100 MB, others unknown).
+# A proactive per-provider table would be stale the moment a provider raises
+# or lowers its limit, and silently degrading quality for users on providers
+# that would have accepted the full image is the worse failure mode.
+# The shrink-on-reject path loses 1 API call + maybe 1s of Pillow work when
+# it fires, which is cheaper than permanent quality loss.
+
+
+def _guess_mime(path: Path) -> str:
+    mime, _ = mimetypes.guess_type(str(path))
+    if mime and mime.startswith("image/"):
+        return mime
+    # mimetypes on some Linux distros mis-maps .jpg; default to jpeg when
+    # the suffix looks imagey.
+    suffix = path.suffix.lower()
+    return {
+        ".jpg": "image/jpeg",
+        ".jpeg": "image/jpeg",
+        ".png": "image/png",
+        ".gif": "image/gif",
+        ".webp": "image/webp",
+        ".bmp": "image/bmp",
+    }.get(suffix, "image/jpeg")
+
+
+def _file_to_data_url(path: Path) -> Optional[str]:
+    """Encode a local image as a base64 data URL at its native size.
+
+    Size limits are NOT enforced here — the agent retry loop
+    (``run_agent._try_shrink_image_parts_in_messages``) shrinks on the
+    provider's first rejection. Keeping this simple means providers that
+    accept large images (OpenAI 49 MB+, Gemini 100 MB) don't pay a silent
+    quality tax just because one other provider is stricter.
+
+    Returns None only if the file can't be read (missing, permission
+    denied, etc.); the caller reports those paths in ``skipped``.
+    """
+    try:
+        raw = path.read_bytes()
+    except Exception as exc:
+        logger.warning("image_routing: failed to read %s — %s", path, exc)
+        return None
+    mime = _guess_mime(path)
+    b64 = base64.b64encode(raw).decode("ascii")
+    return f"data:{mime};base64,{b64}"
+
+
+def build_native_content_parts(
+    user_text: str,
+    image_paths: List[str],
+) -> Tuple[List[Dict[str, Any]], List[str]]:
+    """Build an OpenAI-style ``content`` list for a user turn.
+
+    Shape:
+      [{"type": "text", "text": "..."},
+       {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
+       ...]
+
+    Images are attached at their native size. If a provider rejects the
+    request because an image is too large (e.g. Anthropic's 5 MB per-image
+    ceiling), the agent's retry loop transparently shrinks and retries
+    once — see ``run_agent._try_shrink_image_parts_in_messages``.
+
+    Returns (content_parts, skipped_paths). Skipped paths are files that
+    couldn't be read from disk.
+    """
+    parts: List[Dict[str, Any]] = []
+    skipped: List[str] = []
+
+    text = (user_text or "").strip()
+    if text:
+        parts.append({"type": "text", "text": text})
+
+    for raw_path in image_paths:
+        p = Path(raw_path)
+        if not p.exists() or not p.is_file():
+            skipped.append(str(raw_path))
+            continue
+        data_url = _file_to_data_url(p)
+        if not data_url:
+            skipped.append(str(raw_path))
+            continue
+        parts.append({
+            "type": "image_url",
+            "image_url": {"url": data_url},
+        })
+
+    # If the text was empty, add a neutral prompt so the turn isn't just images.
+    if not text and any(p.get("type") == "image_url" for p in parts):
+        parts.insert(0, {"type": "text", "text": "What do you see in this image?"})
+
+    return parts, skipped
+
+
+__all__ = [
+    "decide_image_input_mode",
+    "build_native_content_parts",
+]
--- a/cli.py
+++ b/cli.py
@@ -8433,13 +8433,62 @@ class HermesCLI:
        ):
            return None
        
-        # Pre-process images through the vision tool (Gemini Flash) so the
-        # main model receives text descriptions instead of raw base64 image
-        # content — works with any model, not just vision-capable ones.
+        # Route image attachments based on the active model's vision capability.
+        # "native" → pass pixels as OpenAI-style content parts (adapters
+        #            translate for Anthropic/Gemini/Bedrock).
+        # "text"   → pre-analyze each image with vision_analyze and prepend the
+        #            description as text — works with non-vision models.
+        # See agent/image_routing.py for the decision table.
        if images:
-            message = self._preprocess_images_with_vision(
-                message if isinstance(message, str) else "", images
-            )
+            try:
+                from agent.image_routing import (
+                    build_native_content_parts,
+                    decide_image_input_mode,
+                )
+                from hermes_cli.config import load_config
+
+                _img_mode = decide_image_input_mode(
+                    (self.provider or "").strip(),
+                    (self.model or "").strip(),
+                    load_config(),
+                )
+            except Exception as _img_exc:
+                logging.debug("image_routing decision failed, defaulting to text: %s", _img_exc)
+                _img_mode = "text"
+
+            if _img_mode == "native":
+                try:
+                    _text_for_parts = message if isinstance(message, str) else ""
+                    _img_str_paths = [str(p) for p in images]
+                    _parts, _skipped = build_native_content_parts(
+                        _text_for_parts,
+                        _img_str_paths,
+                    )
+                    if _skipped:
+                        _cprint(
+                            f"  {_DIM}⚠ skipped {len(_skipped)} unreadable image path(s){_RST}"
+                        )
+                    if any(p.get("type") == "image_url" for p in _parts):
+                        _img_names = ", ".join(Path(p).name for p in _img_str_paths)
+                        _cprint(
+                            f"  {_DIM}📎 attaching {len(images)} image(s) natively "
+                            f"(model supports vision): {_img_names}{_RST}"
+                        )
+                        message = _parts
+                    else:
+                        # All images unreadable — fall back to text enrichment.
+                        message = self._preprocess_images_with_vision(
+                            message if isinstance(message, str) else "", images
+                        )
+                except Exception as _img_exc:
+                    logging.warning("native image attach failed, falling back to text: %s", _img_exc)
+                    message = self._preprocess_images_with_vision(
+                        message if isinstance(message, str) else "", images
+                    )
+            else:
+                message = self._preprocess_images_with_vision(
+                    message if isinstance(message, str) else "", images
+                )

        # Expand @ context references (e.g. @file:main.py, @diff, @folder:src/)
        if isinstance(message, str) and "@" in message:
--- a/gateway/run.py
+++ b/gateway/run.py
@@ -4199,9 +4199,18 @@ class GatewayRunner:
        Keep the normal inbound path and the queued follow-up path on the same
        preprocessing pipeline so sender attribution, image enrichment, STT,
        document notes, reply context, and @ references all behave the same.
+
+        Side effect: writes ``self._pending_native_image_paths`` to a list of
+        local image paths when the active model supports native vision AND
+        the user has images attached. The caller consumes and clears this
+        attribute at the ``run_conversation`` site to build a multimodal user
+        turn. When the list is empty, the ``_enrich_message_with_vision``
+        text path has already run and images are represented in-text.
        """
        history = history or []
        message_text = event.text or ""
+        # Reset per-call buffer; set only when native routing is chosen.
+        self._pending_native_image_paths = []

        _is_shared_multi_user = is_shared_multi_user_session(
            source,
@@ -4222,10 +4231,25 @@ class GatewayRunner:
                    audio_paths.append(path)

            if image_paths:
-                message_text = await self._enrich_message_with_vision(
-                    message_text,
-                    image_paths,
-                )
+                # Decide routing: native (attach pixels) vs text (vision_analyze
+                # pre-run + prepend description).  See agent/image_routing.py.
+                _img_mode = self._decide_image_input_mode()
+                if _img_mode == "native":
+                    # Defer attachment to the run_conversation call site.
+                    self._pending_native_image_paths = list(image_paths)
+                    logger.info(
+                        "Image routing: native (model supports vision). %d image(s) will be attached inline.",
+                        len(image_paths),
+                    )
+                else:
+                    logger.info(
+                        "Image routing: text (mode=%s). Pre-analyzing %d image(s) via vision_analyze.",
+                        _img_mode, len(image_paths),
+                    )
+                    message_text = await self._enrich_message_with_vision(
+                        message_text,
+                        image_paths,
+                    )

            if audio_paths:
                message_text = await self._enrich_message_with_transcription(
@@ -8378,6 +8402,29 @@ class GatewayRunner:
        ctx = copy_context()
        return await loop.run_in_executor(None, ctx.run, func, *args)

+    def _decide_image_input_mode(self) -> str:
+        """Resolve the image-input routing for the currently active model.
+
+        Returns ``"native"`` (attach pixels on the user turn) or ``"text"``
+        (pre-analyze with vision_analyze and prepend the description). See
+        agent/image_routing.py for the full decision table.
+
+        The active provider/model are read from config.yaml so the decision
+        tracks ``/model`` switches automatically on the next message.
+        """
+        try:
+            from agent.image_routing import decide_image_input_mode
+            from agent.auxiliary_client import _read_main_model, _read_main_provider
+            from hermes_cli.config import load_config
+
+            cfg = load_config()
+            provider = _read_main_provider()
+            model = _read_main_model()
+            return decide_image_input_mode(provider, model, cfg)
+        except Exception as exc:
+            logger.debug("image_routing: decision failed, falling back to text — %s", exc)
+            return "text"
+
    async def _enrich_message_with_vision(
        self,
        user_text: str,
@@ -10394,7 +10441,39 @@ class GatewayRunner:
            _approval_session_token = set_current_session_key(_approval_session_key)
            register_gateway_notify(_approval_session_key, _approval_notify_sync)
            try:
-                result = agent.run_conversation(message, conversation_history=agent_history, task_id=session_id)
+                # If _prepare_inbound_message_text buffered image paths for native
+                # attachment, wrap the user turn as an OpenAI-style multimodal
+                # content list. Consume-and-clear so subsequent turns on the same
+                # runner instance don't re-attach stale images.
+                _native_imgs = list(getattr(self, "_pending_native_image_paths", []) or [])
+                self._pending_native_image_paths = []
+                if _native_imgs:
+                    try:
+                        from agent.image_routing import build_native_content_parts
+                        _parts, _skipped = build_native_content_parts(
+                            message,
+                            _native_imgs,
+                        )
+                        if _skipped:
+                            logger.warning(
+                                "Native image attachment: skipped %d unreadable path(s): %s",
+                                len(_skipped), _skipped,
+                            )
+                        if any(p.get("type") == "image_url" for p in _parts):
+                            _run_message: Any = _parts
+                        else:
+                            # All images failed to read — fall back to plain text.
+                            _run_message = message
+                    except Exception as _img_exc:
+                        logger.warning(
+                            "Native image attachment failed, falling back to text: %s",
+                            _img_exc,
+                        )
+                        _run_message = message
+                else:
+                    _run_message = message
+
+                result = agent.run_conversation(_run_message, conversation_history=agent_history, task_id=session_id)
            finally:
                unregister_gateway_notify(_approval_session_key)
                reset_current_session_key(_approval_session_token)
--- a/hermes_cli/config.py
+++ b/hermes_cli/config.py
@@ -389,6 +389,20 @@ DEFAULT_CONFIG = {
        # (60+ tool iterations with tiny output) before users assume the
        # bot is dead and /restart.
        "gateway_notify_interval": 180,
+        # How user-attached images are presented to the main model on each turn.
+        #   "auto"   — attach natively when the active model reports
+        #              supports_vision=True AND the user hasn't explicitly
+        #              configured auxiliary.vision.provider.  Otherwise fall
+        #              back to text (vision_analyze pre-analysis).
+        #   "native" — always attach natively; non-vision models will either
+        #              error at the provider or get a last-chance text fallback
+        #              (see run_agent._prepare_messages_for_api).
+        #   "text"   — always pre-analyze with vision_analyze and prepend the
+        #              description as text; the main model never sees pixels.
+        # Affects gateway platforms, the TUI, and CLI /attach.  vision_analyze
+        # remains available as a tool regardless of this setting — the routing
+        # only controls how inbound user images are presented.
+        "image_input_mode": "auto",
    },
    
    "terminal": {
--- a/run_agent.py
+++ b/run_agent.py
@@ -7287,6 +7287,26 @@ class AIAgent:
        self._anthropic_image_fallback_cache[cache_key] = note
        return note

+    def _model_supports_vision(self) -> bool:
+        """Return True if the active provider+model reports native vision.
+
+        Used to decide whether to strip image content parts from API-bound
+        messages (for non-vision models) or let the provider adapter handle
+        them natively (for vision-capable models).
+        """
+        try:
+            from agent.models_dev import get_model_capabilities
+            provider = (getattr(self, "provider", "") or "").strip()
+            model = (getattr(self, "model", "") or "").strip()
+            if not provider or not model:
+                return False
+            caps = get_model_capabilities(provider, model)
+            if caps is None:
+                return False
+            return bool(caps.supports_vision)
+        except Exception:
+            return False
+
    def _preprocess_anthropic_content(self, content: Any, role: str) -> Any:
        if not self._content_has_image_parts(content):
            return content
@@ -7350,12 +7370,23 @@ class AIAgent:
        return t

    def _prepare_anthropic_messages_for_api(self, api_messages: list) -> list:
+        # Fast exit when no message carries image content at all.
        if not any(
            isinstance(msg, dict) and self._content_has_image_parts(msg.get("content"))
            for msg in api_messages
        ):
            return api_messages

+        # The Anthropic adapter (agent/anthropic_adapter.py:_convert_content_part_to_anthropic)
+        # already translates OpenAI-style image_url/input_image parts into
+        # native Anthropic ``{"type": "image", "source": ...}`` blocks. When
+        # the active model supports vision we let the adapter do its job and
+        # skip this legacy text-fallback preprocessor entirely.
+        if self._model_supports_vision():
+            return api_messages
+
+        # Non-vision Anthropic model (rare today, but keep the fallback for
+        # compat): replace each image part with a vision_analyze text note.
        transformed = copy.deepcopy(api_messages)
        for msg in transformed:
            if not isinstance(msg, dict):
@@ -7366,6 +7397,150 @@ class AIAgent:
            )
        return transformed

+    def _prepare_messages_for_non_vision_model(self, api_messages: list) -> list:
+        """Strip native image parts when the active model lacks vision.
+
+        Runs on the chat.completions / codex_responses paths. Vision-capable
+        models pass through unchanged (provider and any downstream translator
+        handle the image parts natively). Non-vision models get each image
+        replaced by a cached vision_analyze text description so the turn
+        doesn't fail with "model does not support image input".
+        """
+        if not any(
+            isinstance(msg, dict) and self._content_has_image_parts(msg.get("content"))
+            for msg in api_messages
+        ):
+            return api_messages
+
+        if self._model_supports_vision():
+            return api_messages
+
+        transformed = copy.deepcopy(api_messages)
+        for msg in transformed:
+            if not isinstance(msg, dict):
+                continue
+            # Reuse the Anthropic text-fallback preprocessor — the behaviour is
+            # identical (walk content parts, replace images with cached
+            # descriptions, merge back into a single text or structured
+            # content). Naming is historical.
+            msg["content"] = self._preprocess_anthropic_content(
+                msg.get("content"),
+                str(msg.get("role", "user") or "user"),
+            )
+        return transformed
+
+    def _try_shrink_image_parts_in_messages(self, api_messages: list) -> bool:
+        """Re-encode all native image parts at a smaller size to recover from
+        image-too-large errors (Anthropic 5 MB, unknown other providers).
+
+        Mutates ``api_messages`` in place. Returns True if any image part was
+        actually replaced, False if there were no image parts to shrink or
+        Pillow couldn't help (caller should surface the original error).
+
+        Strategy: look for ``image_url`` / ``input_image`` parts carrying a
+        ``data:image/...;base64,...`` payload.  For each one whose encoded
+        size exceeds 4 MB (a safe target that slides under Anthropic's 5 MB
+        ceiling with header overhead), write the base64 to a tempfile, call
+        ``vision_tools._resize_image_for_vision`` to produce a smaller data
+        URL, and substitute it in place.
+
+        Non-data-URL images (http/https URLs) are not touched — the provider
+        fetches those itself and the size limit is different.
+        """
+        if not api_messages:
+            return False
+
+        try:
+            from tools.vision_tools import _resize_image_for_vision
+        except Exception as exc:
+            logger.warning("image-shrink recovery: vision_tools unavailable — %s", exc)
+            return False
+
+        # 4 MB target leaves comfortable headroom under Anthropic's 5 MB.
+        # Non-Anthropic providers we haven't observed rejecting are fine with
+        # much larger; shrinking to 4 MB here loses quality but only fires
+        # after a confirmed provider rejection, so the alternative is failure.
+        target_bytes = 4 * 1024 * 1024
+        changed_count = 0
+
+        def _shrink_data_url(url: str) -> Optional[str]:
+            """Return a smaller data URL, or None if shrink can't help."""
+            if not isinstance(url, str) or not url.startswith("data:"):
+                return None
+            if len(url) <= target_bytes:
+                # This specific image wasn't the oversized one.
+                return None
+            try:
+                header, _, data = url.partition(",")
+                mime = "image/jpeg"
+                if header.startswith("data:"):
+                    mime_part = header[len("data:"):].split(";", 1)[0].strip()
+                    if mime_part.startswith("image/"):
+                        mime = mime_part
+                import base64 as _b64
+                raw = _b64.b64decode(data)
+                suffix = {
+                    "image/png": ".png", "image/gif": ".gif", "image/webp": ".webp",
+                    "image/jpeg": ".jpg", "image/jpg": ".jpg", "image/bmp": ".bmp",
+                }.get(mime, ".jpg")
+                tmp = tempfile.NamedTemporaryFile(
+                    prefix="hermes_shrink_", suffix=suffix, delete=False,
+                )
+                try:
+                    tmp.write(raw)
+                    tmp.close()
+                    resized = _resize_image_for_vision(
+                        Path(tmp.name),
+                        mime_type=mime,
+                        max_base64_bytes=target_bytes,
+                    )
+                finally:
+                    try:
+                        Path(tmp.name).unlink(missing_ok=True)
+                    except Exception:
+                        pass
+                if not resized or len(resized) >= len(url):
+                    # Shrink didn't help (or made it bigger — corrupt input?).
+                    return None
+                return resized
+            except Exception as exc:
+                logger.warning("image-shrink recovery: re-encode failed — %s", exc)
+                return None
+
+        for msg in api_messages:
+            if not isinstance(msg, dict):
+                continue
+            content = msg.get("content")
+            if not isinstance(content, list):
+                continue
+            for part in content:
+                if not isinstance(part, dict):
+                    continue
+                ptype = part.get("type")
+                if ptype not in {"image_url", "input_image"}:
+                    continue
+                image_value = part.get("image_url")
+                # OpenAI chat.completions: {"image_url": {"url": "data:..."}}
+                # OpenAI Responses: {"image_url": "data:..."}
+                if isinstance(image_value, dict):
+                    url = image_value.get("url", "")
+                    resized = _shrink_data_url(url)
+                    if resized:
+                        image_value["url"] = resized
+                        changed_count += 1
+                elif isinstance(image_value, str):
+                    resized = _shrink_data_url(image_value)
+                    if resized:
+                        part["image_url"] = resized
+                        changed_count += 1
+
+        if changed_count:
+            logger.info(
+                "image-shrink recovery: re-encoded %d image part(s) to fit under %.0f MB",
+                changed_count, target_bytes / (1024 * 1024),
+            )
+        return changed_count > 0
+
    def _anthropic_preserve_dots(self) -> bool:
        """True when using an anthropic-compatible endpoint that preserves dots in model names.
        Alibaba/DashScope keeps dots (e.g. qwen3.5-plus).
@@ -7514,9 +7689,10 @@ class AIAgent:
                )
            )
            is_xai_responses = self.provider == "xai" or self._base_url_hostname == "api.x.ai"
+            _msgs_for_codex = self._prepare_messages_for_non_vision_model(api_messages)
            return _ct.build_kwargs(
                model=self.model,
-                messages=api_messages,
+                messages=_msgs_for_codex,
                tools=self.tools,
                reasoning_config=self.reasoning_config,
                session_id=getattr(self, "session_id", None),
@@ -7595,9 +7771,12 @@ class AIAgent:
        if _ephemeral_out is not None:
            self._ephemeral_max_output_tokens = None

+        # Strip image parts for non-vision models (no-op when vision-capable).
+        _msgs_for_chat = self._prepare_messages_for_non_vision_model(api_messages)
+
        return _ct.build_kwargs(
            model=self.model,
-            messages=api_messages,
+            messages=_msgs_for_chat,
            tools=self.tools,
            timeout=self._resolved_api_call_timeout(),
            max_tokens=self.max_tokens,
@@ -9891,6 +10070,7 @@ class AIAgent:
            nous_auth_retry_attempted=False
            copilot_auth_retry_attempted=False
            thinking_sig_retry_attempted = False
+            image_shrink_retry_attempted = False
            has_retried_429 = False
            restart_with_compressed_messages = False
            restart_with_length_continuation = False
@@ -10812,6 +10992,31 @@ class AIAgent:
                    )
                    if recovered_with_pool:
                        continue
+
+                    # Image-too-large recovery: shrink oversized native image
+                    # parts in-place and retry once.  Triggered by Anthropic's
+                    # per-image 5 MB ceiling (400 with "image exceeds 5 MB
+                    # maximum") or any other provider that complains about
+                    # image size.  If shrink fails or a second attempt still
+                    # fails, fall through to normal error handling.
+                    if (
+                        classified.reason == FailoverReason.image_too_large
+                        and not image_shrink_retry_attempted
+                    ):
+                        image_shrink_retry_attempted = True
+                        if self._try_shrink_image_parts_in_messages(api_messages):
+                            self._vprint(
+                                f"{self.log_prefix}📐 Image(s) exceeded provider size limit — "
+                                f"shrank and retrying...",
+                                force=True,
+                            )
+                            continue
+                        else:
+                            logger.info(
+                                "image-shrink recovery: no data-URL image parts found "
+                                "or shrink didn't reduce size; surfacing original error."
+                            )
+
                    if (
                        self.api_mode == "codex_responses"
                        and self.provider == "openai-codex"
--- a/tests/agent/test_compressor_image_tokens.py
+++ b/tests/agent/test_compressor_image_tokens.py
@@ -0,0 +1,141 @@
+"""Tests for image-token accounting in the context compressor.
+
+Covers the native-image-routing PR's companion change: the compressor's
+multimodal message length counter now charges ~1600 tokens per attached
+image part instead of 0, so tail-cut / prune decisions are accurate for
+creative workflows that iterate on images across many turns.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from agent.context_compressor import (
+    _CHARS_PER_TOKEN,
+    _IMAGE_CHAR_EQUIVALENT,
+    _IMAGE_TOKEN_ESTIMATE,
+    _content_length_for_budget,
+)
+
+
+class TestContentLengthForBudget:
+    def test_plain_string(self):
+        assert _content_length_for_budget("hello world") == 11
+
+    def test_empty_string(self):
+        assert _content_length_for_budget("") == 0
+
+    def test_none_coerces_to_zero(self):
+        assert _content_length_for_budget(None) == 0
+
+    def test_text_only_list(self):
+        content = [
+            {"type": "text", "text": "first"},
+            {"type": "text", "text": "second"},
+        ]
+        assert _content_length_for_budget(content) == 5 + 6
+
+    def test_single_image_part_charges_fixed_budget(self):
+        content = [
+            {"type": "text", "text": "look"},
+            {"type": "image_url", "image_url": {"url": "data:image/png;base64,XXXX"}},
+        ]
+        # 4 chars of text + 1 image at fixed char-equivalent
+        assert _content_length_for_budget(content) == 4 + _IMAGE_CHAR_EQUIVALENT
+
+    def test_image_url_raw_base64_is_not_counted_as_chars(self):
+        """A 1MB base64 blob inside an image_url must NOT inflate token count.
+
+        The flat image estimate is what the provider actually bills; the raw
+        base64 is transport payload, not context tokens.
+        """
+        huge_url = "data:image/png;base64," + ("A" * 1_000_000)
+        content = [
+            {"type": "image_url", "image_url": {"url": huge_url}},
+        ]
+        # Exactly one image's worth, not 1M + something.
+        assert _content_length_for_budget(content) == _IMAGE_CHAR_EQUIVALENT
+
+    def test_multiple_image_parts(self):
+        content = [
+            {"type": "text", "text": "compare"},
+            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
+            {"type": "image_url", "image_url": {"url": "data:image/png;base64,BBB"}},
+            {"type": "image_url", "image_url": {"url": "data:image/png;base64,CCC"}},
+        ]
+        assert _content_length_for_budget(content) == 7 + 3 * _IMAGE_CHAR_EQUIVALENT
+
+    def test_openai_responses_input_image_shape(self):
+        """Responses API uses type=input_image with top-level image_url string."""
+        content = [
+            {"type": "input_text", "text": "hey"},
+            {"type": "input_image", "image_url": "data:image/png;base64,XX"},
+        ]
+        # input_text has .text "hey" (3 chars) + 1 image
+        assert _content_length_for_budget(content) == 3 + _IMAGE_CHAR_EQUIVALENT
+
+    def test_anthropic_native_image_shape(self):
+        """Anthropic native shape: {type: image, source: {...}}."""
+        content = [
+            {"type": "text", "text": "hi"},
+            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "XX"}},
+        ]
+        assert _content_length_for_budget(content) == 2 + _IMAGE_CHAR_EQUIVALENT
+
+    def test_bare_string_part_in_list(self):
+        """Older code paths sometimes produce mixed list-of-strings content."""
+        content = ["hello", {"type": "text", "text": "world"}]
+        assert _content_length_for_budget(content) == 5 + 5
+
+    def test_image_estimate_constant_is_reasonable(self):
+        """Sanity-check the estimate aligns with real provider billing.
+
+        Anthropic ≈ width*height/750 → ~1600 for 1000×1200.
+        OpenAI GPT-4o high-detail 2048×2048 ≈ 1445.
+        Gemini 258/tile × 6 tiles for a 2048×2048 ≈ 1548.
+        Anything in the 800-2000 range is defensible. Enforce bounds so an
+        accidental edit doesn't drop it to e.g. 16.
+        """
+        assert 800 <= _IMAGE_TOKEN_ESTIMATE <= 2500
+        assert _IMAGE_CHAR_EQUIVALENT == _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
+
+
+class TestTokenBudgetWithImages:
+    """Integration: the compressor's tail-cut decision now respects image cost."""
+
+    def test_image_heavy_turns_count_toward_budget(self):
+        """A tail with 5 image-bearing turns should blow past a 5K token budget."""
+        from agent.context_compressor import ContextCompressor
+
+        # Minimal compressor fixture — just enough to call _find_tail_cut_by_tokens
+        cc = object.__new__(ContextCompressor)
+        cc.tail_token_budget = 5000
+
+        # Build 10 messages: 5 with images, 5 with short text. Without the
+        # image-tokens fix, the compressor would think all 10 fit in 5K and
+        # protect them all. With the fix, images alone cost 5 × 1600 = 8K,
+        # so the tail should be trimmed.
+        messages = [{"role": "system", "content": "sys"}]
+        for i in range(5):
+            messages.append({
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": f"turn {i}"},
+                    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
+                ],
+            })
+            messages.append({
+                "role": "assistant",
+                "content": f"response {i}",
+            })
+
+        cut = cc._find_tail_cut_by_tokens(messages, head_end=0, token_budget=5000)
+
+        # Budget is 5K, soft ceiling 7.5K. 5 images alone = 8000 image-tokens.
+        # Walking backward, the compressor should stop before including all 5.
+        # Exact cut depends on text lengths and min_tail, but it MUST be > 1
+        # (at least some head-side messages should be compressible).
+        assert cut > 1, (
+            f"Expected image-heavy tail to be trimmed; compressor placed cut at "
+            f"{cut} out of {len(messages)} (image tokens were likely ignored)."
+        )
--- a/tests/agent/test_error_classifier.py
+++ b/tests/agent/test_error_classifier.py
@@ -54,7 +54,7 @@ class TestFailoverReason:
        expected = {
            "auth", "auth_permanent", "billing", "rate_limit",
            "overloaded", "server_error", "timeout",
-            "context_overflow", "payload_too_large",
+            "context_overflow", "payload_too_large", "image_too_large",
            "model_not_found", "format_error",
            "provider_policy_blocked",
            "thinking_signature", "long_context_tier", "unknown",
--- a/tests/agent/test_image_routing.py
+++ b/tests/agent/test_image_routing.py
@@ -0,0 +1,213 @@
+"""Tests for agent/image_routing.py — the per-turn image input mode decision."""
+
+from __future__ import annotations
+
+import base64
+from pathlib import Path
+from unittest.mock import patch
+
+import pytest
+
+from agent.image_routing import (
+    _coerce_mode,
+    _explicit_aux_vision_override,
+    build_native_content_parts,
+    decide_image_input_mode,
+)
+
+
+# ─── _coerce_mode ────────────────────────────────────────────────────────────
+
+
+class TestCoerceMode:
+    def test_valid_modes_pass_through(self):
+        assert _coerce_mode("auto") == "auto"
+        assert _coerce_mode("native") == "native"
+        assert _coerce_mode("text") == "text"
+
+    def test_case_insensitive(self):
+        assert _coerce_mode("NATIVE") == "native"
+        assert _coerce_mode("Auto") == "auto"
+
+    def test_invalid_falls_back_to_auto(self):
+        assert _coerce_mode("nonsense") == "auto"
+        assert _coerce_mode("") == "auto"
+        assert _coerce_mode(None) == "auto"
+        assert _coerce_mode(42) == "auto"
+
+    def test_strips_whitespace(self):
+        assert _coerce_mode("  native  ") == "native"
+
+
+# ─── _explicit_aux_vision_override ───────────────────────────────────────────
+
+
+class TestExplicitAuxVisionOverride:
+    def test_none_config(self):
+        assert _explicit_aux_vision_override(None) is False
+
+    def test_empty_config(self):
+        assert _explicit_aux_vision_override({}) is False
+
+    def test_default_auto_is_not_explicit(self):
+        cfg = {"auxiliary": {"vision": {"provider": "auto", "model": "", "base_url": ""}}}
+        assert _explicit_aux_vision_override(cfg) is False
+
+    def test_provider_set_is_explicit(self):
+        cfg = {"auxiliary": {"vision": {"provider": "openrouter", "model": ""}}}
+        assert _explicit_aux_vision_override(cfg) is True
+
+    def test_model_set_is_explicit(self):
+        cfg = {"auxiliary": {"vision": {"provider": "auto", "model": "google/gemini-2.5-flash"}}}
+        assert _explicit_aux_vision_override(cfg) is True
+
+    def test_base_url_set_is_explicit(self):
+        cfg = {"auxiliary": {"vision": {"provider": "auto", "base_url": "http://localhost:11434"}}}
+        assert _explicit_aux_vision_override(cfg) is True
+
+
+# ─── decide_image_input_mode ─────────────────────────────────────────────────
+
+
+class TestDecideImageInputMode:
+    def test_explicit_native_overrides_everything(self):
+        cfg = {"agent": {"image_input_mode": "native"}}
+        # Non-vision model, aux-vision explicitly configured: native still wins.
+        cfg["auxiliary"] = {"vision": {"provider": "openrouter", "model": "foo"}}
+        with patch("agent.image_routing._lookup_supports_vision", return_value=False):
+            assert decide_image_input_mode("openrouter", "some-non-vision-model", cfg) == "native"
+
+    def test_explicit_text_overrides_everything(self):
+        cfg = {"agent": {"image_input_mode": "text"}}
+        with patch("agent.image_routing._lookup_supports_vision", return_value=True):
+            assert decide_image_input_mode("anthropic", "claude-sonnet-4", cfg) == "text"
+
+    def test_auto_with_vision_capable_model(self):
+        with patch("agent.image_routing._lookup_supports_vision", return_value=True):
+            assert decide_image_input_mode("anthropic", "claude-sonnet-4", {}) == "native"
+
+    def test_auto_with_non_vision_model(self):
+        with patch("agent.image_routing._lookup_supports_vision", return_value=False):
+            assert decide_image_input_mode("openrouter", "qwen/qwen3-235b", {}) == "text"
+
+    def test_auto_with_unknown_model(self):
+        with patch("agent.image_routing._lookup_supports_vision", return_value=None):
+            assert decide_image_input_mode("openrouter", "brand-new-slug", {}) == "text"
+
+    def test_auto_respects_aux_vision_override_even_for_vision_model(self):
+        """If the user configured a dedicated vision backend, don't bypass it."""
+        cfg = {"auxiliary": {"vision": {"provider": "openrouter", "model": "google/gemini-2.5-flash"}}}
+        with patch("agent.image_routing._lookup_supports_vision", return_value=True):
+            assert decide_image_input_mode("anthropic", "claude-sonnet-4", cfg) == "text"
+
+    def test_none_config_is_auto(self):
+        with patch("agent.image_routing._lookup_supports_vision", return_value=True):
+            assert decide_image_input_mode("anthropic", "claude-sonnet-4", None) == "native"
+
+    def test_invalid_mode_coerces_to_auto(self):
+        cfg = {"agent": {"image_input_mode": "weird-value"}}
+        with patch("agent.image_routing._lookup_supports_vision", return_value=True):
+            assert decide_image_input_mode("anthropic", "claude-sonnet-4", cfg) == "native"
+
+
+# ─── build_native_content_parts ──────────────────────────────────────────────
+
+
+def _png_bytes() -> bytes:
+    """Return a tiny valid 1x1 transparent PNG."""
+    return base64.b64decode(
+        "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR4nGNgYGBgAAAABQABpfZFQAAAAABJRU5ErkJggg=="
+    )
+
+
+class TestBuildNativeContentParts:
+    def test_text_then_image(self, tmp_path: Path):
+        img = tmp_path / "cat.png"
+        img.write_bytes(_png_bytes())
+        parts, skipped = build_native_content_parts("hello", [str(img)])
+        assert skipped == []
+        assert len(parts) == 2
+        assert parts[0] == {"type": "text", "text": "hello"}
+        assert parts[1]["type"] == "image_url"
+        assert parts[1]["image_url"]["url"].startswith("data:image/png;base64,")
+
+    def test_empty_text_inserts_default_prompt(self, tmp_path: Path):
+        img = tmp_path / "cat.jpg"
+        img.write_bytes(_png_bytes())
+        parts, skipped = build_native_content_parts("", [str(img)])
+        assert skipped == []
+        # Even with empty user text, we insert a neutral prompt so the turn
+        # isn't just pixels.
+        assert parts[0]["type"] == "text"
+        assert parts[0]["text"] == "What do you see in this image?"
+        assert parts[1]["type"] == "image_url"
+
+    def test_missing_file_is_skipped(self, tmp_path: Path):
+        parts, skipped = build_native_content_parts("hi", [str(tmp_path / "missing.png")])
+        assert skipped == [str(tmp_path / "missing.png")]
+        # Only text remains.
+        assert parts == [{"type": "text", "text": "hi"}]
+
+    def test_multiple_images(self, tmp_path: Path):
+        img1 = tmp_path / "a.png"
+        img2 = tmp_path / "b.png"
+        img1.write_bytes(_png_bytes())
+        img2.write_bytes(_png_bytes())
+        parts, skipped = build_native_content_parts("compare these", [str(img1), str(img2)])
+        assert skipped == []
+        image_parts = [p for p in parts if p.get("type") == "image_url"]
+        assert len(image_parts) == 2
+
+    def test_mime_inference_jpg(self, tmp_path: Path):
+        img = tmp_path / "photo.jpg"
+        img.write_bytes(_png_bytes())  # bytes are PNG but extension is jpg
+        parts, _ = build_native_content_parts("x", [str(img)])
+        url = parts[1]["image_url"]["url"]
+        assert url.startswith("data:image/jpeg;base64,")
+
+    def test_mime_inference_webp(self, tmp_path: Path):
+        img = tmp_path / "pic.webp"
+        img.write_bytes(_png_bytes())
+        parts, _ = build_native_content_parts("", [str(img)])
+        url = parts[1]["image_url"]["url"]
+        assert url.startswith("data:image/webp;base64,")
+
+
+# ─── Oversize handling ───────────────────────────────────────────────────────
+
+
+class TestLargeImageHandling:
+    """Large images attach at native size; shrink is handled reactively at
+    retry time in ``run_agent._try_shrink_image_parts_in_messages`` rather
+    than proactively here.
+    """
+
+    def test_large_image_passes_through_unchanged(self, tmp_path: Path):
+        """A multi-MB image is attached as-is — no resize, no skip."""
+        from agent import image_routing as _ir
+
+        img = tmp_path / "medium.png"
+        # 200 KB of real bytes; not huge but enough to verify no size gate fires.
+        img.write_bytes(b"\x89PNG\r\n\x1a\n" + b"X" * 200_000)
+        url = _ir._file_to_data_url(img)
+        assert url is not None
+        assert url.startswith("data:image/png;base64,")
+        # Base64 expansion means output is ~4/3 of input, plus header.
+        assert len(url) > 200_000
+
+    def test_missing_file_returns_none(self, tmp_path: Path):
+        from agent import image_routing as _ir
+        missing = tmp_path / "does_not_exist.png"
+        assert _ir._file_to_data_url(missing) is None
+
+    def test_build_native_parts_no_provider_kwarg(self, tmp_path: Path):
+        """build_native_content_parts takes text + paths, no provider kwarg."""
+        from agent import image_routing as _ir
+
+        img = tmp_path / "cat.png"
+        img.write_bytes(_png_bytes())
+        parts, skipped = _ir.build_native_content_parts("hi", [str(img)])
+        assert skipped == []
+        assert len(parts) == 2
+        assert parts[0]["type"] == "text"
+        assert parts[1]["type"] == "image_url"
--- a/tests/run_agent/test_image_shrink_recovery.py
+++ b/tests/run_agent/test_image_shrink_recovery.py
@@ -0,0 +1,277 @@
+"""Tests for reactive image-shrink recovery.
+
+Covers the full chain for Anthropic's 5 MB per-image ceiling (and any
+future provider that returns an image-too-large error):
+
+  1. agent/error_classifier.py: 400 with "image exceeds 5 MB maximum"
+     gets FailoverReason.image_too_large, not context_overflow.
+  2. run_agent._try_shrink_image_parts_in_messages mutates the API
+     payload in-place, re-encoding native data: URL image parts to fit
+     under 4 MB using vision_tools._resize_image_for_vision.
+
+The end-to-end wiring in the retry loop is not unit-tested here — it's
+covered by the live E2E in the PR description. These tests lock in the
+two pieces that matter independently: the classifier signal and the
+payload rewriter.
+"""
+
+from __future__ import annotations
+
+import base64
+from pathlib import Path
+
+import pytest
+
+from agent.error_classifier import FailoverReason, classify_api_error
+
+
+class _FakeApiError(Exception):
+    """Stand-in for an openai.BadRequestError with status_code + body."""
+
+    def __init__(self, status_code: int, message: str, body: dict | None = None):
+        super().__init__(message)
+        self.status_code = status_code
+        self.body = body or {"error": {"message": message}}
+        self.response = None  # required by some code paths
+
+
+# ─── Classifier ──────────────────────────────────────────────────────────────
+
+
+class TestImageTooLargeClassification:
+    def test_anthropic_400_image_exceeds_message(self):
+        """Anthropic's exact wording must classify as image_too_large, not context."""
+        err = _FakeApiError(
+            status_code=400,
+            message=(
+                "messages.0.content.1.image.source.base64: image exceeds 5 MB "
+                "maximum: 12966600 bytes > 5242880 bytes"
+            ),
+        )
+        result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
+        assert result.reason == FailoverReason.image_too_large
+        assert result.retryable is True
+
+    def test_generic_image_too_large_no_status(self):
+        """No status_code path: message text alone triggers classification."""
+        err = Exception("image too large for this endpoint")
+        result = classify_api_error(err, provider="some-provider", model="some-model")
+        assert result.reason == FailoverReason.image_too_large
+        assert result.retryable is True
+
+    def test_image_too_large_not_confused_with_context_overflow(self):
+        """'image exceeds' must NOT be mis-classified as context_overflow.
+
+        The context_overflow patterns include 'exceeds the limit' which is a
+        superstring risk — verify the image-too-large check fires first.
+        """
+        err = _FakeApiError(
+            status_code=400,
+            message="image exceeds the limit for this model",
+        )
+        result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
+        assert result.reason == FailoverReason.image_too_large
+
+    def test_regular_context_overflow_unaffected(self):
+        """Context-overflow errors without image keywords still classify correctly."""
+        err = _FakeApiError(
+            status_code=400,
+            message="prompt is too long: context length 300000 exceeds max of 200000",
+        )
+        result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
+        assert result.reason == FailoverReason.context_overflow
+
+
+# ─── Shrink helper ───────────────────────────────────────────────────────────
+
+
+def _big_png_data_url(size_kb: int) -> str:
+    """Build a data URL with a plausible large base64 payload."""
+    # Use real PNG header so MIME detection works; fill to target size.
+    raw = b"\x89PNG\r\n\x1a\n" + b"X" * (size_kb * 1024)
+    return "data:image/png;base64," + base64.b64encode(raw).decode("ascii")
+
+
+def _make_agent():
+    """Build a bare AIAgent for method-level testing, no provider setup."""
+    from run_agent import AIAgent
+    agent = object.__new__(AIAgent)
+    agent.provider = "anthropic"
+    agent.model = "claude-sonnet-4-6"
+    return agent
+
+
+class TestShrinkImagePartsHelper:
+    def test_no_messages_returns_false(self):
+        agent = _make_agent()
+        assert agent._try_shrink_image_parts_in_messages([]) is False
+        assert agent._try_shrink_image_parts_in_messages(None) is False
+
+    def test_no_image_parts_returns_false(self):
+        agent = _make_agent()
+        msgs = [
+            {"role": "user", "content": "plain text"},
+            {"role": "assistant", "content": "ack"},
+        ]
+        assert agent._try_shrink_image_parts_in_messages(msgs) is False
+
+    def test_small_image_part_not_shrunk(self, monkeypatch):
+        """An image under 4 MB is left alone — shrink helper only touches oversized ones."""
+        agent = _make_agent()
+        small_url = _big_png_data_url(100)  # ~100 KB + b64 overhead
+
+        resize_hits = {"count": 0}
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            lambda *a, **kw: resize_hits.__setitem__("count", resize_hits["count"] + 1) or small_url,
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "hi"},
+                {"type": "image_url", "image_url": {"url": small_url}},
+            ],
+        }]
+        assert agent._try_shrink_image_parts_in_messages(msgs) is False
+        assert resize_hits["count"] == 0
+        # URL unchanged.
+        assert msgs[0]["content"][1]["image_url"]["url"] == small_url
+
+    def test_oversized_image_url_dict_shape_rewritten(self, monkeypatch):
+        """OpenAI chat.completions shape: {image_url: {url: data:...}}."""
+        agent = _make_agent()
+        oversized_url = _big_png_data_url(5000)  # ~5 MB raw → ~6.7 MB b64
+        shrunk = "data:image/jpeg;base64," + "A" * 1000  # small
+
+        def _fake_resize(path, mime_type=None, max_base64_bytes=None):
+            return shrunk
+
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            _fake_resize,
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "look"},
+                {"type": "image_url", "image_url": {"url": oversized_url}},
+            ],
+        }]
+        changed = agent._try_shrink_image_parts_in_messages(msgs)
+        assert changed is True
+        assert msgs[0]["content"][1]["image_url"]["url"] == shrunk
+
+    def test_oversized_input_image_string_shape_rewritten(self, monkeypatch):
+        """OpenAI Responses shape: {type: input_image, image_url: "data:..."}."""
+        agent = _make_agent()
+        oversized_url = _big_png_data_url(5000)
+        shrunk = "data:image/jpeg;base64," + "B" * 1000
+
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            lambda *a, **kw: shrunk,
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "input_text", "text": "look"},
+                {"type": "input_image", "image_url": oversized_url},
+            ],
+        }]
+        changed = agent._try_shrink_image_parts_in_messages(msgs)
+        assert changed is True
+        assert msgs[0]["content"][1]["image_url"] == shrunk
+
+    def test_multiple_images_all_shrunk(self, monkeypatch):
+        agent = _make_agent()
+        big1 = _big_png_data_url(5000)
+        big2 = _big_png_data_url(6000)
+        shrunk = "data:image/jpeg;base64," + "C" * 500
+
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            lambda *a, **kw: shrunk,
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "compare"},
+                {"type": "image_url", "image_url": {"url": big1}},
+                {"type": "image_url", "image_url": {"url": big2}},
+            ],
+        }]
+        changed = agent._try_shrink_image_parts_in_messages(msgs)
+        assert changed is True
+        assert msgs[0]["content"][1]["image_url"]["url"] == shrunk
+        assert msgs[0]["content"][2]["image_url"]["url"] == shrunk
+
+    def test_http_url_images_not_touched(self, monkeypatch):
+        """Only data: URLs are candidates — http URLs are server-fetched."""
+        agent = _make_agent()
+
+        resize_hits = {"count": 0}
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            lambda *a, **kw: resize_hits.__setitem__("count", resize_hits["count"] + 1) or "shrunk",
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "at this url"},
+                {"type": "image_url", "image_url": {"url": "https://example.com/big.png"}},
+            ],
+        }]
+        assert agent._try_shrink_image_parts_in_messages(msgs) is False
+        assert resize_hits["count"] == 0
+
+    def test_shrink_failure_returns_false_and_leaves_url_intact(self, monkeypatch):
+        """If re-encode fails, leave the URL alone so the caller surfaces the original error."""
+        agent = _make_agent()
+        oversized_url = _big_png_data_url(5000)
+
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            lambda *a, **kw: None,  # resize returned nothing usable
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": oversized_url}},
+            ],
+        }]
+        assert agent._try_shrink_image_parts_in_messages(msgs) is False
+        assert msgs[0]["content"][0]["image_url"]["url"] == oversized_url
+
+    def test_shrink_that_makes_it_bigger_rejected(self, monkeypatch):
+        """If the 'shrink' somehow produces a larger payload, skip it."""
+        agent = _make_agent()
+        oversized_url = _big_png_data_url(5000)
+        even_bigger = "data:image/png;base64," + "Z" * (10 * 1024 * 1024)
+
+        monkeypatch.setattr(
+            "tools.vision_tools._resize_image_for_vision",
+            lambda *a, **kw: even_bigger,
+            raising=False,
+        )
+
+        msgs = [{
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": oversized_url}},
+            ],
+        }]
+        assert agent._try_shrink_image_parts_in_messages(msgs) is False
+        # Original URL still in place, not replaced by the bigger one.
+        assert msgs[0]["content"][0]["image_url"]["url"] == oversized_url
--- a/tests/run_agent/test_vision_aware_preprocessing.py
+++ b/tests/run_agent/test_vision_aware_preprocessing.py
@@ -0,0 +1,170 @@
+"""Tests for the vision-aware image preprocessing in run_agent.py.
+
+Covers:
+
+* ``_prepare_anthropic_messages_for_api`` — passes image parts through
+  unchanged when the active model reports ``supports_vision=True`` (the
+  adapter handles them natively), and falls back to text-description
+  replacement when the model lacks vision.
+
+* ``_prepare_messages_for_non_vision_model`` — the mirror method for the
+  chat.completions / codex_responses paths. Same contract.
+"""
+
+from __future__ import annotations
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from run_agent import AIAgent
+
+
+def _make_agent() -> AIAgent:
+    """Build a bare-bones AIAgent instance without running __init__.
+
+    Avoids the heavy provider/credential setup for these pure-method tests.
+    """
+    agent = object.__new__(AIAgent)
+    agent.provider = "anthropic"
+    agent.model = "claude-sonnet-4"
+    agent._anthropic_image_fallback_cache = {}
+    return agent
+
+
+IMG_PARTS_USER_MSG = {
+    "role": "user",
+    "content": [
+        {"type": "text", "text": "What's in this image?"},
+        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
+    ],
+}
+
+PLAIN_USER_MSG = {"role": "user", "content": "hello, no images here"}
+
+
+# ─── _prepare_anthropic_messages_for_api ─────────────────────────────────────
+
+
+class TestPrepareAnthropicMessages:
+    def test_no_images_passes_through(self):
+        agent = _make_agent()
+        msgs = [PLAIN_USER_MSG]
+        out = agent._prepare_anthropic_messages_for_api(msgs)
+        assert out is msgs  # unchanged reference
+
+    def test_vision_capable_passes_images_through(self):
+        """The Anthropic adapter handles image_url/input_image natively."""
+        agent = _make_agent()
+        with patch.object(agent, "_model_supports_vision", return_value=True):
+            out = agent._prepare_anthropic_messages_for_api([IMG_PARTS_USER_MSG])
+        # Passes through unchanged — image_url parts still present.
+        assert out[0]["content"][1]["type"] == "image_url"
+
+    def test_non_vision_replaces_images_with_text(self):
+        agent = _make_agent()
+        with patch.object(agent, "_model_supports_vision", return_value=False), \
+             patch.object(
+                 agent,
+                 "_describe_image_for_anthropic_fallback",
+                 return_value="[Image description: a cat]",
+             ):
+            out = agent._prepare_anthropic_messages_for_api([IMG_PARTS_USER_MSG])
+        # Content collapsed to a string containing the description + user text.
+        content = out[0]["content"]
+        assert isinstance(content, str)
+        assert "[Image description: a cat]" in content
+        assert "What's in this image?" in content
+        # No more image parts.
+        assert "image_url" not in content
+
+
+# ─── _prepare_messages_for_non_vision_model ──────────────────────────────────
+
+
+class TestPrepareMessagesForNonVision:
+    def test_no_images_passes_through(self):
+        agent = _make_agent()
+        msgs = [PLAIN_USER_MSG]
+        out = agent._prepare_messages_for_non_vision_model(msgs)
+        assert out is msgs
+
+    def test_vision_capable_passes_through(self):
+        """For vision-capable models on chat.completions path, provider handles pixels."""
+        agent = _make_agent()
+        agent.provider = "openrouter"
+        agent.model = "anthropic/claude-sonnet-4"
+        with patch.object(agent, "_model_supports_vision", return_value=True):
+            out = agent._prepare_messages_for_non_vision_model([IMG_PARTS_USER_MSG])
+        assert out[0]["content"][1]["type"] == "image_url"
+
+    def test_non_vision_strips_images(self):
+        agent = _make_agent()
+        agent.provider = "openrouter"
+        agent.model = "qwen/qwen3-235b-a22b"
+        with patch.object(agent, "_model_supports_vision", return_value=False), \
+             patch.object(
+                 agent,
+                 "_describe_image_for_anthropic_fallback",
+                 return_value="[Image description: a dog]",
+             ):
+            out = agent._prepare_messages_for_non_vision_model([IMG_PARTS_USER_MSG])
+        content = out[0]["content"]
+        assert isinstance(content, str)
+        assert "[Image description: a dog]" in content
+        assert "image_url" not in content
+
+    def test_multiple_messages_with_mixed_content(self):
+        agent = _make_agent()
+        agent.model = "qwen/qwen3-235b"
+        msgs = [
+            {"role": "user", "content": "first turn"},
+            {"role": "assistant", "content": "ack"},
+            IMG_PARTS_USER_MSG,
+        ]
+        with patch.object(agent, "_model_supports_vision", return_value=False), \
+             patch.object(
+                 agent,
+                 "_describe_image_for_anthropic_fallback",
+                 return_value="[Image: thing]",
+             ):
+            out = agent._prepare_messages_for_non_vision_model(msgs)
+        # First two messages unchanged (no images), third stripped.
+        assert out[0]["content"] == "first turn"
+        assert out[1]["content"] == "ack"
+        assert isinstance(out[2]["content"], str)
+        assert "[Image: thing]" in out[2]["content"]
+
+
+# ─── _model_supports_vision ──────────────────────────────────────────────────
+
+
+class TestModelSupportsVision:
+    def test_missing_provider_or_model_returns_false(self):
+        agent = _make_agent()
+        agent.provider = ""
+        agent.model = "claude-sonnet-4"
+        assert agent._model_supports_vision() is False
+        agent.provider = "anthropic"
+        agent.model = ""
+        assert agent._model_supports_vision() is False
+
+    def test_uses_get_model_capabilities(self):
+        agent = _make_agent()
+        fake_caps = MagicMock()
+        fake_caps.supports_vision = True
+        with patch("agent.models_dev.get_model_capabilities", return_value=fake_caps):
+            assert agent._model_supports_vision() is True
+        fake_caps.supports_vision = False
+        with patch("agent.models_dev.get_model_capabilities", return_value=fake_caps):
+            assert agent._model_supports_vision() is False
+
+    def test_none_caps_returns_false(self):
+        agent = _make_agent()
+        with patch("agent.models_dev.get_model_capabilities", return_value=None):
+            assert agent._model_supports_vision() is False
+
+    def test_exception_returns_false(self):
+        agent = _make_agent()
+        with patch("agent.models_dev.get_model_capabilities", side_effect=RuntimeError("boom")):
+            assert agent._model_supports_vision() is False
--- a/tools/vision_tools.py
+++ b/tools/vision_tools.py
@@ -754,7 +754,15 @@ from tools.registry import registry, tool_error

 VISION_ANALYZE_SCHEMA = {
    "name": "vision_analyze",
-    "description": "Analyze images using AI vision. Provides a comprehensive description and answers a specific question about the image content.",
+    "description": (
+        "Inspect an image from a URL, file path, or tool output when you need "
+        "closer detail than what's visible in the conversation. If the user's "
+        "image is already attached to the conversation and you can see it, "
+        "just answer directly — only call this tool for images referenced by "
+        "URL/path, images returned inside other tool results (browser "
+        "screenshots, search thumbnails), or when you need a deeper look at "
+        "a specific region the main model's vision may have missed."
+    ),
    "parameters": {
        "type": "object",
        "properties": {
--- a/tui_gateway/server.py
+++ b/tui_gateway/server.py
@@ -13,7 +13,7 @@ import time
 import uuid
 from datetime import datetime
 from pathlib import Path
-from typing import Optional
+from typing import Any, Optional

 from hermes_constants import get_hermes_home
 from hermes_cli.env_loader import load_hermes_dotenv
@@ -2274,7 +2274,60 @@ def _(rid, params: dict) -> dict:
                    return
                prompt = ctx.message

-            prompt = _enrich_with_attached_images(prompt, images) if images else prompt
+            # Decide image routing per-turn based on active provider/model.
+            # "native" → pass pixels to the main model as OpenAI-style content
+            # parts (adapters translate for Anthropic/Gemini/Bedrock/etc.).
+            # "text"   → pre-analyze with vision_analyze and prepend the text.
+            # See agent/image_routing.py for the full decision table.
+            run_message: Any = prompt
+            if images:
+                try:
+                    from agent.image_routing import (
+                        decide_image_input_mode,
+                        build_native_content_parts,
+                    )
+                    from agent.auxiliary_client import (
+                        _read_main_model,
+                        _read_main_provider,
+                    )
+                    from hermes_cli.config import load_config as _tui_load_config
+
+                    _cfg = _tui_load_config()
+                    _mode = decide_image_input_mode(
+                        _read_main_provider(),
+                        _read_main_model(),
+                        _cfg,
+                    )
+                except Exception as _img_exc:
+                    print(
+                        f"[tui_gateway] image_routing decision failed, defaulting to text: {_img_exc}",
+                        file=sys.stderr,
+                    )
+                    _mode = "text"
+
+                if _mode == "native":
+                    try:
+                        _parts, _skipped = build_native_content_parts(
+                            prompt,
+                            images,
+                        )
+                        if _skipped:
+                            print(
+                                f"[tui_gateway] native image attachment skipped {len(_skipped)} unreadable path(s)",
+                                file=sys.stderr,
+                            )
+                        if any(p.get("type") == "image_url" for p in _parts):
+                            run_message = _parts
+                        else:
+                            run_message = _enrich_with_attached_images(prompt, images)
+                    except Exception as _img_exc:
+                        print(
+                            f"[tui_gateway] native attach failed, falling back to text: {_img_exc}",
+                            file=sys.stderr,
+                        )
+                        run_message = _enrich_with_attached_images(prompt, images)
+                else:
+                    run_message = _enrich_with_attached_images(prompt, images)

            def _stream(delta):
                payload = {"text": delta}
@@ -2283,7 +2336,7 @@ def _(rid, params: dict) -> dict:
                _emit("message.delta", sid, payload)

            result = agent.run_conversation(
-                prompt,
+                run_message,
                conversation_history=list(history),
                stream_callback=_stream,
            )