perf(tui): instrument stdout drain — rule out terminal parse bottleneck

Adds four fields to FrameEvent.phases and the matching profile summary:

  optimizedPatches   post-optimize patch count (what's actually written
                     to stdout; the .patches field is pre-optimize)
  writeBytes         UTF-8 byte count of the write this frame
  backpressure       true when Node's stdout.write returned false
                     (Writable buffer full: outer terminal can't keep up)
  prevFrameDrainMs   end-to-end drain time of the PREVIOUS frame's write,
                     captured from stdout.write's 2-arg callback.
                     Reported on the next frame so the measurement
                     reflects "time until OS flushed the bytes to the
                     terminal fd", not "time until queued in Node".

writeDiffToTerminal() now returns { bytes, backpressure } and accepts an
optional onDrain callback. Only attached on TTY with diff; piped/non-TTY
stdout bypasses flow control so the callback would fire synchronously
anyway.

Initial measurements under hold-wheel_up against 1106-msg session
(30Hz for 6s):

  patches total     28,888
  optimized total   16,700   (ratio 0.58; optimizer cuts ~42%)
  writeBytes        42 KB / 10s = 4.2 KB/s throughput
  drainMs p50       0.14 ms  terminal accepts bytes instantly
  drainMs p99       0.85 ms
  backpressure      0% of frames

This rules out the terminal-parse hypothesis: Cursor's xterm.js drains
our output in sub-millisecond time at only 4 KB/s. The remaining lag has
to be in the render pipeline, not the wire.

Profile output now includes the bytes+drain+backpressure lines to keep
this visible on every subsequent iteration.
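
For readers who want to reproduce the summary numbers by hand, here is a minimal sketch of how the four new fields could be aggregated from a list of frame events. It is not the code in this commit: `summarize_drain` is a hypothetical name, `pct` is a nearest-rank stand-in for the script's own helper (whose definition is not shown here), and the sample frames are invented purely for illustration.

```python
from typing import Any


def pct(values: list[float], q: float) -> float:
    """Nearest-rank percentile. Hypothetical stand-in for the script's pct()."""
    if not values:
        return 0.0
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def summarize_drain(frames: list[dict[str, Any]]) -> dict[str, float]:
    """Aggregate the new FrameEvent.phases fields into headline numbers."""
    phases = [f["phases"] for f in frames if f.get("phases")]
    patches = sum(p.get("patches", 0) for p in phases)
    optimized = sum(p.get("optimizedPatches", 0) for p in phases)
    # prevFrameDrainMs describes the previous frame's write, so frames with
    # no measurement yet report 0 and are excluded from the percentiles.
    drains = [
        p["prevFrameDrainMs"] for p in phases if p.get("prevFrameDrainMs", 0) > 0
    ]
    stalled = sum(1 for p in phases if p.get("backpressure"))
    return {
        "ratio": optimized / max(1, patches),
        "total_kb": sum(p.get("writeBytes", 0) for p in phases) / 1024,
        "drain_p50_ms": pct(drains, 0.5),
        "backpressure_pct": 100 * stalled / max(1, len(phases)),
    }


# Invented sample data: four frames, one of which hit backpressure.
frames = [
    {"phases": {"patches": 10, "optimizedPatches": 6, "writeBytes": 512,
                "prevFrameDrainMs": 0.0, "backpressure": False}},
    {"phases": {"patches": 10, "optimizedPatches": 5, "writeBytes": 1536,
                "prevFrameDrainMs": 0.2, "backpressure": False}},
    {"phases": {"patches": 10, "optimizedPatches": 6, "writeBytes": 1024,
                "prevFrameDrainMs": 0.1, "backpressure": False}},
    {"phases": {"patches": 10, "optimizedPatches": 5, "writeBytes": 1024,
                "prevFrameDrainMs": 0.4, "backpressure": True}},
]

if __name__ == "__main__":
    for key, value in summarize_drain(frames).items():
        print(f"{key:18s} {value:.2f}")
```

A ratio well below 1 means the optimizer is discarding a large share of patches before the write, while a low drain p50 together with a backpressure share near zero points away from the terminal as the bottleneck.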
2026-04-26 17:06:22 -05:00
parent d3dedf10aa
commit f823535db2
6 changed files with 126 additions and 4 deletions
--- a/scripts/profile-tui.py
+++ b/scripts/profile-tui.py
@@ -219,6 +219,45 @@ def format_report(data: dict[str, Any]) -> str:
                    f"  patches     p50={pct(patches,0.5):.0f}  p99={pct(patches,0.99):.0f}  "
                    f"max={max(patches)}  total={sum(patches)}"
                )
+            optimized = [
+                f["phases"].get("optimizedPatches", 0)
+                for f in frames if f.get("phases")
+            ]
+            if any(optimized):
+                out.append(
+                    f"  optimized   p50={pct(optimized,0.5):.0f}  p99={pct(optimized,0.99):.0f}  "
+                    f"max={max(optimized)}  total={sum(optimized)}"
+                    f"  (ratio: {sum(optimized)/max(1,sum(patches)):.2f})"
+                )
+
+            # Write bytes + drain telemetry — the outer-terminal bottleneck gauge.
+            bytes_written = [
+                f["phases"].get("writeBytes", 0)
+                for f in frames if f.get("phases")
+            ]
+            if any(bytes_written):
+                total_b = sum(bytes_written)
+                kb = total_b / 1024
+                out.append(
+                    f"  writeBytes  p50={pct(bytes_written,0.5):.0f}B  p99={pct(bytes_written,0.99):.0f}B  "
+                    f"max={max(bytes_written)}B  total={kb:.1f}KB"
+                )
+            drains = [
+                f["phases"].get("prevFrameDrainMs", 0)
+                for f in frames if f.get("phases")
+            ]
+            if any(d > 0 for d in drains):
+                nonzero = [d for d in drains if d > 0]
+                out.append(
+                    f"  drainMs     p50={pct(nonzero,0.5):.2f}  p95={pct(nonzero,0.95):.2f}  "
+                    f"p99={pct(nonzero,0.99):.2f}  max={max(nonzero):.2f}   (terminal flush latency)"
+                )
+            backpressure = sum(1 for f in frames if f.get("phases", {}).get("backpressure"))
+            if backpressure:
+                out.append(
+                    f"  backpressure: {backpressure}/{len(frames)} frames "
+                    f"({100*backpressure/len(frames):.0f}%)   (Node stdout buffer full — terminal slow)"
+                )

        # Flickers
        flicker_frames = [f for f in frames if f.get("flickers")]