perf(tui): instrument stdout drain — rule out terminal parse bottleneck

Adds four fields to FrameEvent.phases and the matching profile
summary:

  optimizedPatches  post-optimize patch count (what's actually
                    written to stdout; the .patches field is
                    pre-optimize)
  writeBytes        UTF-8 byte count of the write this frame
  backpressure      true when Node's stdout.write returned false
                    (Writable buffer full — outer terminal can't
                    keep up)
  prevFrameDrainMs  end-to-end drain time of the PREVIOUS frame's
                    write, captured from stdout.write's 2-arg
                    callback.  Reported on the next frame so the
                    measurement reflects "time until OS flushed
                    the bytes to the terminal fd", not "time until
                    queued in Node".

writeDiffToTerminal() now returns { bytes, backpressure } and
accepts an optional onDrain callback.  Only attached on TTY with
diff; piped/non-TTY stdout bypasses flow control so the callback
would fire synchronously anyway.

Initial measurements under hold-wheel_up against 1106-msg session
(30Hz for 6s):

  patches total    28,888
  optimized total  16,700   (ratio 0.58 — optimizer cuts ~42%)
  writeBytes       42 KB / 10s = 4.2 KB/s throughput
  drainMs p50      0.14 ms   terminal accepts bytes instantly
  drainMs p99      0.85 ms
  backpressure     0% of frames

This rules out the terminal-parse hypothesis — Cursor's xterm.js
drains our output in sub-millisecond time at only 4 KB/s.  The
remaining lag has to be in the render pipeline, not the wire.
Profile output now includes the bytes+drain+backpressure lines to
keep this visible on every subsequent iteration.
This commit is contained in:
Brooklyn Nicholson
2026-04-26 17:06:22 -05:00
parent d3dedf10aa
commit f823535db2
6 changed files with 126 additions and 4 deletions

View File

@@ -219,6 +219,45 @@ def format_report(data: dict[str, Any]) -> str:
f" patches p50={pct(patches,0.5):.0f} p99={pct(patches,0.99):.0f} "
f"max={max(patches)} total={sum(patches)}"
)
optimized = [
f["phases"].get("optimizedPatches", 0)
for f in frames if f.get("phases")
]
if any(optimized):
out.append(
f" optimized p50={pct(optimized,0.5):.0f} p99={pct(optimized,0.99):.0f} "
f"max={max(optimized)} total={sum(optimized)}"
f" (ratio: {sum(optimized)/max(1,sum(patches)):.2f})"
)
# Write bytes + drain telemetry — the outer-terminal bottleneck gauge.
bytes_written = [
f["phases"].get("writeBytes", 0)
for f in frames if f.get("phases")
]
if any(bytes_written):
total_b = sum(bytes_written)
kb = total_b / 1024
out.append(
f" writeBytes p50={pct(bytes_written,0.5):.0f}B p99={pct(bytes_written,0.99):.0f}B "
f"max={max(bytes_written)}B total={kb:.1f}KB"
)
drains = [
f["phases"].get("prevFrameDrainMs", 0)
for f in frames if f.get("phases")
]
if any(d > 0 for d in drains):
nonzero = [d for d in drains if d > 0]
out.append(
f" drainMs p50={pct(nonzero,0.5):.2f} p95={pct(nonzero,0.95):.2f} "
f"p99={pct(nonzero,0.99):.2f} max={max(nonzero):.2f} (terminal flush latency)"
)
backpressure = sum(1 for f in frames if f.get("phases", {}).get("backpressure"))
if backpressure:
out.append(
f" backpressure: {backpressure}/{len(frames)} frames "
f"({100*backpressure/len(frames):.0f}%) (Node stdout buffer full — terminal slow)"
)
# Flickers
flicker_frames = [f for f in frames if f.get("flickers")]