fix(smolmachines): bridge host SIGWINCH into the VM PTY (issue #82) #83

Merged
didericis merged 6 commits from smolmachines-pty-resize-issue-82 into main 2026-05-27 21:03:17 -04:00
Collaborator

Closes #82.

Diagnosis

smolvm 0.8.0 machine exec -t allocates an in-VM PTY but never forwards the host terminal's window size to it. The in-VM PTY starts at 0 0, and any host-side resize during the session (tmux pane resize, terminal window resize) is dropped — the claude TUI keeps rendering for whatever tiny box it last saw. docker exec -it propagates window-size changes automatically via SIGWINCH; smolvm doesn't.

Reproduced: smolvm machine exec --name <M> -- stty -F /dev/pts/0 size reports 0 0 on a freshly-launched smolmachines bottle, even though the host terminal is a normal 80×24+.

Workaround

A small Python wrapper, claude_bottle/backend/smolmachines/pty_resize.py, interposes between the operator's terminal and smolvm machine exec. On startup and every host SIGWINCH, it runs a side-channel smolvm machine exec --name <M> -- sh -c 'for f in /dev/pts/*; do stty -F $f cols X rows Y; done'. The kernel delivers SIGWINCH to the foreground process group on the in-VM PTY's slave end when its size changes, so claude picks up the new dimensions without extra signalling.
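A minimal sketch of the two pieces described above: reading the host window size and building the side-channel command. The function names `read_winsize` and `build_side_channel_argv` are hypothetical and not taken from `pty_resize.py`; the command shape is the one quoted in this description.

```python
import fcntl
import struct
import termios


def read_winsize(fds=(0, 1, 2)):
    """Return (rows, cols) from the first fd that is a TTY with a non-zero size.

    Falls through stdin/stdout/stderr and skips any PTY that reports 0 0.
    Returns None when no usable size is found.
    """
    for fd in fds:
        try:
            packed = fcntl.ioctl(fd, termios.TIOCGWINSZ, b"\0" * 8)
        except OSError:
            continue  # not a TTY, or the fd is closed
        rows, cols, _xpixel, _ypixel = struct.unpack("HHHH", packed)
        if rows > 0 and cols > 0:
            return rows, cols
    return None


def build_side_channel_argv(machine, rows, cols):
    """Build the smolvm command that resizes every PTY inside the VM."""
    script = f"for f in /dev/pts/*; do stty -F $f cols {cols} rows {rows}; done"
    return ["smolvm", "machine", "exec", "--name", machine, "--", "sh", "-c", script]
```

Resizing every entry under `/dev/pts/` avoids having to discover which PTY the agent session owns, at the cost of also resizing any unrelated PTYs in the VM.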

SmolmachinesBottle.claude_argv prepends [sys.executable, -m, claude_bottle.backend.smolmachines.pty_resize, <machine>, --, ...] to the existing smolvm argv in TTY mode. Non-TTY mode (provisioning shell-outs that happen to go through this method) skips the wrapper — no PTY to resize.

The wrapper composes correctly with the dashboard's _build_resume_argv_with_fallback shell-wrap: the split-at-claude token still finds the right position because the wrapper's prefix wraps the entire smolvm-exec framing.

Tests

  • tests/unit/test_smolmachines_pty_resize.py (new): argv parsing, the side-channel command shape (cols/rows order, for-loop over /dev/pts/*), and _read_winsize's fallback across stdin/stdout/stderr including the ironic case where the smolvm-allocated PTY reports 0 0 and gets skipped.
  • tests/unit/test_smolmachines_bottle.py: TTY-mode assertions now unwrap the pty_resize prefix; new TestClaudeArgvNoTTY class locks the non-TTY skip.

636 unit tests pass.

Removable

This whole module can come out once smolvm grows native SIGWINCH forwarding. Upstream report not filed yet — followup tracked separately.

didericis-claude added 1 commit 2026-05-27 20:15:31 -04:00
fix(smolmachines): bridge host SIGWINCH into the VM PTY (issue #82)
test / unit (pull_request) Successful in 28s
test / integration (pull_request) Successful in 41s
3fb305f654
`smolvm 0.8.0 machine exec -t` allocates an in-VM PTY but never
forwards the host terminal's window size — the PTY starts at
`0 0` and host resizes (tmux pane resize, terminal window
resize) go unnoticed, so the claude TUI inside a smolmachines
bottle renders for whatever tiny box it last saw and ignores
operator resizes. `docker exec -it` propagates window-size
changes automatically; smolvm doesn't.

Workaround: a small Python wrapper
(`backend/smolmachines/pty_resize.py`) that interposes between
the operator's terminal and `smolvm machine exec`. It spawns
smolvm as a child, traps host SIGWINCH, and on every resize
(plus once at startup) runs a side-channel
`smolvm machine exec --name <M> -- sh -c 'for f in /dev/pts/*;
do stty -F $f cols X rows Y; done'`. The kernel delivers
SIGWINCH to the in-VM foreground process group when the slave
PTY's size changes, so claude picks up the new dimensions
without extra signalling.
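The spawn-and-trap loop described above could look roughly like this. It is a sketch under assumptions, not the wrapper's actual code: `run_with_resize_bridge` and its `on_resize` callback are hypothetical names.

```python
import signal
import subprocess


def run_with_resize_bridge(child_argv, on_resize):
    """Run child_argv and call on_resize() on every host SIGWINCH.

    Returns the child's exit code.
    """
    # Install the handler before spawning so an early resize is not lost.
    previous = signal.signal(signal.SIGWINCH, lambda signum, frame: on_resize())
    try:
        proc = subprocess.Popen(child_argv)
        # wait() is retried automatically after the handler runs (PEP 475).
        return proc.wait()
    finally:
        # previous is None only if the old handler was not installed from Python.
        if previous is not None:
            signal.signal(signal.SIGWINCH, previous)
```

The child does not inherit the Python-level handler: handlers that are caught (rather than ignored) are reset to the default on exec.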

`SmolmachinesBottle.claude_argv` prepends
`[sys.executable, -m, claude_bottle.backend.smolmachines.
pty_resize, <machine>, --, ...]` to the existing smolvm argv
in TTY mode. Non-TTY mode (provisioning shell-outs) skips the
wrapper — no PTY to resize.

The wrapper survives the dashboard's
`_build_resume_argv_with_fallback` shell-wrap because the
split-at-`claude` token still finds the right position — the
wrapper's prefix wraps the entire smolvm-exec framing.

Tests:
- `test_smolmachines_pty_resize.py` (new): argv parsing, the
  side-channel command shape (cols/rows / for-loop over
  /dev/pts/*), and `_read_winsize`'s fallback across
  stdin/stdout/stderr including the smolvm-allocated-PTY-
  reports-`0 0` ironic case.
- `test_smolmachines_bottle.py`: updated TTY-mode assertions
  to unwrap the pty_resize prefix; added `TestClaudeArgvNoTTY`
  to lock the non-TTY skip.

636 unit tests pass.

Removable when smolvm grows native SIGWINCH forwarding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Owner

> On startup and every host SIGWINCH, it runs a side-channel smolvm machine exec --name <M> -- sh -c 'for f in /dev/pts/*; do stty -F $f cols X rows Y; done'. The kernel delivers SIGWINCH to the foreground process group on the in-VM PTY's slave end when its size changes, so claude picks up the new dimensions without extra signalling.

How much overhead does this add?
Author
Collaborator

Measured on my box against a running claude-bottle smolvm machine:

At rest (no resize): zero CPU. The wrapper Python process just blocks on proc.wait() — no signal arrives, no syscalls run. Memory cost is one Python interpreter, ~12 MB RSS.

Per SIGWINCH event: the side-channel smolvm machine exec -- sh -c 'for f in /dev/pts/*; do stty -F $f cols X rows Y; done' takes about 50 ms in steady state (10 iters: 110, 53, 53, 49, 53, 56, 49, 41, 55, 51 ms — first is colder, rest are warm). That's almost entirely smolvm setup + libkrun exec channel round-trip, not the stty work.

Frequency: SIGWINCH is human-driven — fires only when the terminal window or tmux pane changes size. Typical session: zero events. Active drag-to-resize: maybe 5-10 events bursted over a second or two while the user is dragging, then nothing.

Worst case (someone holding the corner and dragging the window slowly for a few seconds): a queue of 50 ms exec calls trailing the drag, settling within ~100 ms of when they stop. Imperceptible against the 50–200 ms claude TUI redraw it triggers anyway.

If the per-event cost ever becomes load-bearing we could keep a long-lived smolvm machine exec session open over a pipe and poke it with stty commands instead of paying smolvm startup every time — but at one event per resize and 50 ms per event, the complexity isn't worth it.
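For reference, the "about 50 ms" figure can be checked from the ten samples quoted above. This only summarizes those numbers; it does not re-measure anything.

```python
from statistics import mean, median

# The ten per-event timings quoted above, in milliseconds.
samples_ms = [110, 53, 53, 49, 53, 56, 49, 41, 55, 51]
warm_ms = samples_ms[1:]  # drop the first (cold) iteration

mean_all_ms = mean(samples_ms)           # 57 ms including the cold run
mean_warm_ms = round(mean(warm_ms), 1)   # about 51.1 ms for the warm runs
median_warm_ms = median(warm_ms)         # 53 ms
```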

didericis added 1 commit 2026-05-27 20:26:46 -04:00
fix(smolmachines): invoke pty_resize by absolute path, not python -m
test / unit (pull_request) Successful in 25s
test / integration (pull_request) Successful in 40s
794e8666e1
The dashboard's launch path crashed inside tmux but worked
outside it. Root cause: `python -m
claude_bottle.backend.smolmachines.pty_resize` needs the
`claude_bottle` package on `sys.path`, which by default comes
from cwd. The outside-tmux path is `subprocess.run(...)` —
inherits the dashboard process's cwd (the repo root, where
`claude_bottle/` lives), so the import resolves. The
inside-tmux path is `tmux split-window / respawn-pane <argv>`,
and tmux opens the new pane with the pane's OWN cwd, not the
cwd of the process invoking split-window. If the operator
started their tmux pane anywhere outside the repo (typical:
`$HOME`), the wrapper hit `ModuleNotFoundError: No module
named 'claude_bottle'` and tmux closed the pane immediately.

Sidestep the cwd dependence by invoking the wrapper as
`python <absolute-path-to-pty_resize.py>` instead of
`python -m <dotted-path>`. The wrapper has no
`claude_bottle.*` imports — it's stdlib-only — so it runs as
a standalone script anywhere on the filesystem. The absolute
path comes from `pty_resize.__file__` at module-load time.
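
A sketch of the resulting argv shape, assuming the script path is
passed in (in the package it would come from the wrapper module's
`__file__`). `wrapper_prefix` and `wrap_claude_argv` are
hypothetical names, not the project's API.

```python
import os
import sys


def wrapper_prefix(script_path, machine):
    """Argv prefix that runs the wrapper as a standalone script."""
    return [sys.executable, os.path.abspath(script_path), machine, "--"]


def wrap_claude_argv(smolvm_argv, machine, script_path, tty):
    """Prepend the wrapper in TTY mode; leave non-TTY argv untouched."""
    if not tty:
        return list(smolvm_argv)  # no PTY to resize
    return wrapper_prefix(script_path, machine) + list(smolvm_argv)
```

Because the interpreter is handed a file path rather than a
dotted module name, nothing has to be importable from cwd.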

Tests:
- `test_pty_resize_wrapper_prefix`: updated to assert the
  absolute-script-path shape rather than the `-m <dotted>`
  shape.
- `test_no_wrapper_when_tty_false`: the substring check now
  uses `any("pty_resize" in a for a in argv)` instead of
  string-joining (so the absolute path's "pty_resize.py"
  filename match still catches a regression).

636 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
didericis added 1 commit 2026-05-27 20:37:53 -04:00
chore(smolmachines): instrument pty_resize wrapper for crash diagnosis
test / unit (pull_request) Successful in 28s
test / integration (pull_request) Successful in 41s
37bd11b375
User reports launch crashing only inside tmux (works outside).
The wrapper itself runs fine in standalone tmux repros, so the
break is in some interaction we can't see — curses eats stderr,
default tmux remain-on-exit is off, and the pane closes before
the operator can read anything.

Add an always-on per-pid log at ~/.claude-bottle/pty_resize.log:

  - start record: argv, cwd, PATH, TMUX status
  - sync record: window size observed
  - child pid + exit rc
  - any KeyboardInterrupt forwarding
  - Popen failure traceback if it dies

Append-mode, small overhead, easy to grep + share.

Removable (along with the wrapper itself) once smolvm forwards
SIGWINCH natively.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
didericis added 1 commit 2026-05-27 20:44:02 -04:00
fix(smolmachines): give pty_resize side-channel DEVNULL stdin so it survives under tmux
test / unit (pull_request) Successful in 26s
test / integration (pull_request) Successful in 40s
b9853ae0c7
Inside tmux the dashboard's smolmachines launch crashed within
~100ms of the wrapper Popen-ing the main smolvm exec child —
sometimes with rc=137 (SIGKILL), sometimes with smolvm
spitting a runc-style "load `config.json`: cannot parse the
data: parse error: trailing garbage" and exiting 1. The same
wrapper ran fine outside tmux. Diagnostic logs showed the
SIGKILL landed ~100ms after the wrapper kicked off its
initial `sync()` (which fires the side-channel smolvm exec).

Root cause: the side-channel `subprocess.run([smolvm, machine,
exec, --, sh, -c, ...])` did not specify `stdin=`, so it
inherited the wrapper's stdin — the tmux pane PTY. The main
smolvm child (the agent session) also had that PTY as stdin.
Two concurrent smolvm processes sharing the PTY's
foreground-process-group / input plumbing caused smolvm to
abort one of them. iTerm's PTY plumbing apparently tolerated
this; tmux's didn't.

Fix is one line in `_push_size`: `stdin=subprocess.DEVNULL`.
The side-channel never needs stdin — it runs a fire-and-forget
`stty` and exits. Verified end-to-end: pre-fix the wrapper
crashed under `tmux respawn-pane` against a live VM; post-fix
the same invocation completes cleanly.
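
A minimal illustration of the invariant, assuming a helper named
`run_side_channel` (hypothetical; the real function is
`_push_size`, whose other arguments are not shown here):

```python
import subprocess


def run_side_channel(argv, run=subprocess.run):
    """Run a fire-and-forget command that must not share the caller's stdin."""
    # stdin=DEVNULL keeps the side-channel off the pane PTY that the
    # main child is already using as its stdin.
    return run(argv, stdin=subprocess.DEVNULL, check=False)
```

The `run` parameter exists only so a test can substitute a fake
and assert on the keyword arguments.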

Also drop the diagnostic log added in 37bd11b — we have the
fix.

Regression test:
`test_side_channel_uses_devnull_stdin` locks the
`stdin=DEVNULL` invariant so a future "let's simplify the
subprocess.run kwargs" refactor surfaces this immediately.

637 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
didericis added 1 commit 2026-05-27 20:47:35 -04:00
chore(smolmachines): re-add pty_resize debug log (temp, for issue diagnosis)
test / unit (pull_request) Successful in 28s
test / integration (pull_request) Successful in 41s
9c83ea6428
User reports the launch still crashes in tmux after b9853ae's
stdin=DEVNULL fix. Re-instrument to capture the next failure mode
(argv, ppid, sync size, child exit, Popen tracebacks).

Removable once the inside-tmux launch is confirmed stable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
didericis added 1 commit 2026-05-27 20:55:03 -04:00
fix(smolmachines): defer pty_resize startup sync to dodge libkrun's bringup race
test / unit (pull_request) Successful in 26s
test / integration (pull_request) Successful in 41s
test / unit (push) Successful in 27s
test / integration (push) Successful in 45s
aa5aa1f031
The b9853ae stdin=DEVNULL fix wasn't sufficient. End-to-end
testing against a live VM in tmux revealed a second crash path:
libkrun spits "load \`config.json\`: parse error: trailing
garbage { \"ociVersion\": \"1.0.2\", ... }" and the main exec
dies (rc=1 or SIGKILL/rc=137, depending on race scheduling).

Root cause: each `smolvm machine exec` writes a per-invocation
OCI config.json to the same smolvm state dir during its bringup.
The wrapper's startup sync() fires within 1ms of Popen-ing the
main exec — both invocations write config.json concurrently,
libkrun loads one mid-write, and gets garbage. Trivial inner
commands (`sh -c "echo hi"`) finished before the overlap
mattered, masking the race in earlier tests. claude's slower
startup hits the race every time, and only inside tmux because
the outside-tmux foreground-handoff path takes a different
bringup sequence that happens to dodge the window.

Fix: schedule the initial sync on a 2-second `threading.Timer`
instead of calling it synchronously. By 2s the main exec is
past its bringup window, so the side-channel's config.json
write doesn't collide. Daemon thread so the timer doesn't
block exit when the child finishes quickly.
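
The deferral can be sketched as follows. `schedule_startup_sync`
and `STARTUP_SYNC_DELAY_SEC` are illustrative names; only the
2-second delay and the daemon flag come from this commit.

```python
import threading

STARTUP_SYNC_DELAY_SEC = 2.0  # assumed name for the 2-second delay


def schedule_startup_sync(sync, delay=STARTUP_SYNC_DELAY_SEC):
    """Schedule sync() on a daemon timer instead of calling it inline."""
    timer = threading.Timer(delay, sync)
    timer.daemon = True  # do not block interpreter exit if the child ends early
    timer.start()
    return timer
```

Returning the timer lets the caller cancel it if the child exits
before the delay elapses.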

Trade-off: the in-VM PTY uses smolvm's default size for the
first ~2s, then snaps to the host pane size when the timer
fires. Verified end-to-end against a live VM in tmux: claude
renders at the default size during bringup, then redraws at
full pane width once the deferred sync lands. Operator-driven
resizes (SIGWINCH) still bridge in real time via the
already-installed signal handler.

Also drop the diagnostic log added in 9c83ea6 — we have the
fix.

Regression test:
`TestStartupSyncDeferred.test_main_schedules_timer_does_not_
call_sync_synchronously` mocks Popen + Timer + _push_size and
asserts `main()` schedules the timer with the documented
delay constant and never invokes _push_size synchronously.
Catches a "let's just inline the sync() call" regression
immediately.

638 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
didericis reviewed 2026-05-27 20:58:09 -04:00
@@ -27,0 +33,4 @@
# with ModuleNotFoundError whenever the operator's tmux pane was
# started from anywhere outside the claude-bottle repo. Absolute
# path sidesteps the cwd dependence (the wrapper has no
# claude_bottle.* imports, so it runs as a standalone script).
Owner

This wasn't the core issue, so I don't know if this logic is true. But it doesn't hurt/might be true, I think?

didericis reviewed 2026-05-27 21:00:15 -04:00
@@ -0,0 +148,4 @@
# the side-channel writes a fresh config.json without
# collision, and the in-VM PTY is sized before claude has
# finished rendering its first frame. daemon=True so the
# timer doesn't block exit when the child finishes quickly.
Owner

comment is redundant with comment for the constant

didericis approved these changes 2026-05-27 21:02:59 -04:00
didericis left a comment
Owner

Fixed

<video src="attachments/b9feea9f-bb71-439e-9fe6-45039ba8ee35" title="May-27-2026 20-56-29.mp4" controls></video>
didericis merged commit aa5aa1f031 into main 2026-05-27 21:03:17 -04:00
didericis deleted branch smolmachines-pty-resize-issue-82 2026-05-27 21:03:18 -04:00
didericis-claude reviewed 2026-05-27 21:05:54 -04:00
didericis-claude left a comment
Author
Collaborator

Both addressed in follow-up b3c6d66 (direct to main).

@@ -27,0 +33,4 @@
# with ModuleNotFoundError whenever the operator's tmux pane was
# started from anywhere outside the claude-bottle repo. Absolute
# path sidesteps the cwd dependence (the wrapper has no
# claude_bottle.* imports, so it runs as a standalone script).
Author
Collaborator

Fair call. The tmux-pane-cwd hypothesis was just a guess that turned out not to be the actual bug (libkrun's bringup race was), and I never went back to verify it. Rewrote the comment in b3c6d66 to drop the speculation and just state the operative reason: the wrapper is self-contained (no claude_bottle.* imports), so an absolute-path invocation works regardless of cwd. Same one-line _PTY_RESIZE_SCRIPT = _pty_resize.__file__ line, leaner justification.

@@ -0,0 +148,4 @@
# the side-channel writes a fresh config.json without
# collision, and the in-VM PTY is sized before claude has
# finished rendering its first frame. daemon=True so the
# timer doesn't block exit when the child finishes quickly.
Author
Collaborator

Right — collapsed in b3c6d66 to a one-liner pointing at _STARTUP_SYNC_DELAY_SEC plus the daemon=True operational note. The libkrun-race + warm-VM-timing explanation now lives only on the constant.
