fix(smolmachines): defer pty_resize startup sync to dodge libkrun's bringup race
The b9853ae stdin=DEVNULL fix wasn't sufficient. End-to-end testing against a live VM in tmux revealed a second crash path: libkrun spits "load `config.json`: parse error: trailing garbage { "ociVersion": "1.0.2", ... }" and the main exec dies (rc=1 or SIGKILL/rc=137, depending on race scheduling).

Root cause: each `smolvm machine exec` writes a per-invocation OCI config.json to the same smolvm state dir during its bringup. The wrapper's startup sync() fires within 1ms of Popen-ing the main exec, so both invocations write config.json concurrently; libkrun loads one mid-write and gets garbage.

Trivial inner commands (`sh -c "echo hi"`) finished before the overlap mattered, masking the race in earlier tests. claude's slower startup hits the race every time, and only inside tmux, because the outside-tmux foreground-handoff path takes a different bringup sequence that happens to dodge the window.

Fix: schedule the initial sync on a 2-second `threading.Timer` instead of calling it synchronously. By 2s the main exec is past its bringup window, so the side-channel's config.json write doesn't collide. The timer runs as a daemon thread so it doesn't block exit when the child finishes quickly.

Trade-off: the in-VM PTY uses smolvm's default size for the first ~2s, then snaps to the host pane size when the timer fires. Verified end-to-end against a live VM in tmux: claude renders at the default size during bringup, then redraws at full pane width once the deferred sync lands. Operator-driven resizes (SIGWINCH) still bridge in real time via the already-installed signal handler.

Also drop the diagnostic log added in 9c83ea6, now that we have the fix.

Regression test: `TestStartupSyncDeferred.test_main_schedules_timer_does_not_call_sync_synchronously` mocks Popen + Timer + _push_size and asserts that `main()` schedules the timer with the documented delay constant and never invokes _push_size synchronously. This catches a "let's just inline the sync() call" regression immediately. 638 unit tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit was merged in pull request #83.
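The deferred sync described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code: `schedule_startup_sync` and `STARTUP_SYNC_DELAY_SEC` are hypothetical names (the test only confirms that a `_STARTUP_SYNC_DELAY_SEC` constant and a `_push_size` function exist in `pty_resize`), and the 2.0 value is taken from the "2-second" wording above.

```python
import threading

# Assumed name and value: the commit message states a 2-second delay, and the
# regression test refers to pty_resize._STARTUP_SYNC_DELAY_SEC.
STARTUP_SYNC_DELAY_SEC = 2.0


def schedule_startup_sync(push_size, delay=STARTUP_SYNC_DELAY_SEC):
    """Run push_size() once, `delay` seconds from now, on a daemon thread.

    Returns the started timer so a caller could cancel() it if the child
    process exits before the timer fires.
    """
    timer = threading.Timer(delay, push_size)
    # Daemon thread: a pending timer must not keep the interpreter alive
    # when the child finishes quickly.
    timer.daemon = True
    timer.start()
    return timer
```

Setting `daemon = True` before `start()` matters here: a non-daemon timer would hold the process open for the remainder of the delay even after the child has already exited.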
@@ -10,6 +10,7 @@ from __future__ import annotations

import io
import unittest
import unittest.mock
from unittest.mock import patch

from claude_bottle.backend.smolmachines import pty_resize
@@ -126,5 +127,38 @@ class TestMainArgvParsing(unittest.TestCase):
        self.assertEqual(2, pty_resize.main(["m", "--"]))


class TestStartupSyncDeferred(unittest.TestCase):
    """Regression: the initial sync MUST be deferred (timer), not
    called synchronously between Popen + wait. Calling it
    immediately races libkrun's per-exec OCI config write during
    the main exec's bringup and crashes the child (rc=137 or
    'parse error: trailing garbage')."""

    def test_main_schedules_timer_does_not_call_sync_synchronously(self):
        # Fake Popen + wait so main returns immediately. Patch
        # Timer to record args without spawning a real thread.
        # _push_size patched so any rogue synchronous call would
        # be observable.
        fake_proc = unittest.mock.MagicMock()
        fake_proc.wait.return_value = 0
        with patch.object(
            pty_resize.subprocess, "Popen", return_value=fake_proc,
        ), patch.object(
            pty_resize.threading, "Timer",
        ) as timer_cls, patch.object(
            pty_resize, "_push_size",
        ) as push:
            rc = pty_resize.main(["machine-name", "--", "echo", "hi"])

        self.assertEqual(0, rc)
        # Timer scheduled with the documented delay constant.
        timer_cls.assert_called_once()
        delay, callback = timer_cls.call_args.args
        self.assertEqual(pty_resize._STARTUP_SYNC_DELAY_SEC, delay)
        # _push_size never called synchronously — the only path to
        # it is via the (mocked) timer's callback firing.
        push.assert_not_called()


if __name__ == "__main__":
    unittest.main()
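The test above unpacks `callback` from the mocked Timer's arguments but does not invoke it. As a hedged illustration of how that recorded callback could be fired by hand to exercise the deferred path, here is a self-contained sketch. `start_deferred_sync` and `DELAY_SEC` are hypothetical stand-ins for the scheduling step inside `pty_resize.main()` and for `_STARTUP_SYNC_DELAY_SEC`; none of this is taken from the real module.

```python
import threading
import unittest.mock
from unittest.mock import patch

DELAY_SEC = 2.0  # stand-in for pty_resize._STARTUP_SYNC_DELAY_SEC (assumed value)


def start_deferred_sync(push):
    # Hypothetical stand-in for the scheduling step inside main().
    timer = threading.Timer(DELAY_SEC, push)
    timer.daemon = True
    timer.start()


def check_deferred_path():
    push = unittest.mock.MagicMock()
    # Replace Timer with a mock so no real thread is spawned and the
    # constructor arguments are recorded.
    with patch.object(threading, "Timer") as timer_cls:
        start_deferred_sync(push)
    delay, callback = timer_cls.call_args.args
    push.assert_not_called()   # nothing ran synchronously
    callback()                 # simulate the timer firing
    push.assert_called_once()  # the deferred path reaches push
    return delay
```

Firing the callback manually keeps the check deterministic: no real 2-second wait and no background thread are involved.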