8.0 KiB
PRD 0032: Decompose smolmachines launch and harden bringup sequencing
- Status: Active
- Author: didericis-claude
- Created: 2026-06-02
- Issue: #122
Summary
Split launch() into named per-step helpers, replace the empirical
time.sleep(1.5) with a readiness poll, and file-lock loopback alias
allocation. Addresses the three actionable issues from the #117 hotspot
review of smolmachines/launch.py.
Problem
1. launch() step ordering
launch() in smolmachines/launch.py is 207 lines. Seven sequenced
steps are marked by numbered inline comments (# 1. Reserve a loopback alias, # 2. Mint per-bottle CAs, ...) — the sequencing is
load-bearing (CA paths must be filled before the bundle spec is built;
the bundle must be running before port discovery; the VM must be created
before the allowlist is patched), but the dependencies are enforced only
by linear ordering within one function. Adding a new daemon, changing
the port-forward strategy, or debugging a bringup failure requires
reading the whole function to understand what state each step produces.
Each step is also not individually testable without mocking the entire
surrounding context.
2. time.sleep(1.5) for libkrun exec-channel race
After machine_start, back-to-back machine_exec calls occasionally
hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is
documented as "1.5s is empirically enough; provisioning already takes
seconds so the wait is amortized." The failure mode if the sleep is
insufficient: the filesystem-repair exec (chown -R node:node /home/node)
is SIGKILLed silently, and the agent later bails with ENOENT/EPERM
when Claude Code tries to write to ~/.claude.json. A poll-until-ready
loop is more robust than a fixed duration: it exits as soon as the exec
channel is up, fails loudly with a timeout if the VM never becomes
responsive, and is self-documenting about what it is waiting for.
3. Loopback alias allocation is not concurrent-safe
loopback_alias.allocate() reads docker container state to determine
which aliases are already in use, then returns the lowest free alias.
There is no lock between that read and the bundle's docker run (which
creates the container that will appear in future docker ps output). Two
simultaneous bottle launches can both see the same alias as free and
claim it, causing both bundles to bind on the same loopback IP. On macOS,
where users occasionally start multiple agents in quick succession, this
is a realistic failure mode.
Non-goals
- Removing
force_allowlist/ the--allow-cidrDB patch. That is a workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm honors the CLI flag upstream. - Changing the ephemeral registry / crane detour in
local_registry.py. Required by Docker Desktop's network topology. - Changing
_ensure_smolmachine's cache design. Cache invalidation by docker image ID works; issue #111 tracks a separate stale-sidecar concern.
Design
1. Decompose launch() into named helpers
Extract six focused helpers. launch() becomes a coordinator that calls
them in order, passing the ExitStack for teardown registration:
_allocate_resources(plan, stack) → (loopback_ip, network)
Reserve the loopback alias, create the docker bridge network, register teardown callbacks for both.
_mint_certs(plan) → plan
Pipelock TLS init (always). Egress TLS init when plan.egress_plan.routes
is non-empty. Returns the plan with CA paths filled via
dataclasses.replace.
_start_bundle(plan, network, loopback_ip, stack) → plan
Build the BundleLaunchSpec, resolve token env, start the bundle
container, register teardown. Returns the plan with bundle_spec updated
(or unchanged if no plan field carries it — callers consume bundle_spec
directly from this call's return value if needed).
_discover_urls(plan, loopback_ip) → plan
Look up host-side ports for the published container ports; assemble
agent_proxy_url, agent_git_gate_host, agent_supervise_url; stamp
them onto the plan and into guest_env.
_launch_vm(plan, agent_from_path, stack) → None
machine_create + force_allowlist + machine_start. Register
machine_stop and machine_delete teardown callbacks on the stack.
_init_vm(plan) → None
Filesystem-repair exec (chown/chmod) followed by
_wait_exec_ready().
launch() reduces to:
loopback_ip, network = _allocate_resources(plan, stack)
plan = _mint_certs(plan)
plan = _start_bundle(plan, network, loopback_ip, stack)
plan = _discover_urls(plan, loopback_ip)
agent_from_path = _ensure_smolmachine(plan.agent_image_ref,
dockerfile=plan.agent_dockerfile_path)
_launch_vm(plan, agent_from_path, stack)
_init_vm(plan)
prompt_path = provision(plan, plan.machine_name)
yield SmolmachinesBottle(...)
Each helper's inputs and outputs are explicit; each is independently testable with a minimal set of mocks.
2. Replace time.sleep(1.5) with _wait_exec_ready
Add to smolvm.py:
def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None:
"""Poll until `machine exec true` exits 0 or `timeout` elapses.
Replaces a fixed sleep after machine_start for the libkrun
exec-channel warm-up race."""
deadline = time.monotonic() + timeout
delay = 0.1
while time.monotonic() < deadline:
r = machine_exec(name, ["true"])
if r.returncode == 0:
return
remaining = deadline - time.monotonic()
if remaining <= 0:
break
time.sleep(min(delay, remaining))
delay = min(delay * 2, 0.5)
die(
f"smolvm machine {name!r}: exec channel not ready after "
f"{timeout:.0f}s — VM may have failed to boot."
)
_init_vm calls wait_exec_ready after the chown/chmod exec instead of
time.sleep(1.5). The time import in launch.py is removed.
3. File-lock loopback alias allocation
Add to loopback_alias.py:
import fcntl
_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock"
def allocate(slug: str) -> str:
if not _is_macos():
return "127.0.0.1"
_ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(_ALLOC_LOCK_PATH, "w") as lf:
fcntl.flock(lf, fcntl.LOCK_EX)
return _allocate_locked(slug)
def _allocate_locked(slug: str) -> str:
in_use = _aliases_in_use()
for ip in _pool_addresses():
if ip not in in_use:
return ip
die(...)
return ""
The lock is held only for the duration of _aliases_in_use() + the
allocate return. The bundle's docker run runs after the lock is
released. This is sufficient: once docker run returns, the container
is visible in docker state and future allocate() calls will see it.
The remaining window (lock released → container appears in docker state)
is narrowed from "the entire bringup sequence" to "a single subprocess
call," making a collision between two concurrent launches effectively
impossible in practice.
The lock is a no-op on Linux (the _is_macos() early-return fires
before the lock path is opened).
Test impact
- Unit tests for each extracted helper can mock one subprocess boundary
at a time (smolvm, docker, pipelock TLS init) without wiring the full
launch()ExitStack. wait_exec_readyneeds a test withmachine_execstubbed to return non-zero N times before 0 — verifies the backoff loop and the timeout die path.allocatetests are unchanged in shape; the lock is acquired and released within the call so tests don't need to be aware of it.
Implementation chunks
- PRD (this commit). Sets the design.
- Decompose
launch(). - Replace sleep with
wait_exec_ready. - File-lock
allocate(). - Tests. Unit tests for each helper;
wait_exec_readybackoff + timeout.
References
- Issue #122: Decompose smolmachines launch and harden bringup sequencing.
- Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding.
- Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).