# PRD 0032: Decompose smolmachines launch and harden bringup sequencing - **Status:** Active - **Author:** didericis-claude - **Created:** 2026-06-02 - **Issue:** #122 ## Summary Split `launch()` into named per-step helpers, replace the empirical `time.sleep(1.5)` with a readiness poll, and file-lock loopback alias allocation. Addresses the three actionable issues from the #117 hotspot review of `smolmachines/launch.py`. ## Problem ### 1. `launch()` step ordering `launch()` in `smolmachines/launch.py` is 207 lines. Seven sequenced steps are marked by numbered inline comments (`# 1. Reserve a loopback alias`, `# 2. Mint per-bottle CAs`, ...) — the sequencing is load-bearing (CA paths must be filled before the bundle spec is built; the bundle must be running before port discovery; the VM must be created before the allowlist is patched), but the dependencies are enforced only by linear ordering within one function. Adding a new daemon, changing the port-forward strategy, or debugging a bringup failure requires reading the whole function to understand what state each step produces. Each step is also not individually testable without mocking the entire surrounding context. ### 2. `time.sleep(1.5)` for libkrun exec-channel race After `machine_start`, back-to-back `machine_exec` calls occasionally hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is documented as "1.5s is empirically enough; provisioning already takes seconds so the wait is amortized." The failure mode if the sleep is insufficient: the filesystem-repair exec (`chown -R node:node /home/node`) is SIGKILLed silently, and the agent later bails with `ENOENT`/`EPERM` when Claude Code tries to write to `~/.claude.json`. A poll-until-ready loop is more robust than a fixed duration: it exits as soon as the exec channel is up, fails loudly with a timeout if the VM never becomes responsive, and is self-documenting about what it is waiting for. ### 3. Loopback alias allocation is not concurrent-safe `loopback_alias.allocate()` reads docker container state to determine which aliases are already in use, then returns the lowest free alias. There is no lock between that read and the bundle's `docker run` (which creates the container that will appear in future `docker ps` output). Two simultaneous bottle launches can both see the same alias as free and claim it, causing both bundles to bind on the same loopback IP. On macOS, where users occasionally start multiple agents in quick succession, this is a realistic failure mode. ## Non-goals - Removing `force_allowlist` / the `--allow-cidr` DB patch. That is a workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm honors the CLI flag upstream. - Changing the ephemeral registry / crane detour in `local_registry.py`. Required by Docker Desktop's network topology. - Changing `_ensure_smolmachine`'s cache design. Cache invalidation by docker image ID works; issue #111 tracks a separate stale-sidecar concern. ## Design ### 1. Decompose `launch()` into named helpers Extract six focused helpers. `launch()` becomes a coordinator that calls them in order, passing the `ExitStack` for teardown registration: ``` _allocate_resources(plan, stack) → (loopback_ip, network) ``` Reserve the loopback alias, create the docker bridge network, register teardown callbacks for both. ``` _mint_certs(plan) → plan ``` Pipelock TLS init (always). Egress TLS init when `plan.egress_plan.routes` is non-empty. Returns the plan with CA paths filled via `dataclasses.replace`. ``` _start_bundle(plan, network, loopback_ip, stack) → plan ``` Build the `BundleLaunchSpec`, resolve token env, start the bundle container, register teardown. Returns the plan with `bundle_spec` updated (or unchanged if no plan field carries it — callers consume `bundle_spec` directly from this call's return value if needed). ``` _discover_urls(plan, loopback_ip) → plan ``` Look up host-side ports for the published container ports; assemble `agent_proxy_url`, `agent_git_gate_host`, `agent_supervise_url`; stamp them onto the plan and into `guest_env`. ``` _launch_vm(plan, agent_from_path, stack) → None ``` `machine_create` + `force_allowlist` + `machine_start`. Register `machine_stop` and `machine_delete` teardown callbacks on the stack. ``` _init_vm(plan) → None ``` Filesystem-repair exec (`chown`/`chmod`) followed by `_wait_exec_ready()`. `launch()` reduces to: ```python loopback_ip, network = _allocate_resources(plan, stack) plan = _mint_certs(plan) plan = _start_bundle(plan, network, loopback_ip, stack) plan = _discover_urls(plan, loopback_ip) agent_from_path = _ensure_smolmachine(plan.agent_image_ref, dockerfile=plan.agent_dockerfile_path) _launch_vm(plan, agent_from_path, stack) _init_vm(plan) prompt_path = provision(plan, plan.machine_name) yield SmolmachinesBottle(...) ``` Each helper's inputs and outputs are explicit; each is independently testable with a minimal set of mocks. ### 2. Replace `time.sleep(1.5)` with `_wait_exec_ready` Add to `smolvm.py`: ```python def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None: """Poll until `machine exec true` exits 0 or `timeout` elapses. Replaces a fixed sleep after machine_start for the libkrun exec-channel warm-up race.""" deadline = time.monotonic() + timeout delay = 0.1 while time.monotonic() < deadline: r = machine_exec(name, ["true"]) if r.returncode == 0: return remaining = deadline - time.monotonic() if remaining <= 0: break time.sleep(min(delay, remaining)) delay = min(delay * 2, 0.5) die( f"smolvm machine {name!r}: exec channel not ready after " f"{timeout:.0f}s — VM may have failed to boot." ) ``` `_init_vm` calls `wait_exec_ready` after the chown/chmod exec instead of `time.sleep(1.5)`. The `time` import in `launch.py` is removed. ### 3. File-lock loopback alias allocation Add to `loopback_alias.py`: ```python import fcntl _ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock" def allocate(slug: str) -> str: if not _is_macos(): return "127.0.0.1" _ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True) with open(_ALLOC_LOCK_PATH, "w") as lf: fcntl.flock(lf, fcntl.LOCK_EX) return _allocate_locked(slug) def _allocate_locked(slug: str) -> str: in_use = _aliases_in_use() for ip in _pool_addresses(): if ip not in in_use: return ip die(...) return "" ``` The lock is held only for the duration of `_aliases_in_use()` + the `allocate` return. The bundle's `docker run` runs after the lock is released. This is sufficient: once `docker run` returns, the container is visible in docker state and future `allocate()` calls will see it. The remaining window (lock released → container appears in docker state) is narrowed from "the entire bringup sequence" to "a single subprocess call," making a collision between two concurrent launches effectively impossible in practice. The lock is a no-op on Linux (the `_is_macos()` early-return fires before the lock path is opened). ## Test impact - Unit tests for each extracted helper can mock one subprocess boundary at a time (smolvm, docker, pipelock TLS init) without wiring the full `launch()` ExitStack. - `wait_exec_ready` needs a test with `machine_exec` stubbed to return non-zero N times before 0 — verifies the backoff loop and the timeout die path. - `allocate` tests are unchanged in shape; the lock is acquired and released within the call so tests don't need to be aware of it. ## Implementation chunks 1. **PRD (this commit).** Sets the design. 2. **Decompose `launch()`.** 3. **Replace sleep with `wait_exec_ready`.** 4. **File-lock `allocate()`.** 5. **Tests.** Unit tests for each helper; `wait_exec_ready` backoff + timeout. ## References - Issue #122: Decompose smolmachines launch and harden bringup sequencing. - Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding. - Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).