From fe97b6014d44fc55cef107ed2ccd8b62c6bafd59 Mon Sep 17 00:00:00 2001 From: claude Date: Tue, 2 Jun 2026 06:14:16 +0000 Subject: [PATCH] =?UTF-8?q?docs(prd):=20PRD=200032=20=E2=80=94=20smolmachi?= =?UTF-8?q?nes=20launch=20decomposition?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Split launch() into named per-step helpers, replace time.sleep(1.5) with a readiness poll, and file-lock loopback alias allocation. Addresses the three actionable items from the #117 hotspot review of smolmachines/launch.py. --- .../0032-smolmachines-launch-decomposition.md | 221 ++++++++++++++++++ 1 file changed, 221 insertions(+) create mode 100644 docs/prds/0032-smolmachines-launch-decomposition.md diff --git a/docs/prds/0032-smolmachines-launch-decomposition.md b/docs/prds/0032-smolmachines-launch-decomposition.md new file mode 100644 index 0000000..1173f24 --- /dev/null +++ b/docs/prds/0032-smolmachines-launch-decomposition.md @@ -0,0 +1,221 @@ +# PRD 0032: Decompose smolmachines launch and harden bringup sequencing + +- **Status:** Draft +- **Author:** didericis-claude +- **Created:** 2026-06-02 +- **Issue:** #122 + +## Summary + +Split `launch()` into named per-step helpers, replace the empirical +`time.sleep(1.5)` with a readiness poll, and file-lock loopback alias +allocation. Addresses the three actionable issues from the #117 hotspot +review of `smolmachines/launch.py`. + +## Problem + +### 1. `launch()` step ordering + +`launch()` in `smolmachines/launch.py` is 207 lines. Seven sequenced +steps are marked by numbered inline comments (`# 1. Reserve a loopback +alias`, `# 2. Mint per-bottle CAs`, ...) — the sequencing is +load-bearing (CA paths must be filled before the bundle spec is built; +the bundle must be running before port discovery; the VM must be created +before the allowlist is patched), but the dependencies are enforced only +by linear ordering within one function. Adding a new daemon, changing +the port-forward strategy, or debugging a bringup failure requires +reading the whole function to understand what state each step produces. +Each step is also not individually testable without mocking the entire +surrounding context. + +### 2. `time.sleep(1.5)` for libkrun exec-channel race + +After `machine_start`, back-to-back `machine_exec` calls occasionally +hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is +documented as "1.5s is empirically enough; provisioning already takes +seconds so the wait is amortized." The failure mode if the sleep is +insufficient: the filesystem-repair exec (`chown -R node:node /home/node`) +is SIGKILLed silently, and the agent later bails with `ENOENT`/`EPERM` +when Claude Code tries to write to `~/.claude.json`. A poll-until-ready +loop is more robust than a fixed duration: it exits as soon as the exec +channel is up, fails loudly with a timeout if the VM never becomes +responsive, and is self-documenting about what it is waiting for. + +### 3. Loopback alias allocation is not concurrent-safe + +`loopback_alias.allocate()` reads docker container state to determine +which aliases are already in use, then returns the lowest free alias. +There is no lock between that read and the bundle's `docker run` (which +creates the container that will appear in future `docker ps` output). Two +simultaneous bottle launches can both see the same alias as free and +claim it, causing both bundles to bind on the same loopback IP. On macOS, +where users occasionally start multiple agents in quick succession, this +is a realistic failure mode. + +## Non-goals + +- Removing `force_allowlist` / the `--allow-cidr` DB patch. That is a + workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm + honors the CLI flag upstream. +- Changing the ephemeral registry / crane detour in `local_registry.py`. + Required by Docker Desktop's network topology. +- Changing `_ensure_smolmachine`'s cache design. Cache invalidation by + docker image ID works; issue #111 tracks a separate stale-sidecar + concern. + +## Design + +### 1. Decompose `launch()` into named helpers + +Extract six focused helpers. `launch()` becomes a coordinator that calls +them in order, passing the `ExitStack` for teardown registration: + +``` +_allocate_resources(plan, stack) → (loopback_ip, network) +``` +Reserve the loopback alias, create the docker bridge network, register +teardown callbacks for both. + +``` +_mint_certs(plan) → plan +``` +Pipelock TLS init (always). Egress TLS init when `plan.egress_plan.routes` +is non-empty. Returns the plan with CA paths filled via +`dataclasses.replace`. + +``` +_start_bundle(plan, network, loopback_ip, stack) → plan +``` +Build the `BundleLaunchSpec`, resolve token env, start the bundle +container, register teardown. Returns the plan with `bundle_spec` updated +(or unchanged if no plan field carries it — callers consume `bundle_spec` +directly from this call's return value if needed). + +``` +_discover_urls(plan, loopback_ip) → plan +``` +Look up host-side ports for the published container ports; assemble +`agent_proxy_url`, `agent_git_gate_host`, `agent_supervise_url`; stamp +them onto the plan and into `guest_env`. + +``` +_launch_vm(plan, agent_from_path, stack) → None +``` +`machine_create` + `force_allowlist` + `machine_start`. Register +`machine_stop` and `machine_delete` teardown callbacks on the stack. + +``` +_init_vm(plan) → None +``` +Filesystem-repair exec (`chown`/`chmod`) followed by +`_wait_exec_ready()`. + +`launch()` reduces to: + +```python +loopback_ip, network = _allocate_resources(plan, stack) +plan = _mint_certs(plan) +plan = _start_bundle(plan, network, loopback_ip, stack) +plan = _discover_urls(plan, loopback_ip) +agent_from_path = _ensure_smolmachine(plan.agent_image_ref, + dockerfile=plan.agent_dockerfile_path) +_launch_vm(plan, agent_from_path, stack) +_init_vm(plan) +prompt_path = provision(plan, plan.machine_name) +yield SmolmachinesBottle(...) +``` + +Each helper's inputs and outputs are explicit; each is independently +testable with a minimal set of mocks. + +### 2. Replace `time.sleep(1.5)` with `_wait_exec_ready` + +Add to `smolvm.py`: + +```python +def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None: + """Poll until `machine exec true` exits 0 or `timeout` elapses. + Replaces a fixed sleep after machine_start for the libkrun + exec-channel warm-up race.""" + deadline = time.monotonic() + timeout + delay = 0.1 + while time.monotonic() < deadline: + r = machine_exec(name, ["true"]) + if r.returncode == 0: + return + remaining = deadline - time.monotonic() + if remaining <= 0: + break + time.sleep(min(delay, remaining)) + delay = min(delay * 2, 0.5) + die( + f"smolvm machine {name!r}: exec channel not ready after " + f"{timeout:.0f}s — VM may have failed to boot." + ) +``` + +`_init_vm` calls `wait_exec_ready` after the chown/chmod exec instead of +`time.sleep(1.5)`. The `time` import in `launch.py` is removed. + +### 3. File-lock loopback alias allocation + +Add to `loopback_alias.py`: + +```python +import fcntl + +_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock" + +def allocate(slug: str) -> str: + if not _is_macos(): + return "127.0.0.1" + _ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True) + with open(_ALLOC_LOCK_PATH, "w") as lf: + fcntl.flock(lf, fcntl.LOCK_EX) + return _allocate_locked(slug) + +def _allocate_locked(slug: str) -> str: + in_use = _aliases_in_use() + for ip in _pool_addresses(): + if ip not in in_use: + return ip + die(...) + return "" +``` + +The lock is held only for the duration of `_aliases_in_use()` + the +`allocate` return. The bundle's `docker run` runs after the lock is +released. This is sufficient: once `docker run` returns, the container +is visible in docker state and future `allocate()` calls will see it. +The remaining window (lock released → container appears in docker state) +is narrowed from "the entire bringup sequence" to "a single subprocess +call," making a collision between two concurrent launches effectively +impossible in practice. + +The lock is a no-op on Linux (the `_is_macos()` early-return fires +before the lock path is opened). + +## Test impact + +- Unit tests for each extracted helper can mock one subprocess boundary + at a time (smolvm, docker, pipelock TLS init) without wiring the full + `launch()` ExitStack. +- `wait_exec_ready` needs a test with `machine_exec` stubbed to return + non-zero N times before 0 — verifies the backoff loop and the timeout + die path. +- `allocate` tests are unchanged in shape; the lock is acquired and + released within the call so tests don't need to be aware of it. + +## Implementation chunks + +1. **PRD (this commit).** Sets the design. +2. **Decompose `launch()`.** +3. **Replace sleep with `wait_exec_ready`.** +4. **File-lock `allocate()`.** +5. **Tests.** Unit tests for each helper; `wait_exec_ready` backoff + timeout. + +## References + +- Issue #122: Decompose smolmachines launch and harden bringup sequencing. +- Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding. +- Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).