docs(prd): PRD 0032 — smolmachines launch decomposition

Split launch() into named per-step helpers, replace time.sleep(1.5) with a readiness poll, and file-lock loopback alias allocation. Addresses the three actionable items from the #117 hotspot review of smolmachines/launch.py.
2026-06-02 06:14:16 +00:00
parent 07c8593999
commit fe97b6014d
1 changed files with 221 additions and 0 deletions
@@ -0,0 +1,221 @@
+# PRD 0032: Decompose smolmachines launch and harden bringup sequencing
+
+- **Status:** Draft
+- **Author:** didericis-claude
+- **Created:** 2026-06-02
+- **Issue:** #122
+
+## Summary
+
+Split `launch()` into named per-step helpers, replace the empirical
+`time.sleep(1.5)` with a readiness poll, and file-lock loopback alias
+allocation. Addresses the three actionable issues from the #117 hotspot
+review of `smolmachines/launch.py`.
+
+## Problem
+
+### 1. `launch()` step ordering
+
+`launch()` in `smolmachines/launch.py` is 207 lines. Seven sequenced
+steps are marked by numbered inline comments (`# 1. Reserve a loopback
+alias`, `# 2. Mint per-bottle CAs`, ...) — the sequencing is
+load-bearing (CA paths must be filled before the bundle spec is built;
+the bundle must be running before port discovery; the VM must be created
+before the allowlist is patched), but the dependencies are enforced only
+by linear ordering within one function. Adding a new daemon, changing
+the port-forward strategy, or debugging a bringup failure requires
+reading the whole function to understand what state each step produces.
+Each step is also not individually testable without mocking the entire
+surrounding context.
+
+### 2. `time.sleep(1.5)` for libkrun exec-channel race
+
+After `machine_start`, back-to-back `machine_exec` calls occasionally
+hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is
+documented as "1.5s is empirically enough; provisioning already takes
+seconds so the wait is amortized." The failure mode if the sleep is
+insufficient: the filesystem-repair exec (`chown -R node:node /home/node`)
+is SIGKILLed silently, and the agent later bails with `ENOENT`/`EPERM`
+when Claude Code tries to write to `~/.claude.json`. A poll-until-ready
+loop is more robust than a fixed duration: it exits as soon as the exec
+channel is up, fails loudly with a timeout if the VM never becomes
+responsive, and is self-documenting about what it is waiting for.
+
+### 3. Loopback alias allocation is not concurrent-safe
+
+`loopback_alias.allocate()` reads docker container state to determine
+which aliases are already in use, then returns the lowest free alias.
+There is no lock between that read and the bundle's `docker run` (which
+creates the container that will appear in future `docker ps` output). Two
+simultaneous bottle launches can both see the same alias as free and
+claim it, causing both bundles to bind on the same loopback IP. On macOS,
+where users occasionally start multiple agents in quick succession, this
+is a realistic failure mode.
+
+## Non-goals
+
+- Removing `force_allowlist` / the `--allow-cidr` DB patch. That is a
+  workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm
+  honors the CLI flag upstream.
+- Changing the ephemeral registry / crane detour in `local_registry.py`.
+  Required by Docker Desktop's network topology.
+- Changing `_ensure_smolmachine`'s cache design. Cache invalidation by
+  docker image ID works; issue #111 tracks a separate stale-sidecar
+  concern.
+
+## Design
+
+### 1. Decompose `launch()` into named helpers
+
+Extract six focused helpers. `launch()` becomes a coordinator that calls
+them in order, passing the `ExitStack` for teardown registration:
+
+```
+_allocate_resources(plan, stack) → (loopback_ip, network)
+```
+Reserve the loopback alias, create the docker bridge network, register
+teardown callbacks for both.
+
+```
+_mint_certs(plan) → plan
+```
+Pipelock TLS init (always). Egress TLS init when `plan.egress_plan.routes`
+is non-empty. Returns the plan with CA paths filled via
+`dataclasses.replace`.
+
+```
+_start_bundle(plan, network, loopback_ip, stack) → plan
+```
+Build the `BundleLaunchSpec`, resolve token env, start the bundle
+container, register teardown. Returns the plan with `bundle_spec` updated
+(or unchanged if no plan field carries it — callers consume `bundle_spec`
+directly from this call's return value if needed).
+
+```
+_discover_urls(plan, loopback_ip) → plan
+```
+Look up host-side ports for the published container ports; assemble
+`agent_proxy_url`, `agent_git_gate_host`, `agent_supervise_url`; stamp
+them onto the plan and into `guest_env`.
+
+```
+_launch_vm(plan, agent_from_path, stack) → None
+```
+`machine_create` + `force_allowlist` + `machine_start`. Register
+`machine_stop` and `machine_delete` teardown callbacks on the stack.
+
+```
+_init_vm(plan) → None
+```
+Filesystem-repair exec (`chown`/`chmod`) followed by
+`_wait_exec_ready()`.
+
+`launch()` reduces to:
+
+```python
+loopback_ip, network = _allocate_resources(plan, stack)
+plan = _mint_certs(plan)
+plan = _start_bundle(plan, network, loopback_ip, stack)
+plan = _discover_urls(plan, loopback_ip)
+agent_from_path = _ensure_smolmachine(plan.agent_image_ref,
+                                      dockerfile=plan.agent_dockerfile_path)
+_launch_vm(plan, agent_from_path, stack)
+_init_vm(plan)
+prompt_path = provision(plan, plan.machine_name)
+yield SmolmachinesBottle(...)
+```
+
+Each helper's inputs and outputs are explicit; each is independently
+testable with a minimal set of mocks.
+
+### 2. Replace `time.sleep(1.5)` with `_wait_exec_ready`
+
+Add to `smolvm.py`:
+
+```python
+def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None:
+    """Poll until `machine exec true` exits 0 or `timeout` elapses.
+    Replaces a fixed sleep after machine_start for the libkrun
+    exec-channel warm-up race."""
+    deadline = time.monotonic() + timeout
+    delay = 0.1
+    while time.monotonic() < deadline:
+        r = machine_exec(name, ["true"])
+        if r.returncode == 0:
+            return
+        remaining = deadline - time.monotonic()
+        if remaining <= 0:
+            break
+        time.sleep(min(delay, remaining))
+        delay = min(delay * 2, 0.5)
+    die(
+        f"smolvm machine {name!r}: exec channel not ready after "
+        f"{timeout:.0f}s — VM may have failed to boot."
+    )
+```
+
+`_init_vm` calls `wait_exec_ready` after the chown/chmod exec instead of
+`time.sleep(1.5)`. The `time` import in `launch.py` is removed.
+
+### 3. File-lock loopback alias allocation
+
+Add to `loopback_alias.py`:
+
+```python
+import fcntl
+
+_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock"
+
+def allocate(slug: str) -> str:
+    if not _is_macos():
+        return "127.0.0.1"
+    _ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
+    with open(_ALLOC_LOCK_PATH, "w") as lf:
+        fcntl.flock(lf, fcntl.LOCK_EX)
+        return _allocate_locked(slug)
+
+def _allocate_locked(slug: str) -> str:
+    in_use = _aliases_in_use()
+    for ip in _pool_addresses():
+        if ip not in in_use:
+            return ip
+    die(...)
+    return ""
+```
+
+The lock is held only for the duration of `_aliases_in_use()` + the
+`allocate` return. The bundle's `docker run` runs after the lock is
+released. This is sufficient: once `docker run` returns, the container
+is visible in docker state and future `allocate()` calls will see it.
+The remaining window (lock released → container appears in docker state)
+is narrowed from "the entire bringup sequence" to "a single subprocess
+call," making a collision between two concurrent launches effectively
+impossible in practice.
+
+The lock is a no-op on Linux (the `_is_macos()` early-return fires
+before the lock path is opened).
+
+## Test impact
+
+- Unit tests for each extracted helper can mock one subprocess boundary
+  at a time (smolvm, docker, pipelock TLS init) without wiring the full
+  `launch()` ExitStack.
+- `wait_exec_ready` needs a test with `machine_exec` stubbed to return
+  non-zero N times before 0 — verifies the backoff loop and the timeout
+  die path.
+- `allocate` tests are unchanged in shape; the lock is acquired and
+  released within the call so tests don't need to be aware of it.
+
+## Implementation chunks
+
+1. **PRD (this commit).** Sets the design.
+2. **Decompose `launch()`.**
+3. **Replace sleep with `wait_exec_ready`.**
+4. **File-lock `allocate()`.**
+5. **Tests.** Unit tests for each helper; `wait_exec_ready` backoff + timeout.
+
+## References
+
+- Issue #122: Decompose smolmachines launch and harden bringup sequencing.
+- Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding.
+- Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).