docs(prd): PRD 0032 — smolmachines launch decomposition
Split launch() into named per-step helpers, replace time.sleep(1.5) with a readiness poll, and file-lock loopback alias allocation. Addresses the three actionable items from the #117 hotspot review of smolmachines/launch.py.
This commit is contained in:
@@ -0,0 +1,221 @@
|
||||
# PRD 0032: Decompose smolmachines launch and harden bringup sequencing
|
||||
|
||||
- **Status:** Draft
|
||||
- **Author:** didericis-claude
|
||||
- **Created:** 2026-06-02
|
||||
- **Issue:** #122
|
||||
|
||||
## Summary
|
||||
|
||||
Split `launch()` into named per-step helpers, replace the empirical
|
||||
`time.sleep(1.5)` with a readiness poll, and file-lock loopback alias
|
||||
allocation. Addresses the three actionable issues from the #117 hotspot
|
||||
review of `smolmachines/launch.py`.
|
||||
|
||||
## Problem
|
||||
|
||||
### 1. `launch()` step ordering
|
||||
|
||||
`launch()` in `smolmachines/launch.py` is 207 lines. Seven sequenced
|
||||
steps are marked by numbered inline comments (`# 1. Reserve a loopback
|
||||
alias`, `# 2. Mint per-bottle CAs`, ...) — the sequencing is
|
||||
load-bearing (CA paths must be filled before the bundle spec is built;
|
||||
the bundle must be running before port discovery; the VM must be created
|
||||
before the allowlist is patched), but the dependencies are enforced only
|
||||
by linear ordering within one function. Adding a new daemon, changing
|
||||
the port-forward strategy, or debugging a bringup failure requires
|
||||
reading the whole function to understand what state each step produces.
|
||||
Each step is also not individually testable without mocking the entire
|
||||
surrounding context.
|
||||
|
||||
### 2. `time.sleep(1.5)` for libkrun exec-channel race
|
||||
|
||||
After `machine_start`, back-to-back `machine_exec` calls occasionally
|
||||
hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is
|
||||
documented as "1.5s is empirically enough; provisioning already takes
|
||||
seconds so the wait is amortized." The failure mode if the sleep is
|
||||
insufficient: the filesystem-repair exec (`chown -R node:node /home/node`)
|
||||
is SIGKILLed silently, and the agent later bails with `ENOENT`/`EPERM`
|
||||
when Claude Code tries to write to `~/.claude.json`. A poll-until-ready
|
||||
loop is more robust than a fixed duration: it exits as soon as the exec
|
||||
channel is up, fails loudly with a timeout if the VM never becomes
|
||||
responsive, and is self-documenting about what it is waiting for.
|
||||
|
||||
### 3. Loopback alias allocation is not concurrent-safe
|
||||
|
||||
`loopback_alias.allocate()` reads docker container state to determine
|
||||
which aliases are already in use, then returns the lowest free alias.
|
||||
There is no lock between that read and the bundle's `docker run` (which
|
||||
creates the container that will appear in future `docker ps` output). Two
|
||||
simultaneous bottle launches can both see the same alias as free and
|
||||
claim it, causing both bundles to bind on the same loopback IP. On macOS,
|
||||
where users occasionally start multiple agents in quick succession, this
|
||||
is a realistic failure mode.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Removing `force_allowlist` / the `--allow-cidr` DB patch. That is a
|
||||
workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm
|
||||
honors the CLI flag upstream.
|
||||
- Changing the ephemeral registry / crane detour in `local_registry.py`.
|
||||
Required by Docker Desktop's network topology.
|
||||
- Changing `_ensure_smolmachine`'s cache design. Cache invalidation by
|
||||
docker image ID works; issue #111 tracks a separate stale-sidecar
|
||||
concern.
|
||||
|
||||
## Design
|
||||
|
||||
### 1. Decompose `launch()` into named helpers
|
||||
|
||||
Extract six focused helpers. `launch()` becomes a coordinator that calls
|
||||
them in order, passing the `ExitStack` for teardown registration:
|
||||
|
||||
```
|
||||
_allocate_resources(plan, stack) → (loopback_ip, network)
|
||||
```
|
||||
Reserve the loopback alias, create the docker bridge network, register
|
||||
teardown callbacks for both.
|
||||
|
||||
```
|
||||
_mint_certs(plan) → plan
|
||||
```
|
||||
Pipelock TLS init (always). Egress TLS init when `plan.egress_plan.routes`
|
||||
is non-empty. Returns the plan with CA paths filled via
|
||||
`dataclasses.replace`.
|
||||
|
||||
```
|
||||
_start_bundle(plan, network, loopback_ip, stack) → plan
|
||||
```
|
||||
Build the `BundleLaunchSpec`, resolve token env, start the bundle
|
||||
container, register teardown. Returns the plan with `bundle_spec` updated
|
||||
(or unchanged if no plan field carries it — callers consume `bundle_spec`
|
||||
directly from this call's return value if needed).
|
||||
|
||||
```
|
||||
_discover_urls(plan, loopback_ip) → plan
|
||||
```
|
||||
Look up host-side ports for the published container ports; assemble
|
||||
`agent_proxy_url`, `agent_git_gate_host`, `agent_supervise_url`; stamp
|
||||
them onto the plan and into `guest_env`.
|
||||
|
||||
```
|
||||
_launch_vm(plan, agent_from_path, stack) → None
|
||||
```
|
||||
`machine_create` + `force_allowlist` + `machine_start`. Register
|
||||
`machine_stop` and `machine_delete` teardown callbacks on the stack.
|
||||
|
||||
```
|
||||
_init_vm(plan) → None
|
||||
```
|
||||
Filesystem-repair exec (`chown`/`chmod`) followed by
|
||||
`_wait_exec_ready()`.
|
||||
|
||||
`launch()` reduces to:
|
||||
|
||||
```python
|
||||
loopback_ip, network = _allocate_resources(plan, stack)
|
||||
plan = _mint_certs(plan)
|
||||
plan = _start_bundle(plan, network, loopback_ip, stack)
|
||||
plan = _discover_urls(plan, loopback_ip)
|
||||
agent_from_path = _ensure_smolmachine(plan.agent_image_ref,
|
||||
dockerfile=plan.agent_dockerfile_path)
|
||||
_launch_vm(plan, agent_from_path, stack)
|
||||
_init_vm(plan)
|
||||
prompt_path = provision(plan, plan.machine_name)
|
||||
yield SmolmachinesBottle(...)
|
||||
```
|
||||
|
||||
Each helper's inputs and outputs are explicit; each is independently
|
||||
testable with a minimal set of mocks.
|
||||
|
||||
### 2. Replace `time.sleep(1.5)` with `_wait_exec_ready`
|
||||
|
||||
Add to `smolvm.py`:
|
||||
|
||||
```python
|
||||
def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None:
|
||||
"""Poll until `machine exec true` exits 0 or `timeout` elapses.
|
||||
Replaces a fixed sleep after machine_start for the libkrun
|
||||
exec-channel warm-up race."""
|
||||
deadline = time.monotonic() + timeout
|
||||
delay = 0.1
|
||||
while time.monotonic() < deadline:
|
||||
r = machine_exec(name, ["true"])
|
||||
if r.returncode == 0:
|
||||
return
|
||||
remaining = deadline - time.monotonic()
|
||||
if remaining <= 0:
|
||||
break
|
||||
time.sleep(min(delay, remaining))
|
||||
delay = min(delay * 2, 0.5)
|
||||
die(
|
||||
f"smolvm machine {name!r}: exec channel not ready after "
|
||||
f"{timeout:.0f}s — VM may have failed to boot."
|
||||
)
|
||||
```
|
||||
|
||||
`_init_vm` calls `wait_exec_ready` after the chown/chmod exec instead of
|
||||
`time.sleep(1.5)`. The `time` import in `launch.py` is removed.
|
||||
|
||||
### 3. File-lock loopback alias allocation
|
||||
|
||||
Add to `loopback_alias.py`:
|
||||
|
||||
```python
|
||||
import fcntl
|
||||
|
||||
_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock"
|
||||
|
||||
def allocate(slug: str) -> str:
|
||||
if not _is_macos():
|
||||
return "127.0.0.1"
|
||||
_ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(_ALLOC_LOCK_PATH, "w") as lf:
|
||||
fcntl.flock(lf, fcntl.LOCK_EX)
|
||||
return _allocate_locked(slug)
|
||||
|
||||
def _allocate_locked(slug: str) -> str:
|
||||
in_use = _aliases_in_use()
|
||||
for ip in _pool_addresses():
|
||||
if ip not in in_use:
|
||||
return ip
|
||||
die(...)
|
||||
return ""
|
||||
```
|
||||
|
||||
The lock is held only for the duration of `_aliases_in_use()` + the
|
||||
`allocate` return. The bundle's `docker run` runs after the lock is
|
||||
released. This is sufficient: once `docker run` returns, the container
|
||||
is visible in docker state and future `allocate()` calls will see it.
|
||||
The remaining window (lock released → container appears in docker state)
|
||||
is narrowed from "the entire bringup sequence" to "a single subprocess
|
||||
call," making a collision between two concurrent launches effectively
|
||||
impossible in practice.
|
||||
|
||||
The lock is a no-op on Linux (the `_is_macos()` early-return fires
|
||||
before the lock path is opened).
|
||||
|
||||
## Test impact
|
||||
|
||||
- Unit tests for each extracted helper can mock one subprocess boundary
|
||||
at a time (smolvm, docker, pipelock TLS init) without wiring the full
|
||||
`launch()` ExitStack.
|
||||
- `wait_exec_ready` needs a test with `machine_exec` stubbed to return
|
||||
non-zero N times before 0 — verifies the backoff loop and the timeout
|
||||
die path.
|
||||
- `allocate` tests are unchanged in shape; the lock is acquired and
|
||||
released within the call so tests don't need to be aware of it.
|
||||
|
||||
## Implementation chunks
|
||||
|
||||
1. **PRD (this commit).** Sets the design.
|
||||
2. **Decompose `launch()`.**
|
||||
3. **Replace sleep with `wait_exec_ready`.**
|
||||
4. **File-lock `allocate()`.**
|
||||
5. **Tests.** Unit tests for each helper; `wait_exec_ready` backoff + timeout.
|
||||
|
||||
## References
|
||||
|
||||
- Issue #122: Decompose smolmachines launch and harden bringup sequencing.
|
||||
- Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding.
|
||||
- Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).
|
||||
Reference in New Issue
Block a user