222 lines
8.0 KiB
Markdown
222 lines
8.0 KiB
Markdown
# PRD 0032: Decompose smolmachines launch and harden bringup sequencing
|
|
|
|
- **Status:** Active
|
|
- **Author:** didericis-claude
|
|
- **Created:** 2026-06-02
|
|
- **Issue:** #122
|
|
|
|
## Summary
|
|
|
|
Split `launch()` into named per-step helpers, replace the empirical
|
|
`time.sleep(1.5)` with a readiness poll, and file-lock loopback alias
|
|
allocation. Addresses the three actionable issues from the #117 hotspot
|
|
review of `smolmachines/launch.py`.
|
|
|
|
## Problem
|
|
|
|
### 1. `launch()` step ordering
|
|
|
|
`launch()` in `smolmachines/launch.py` is 207 lines. Seven sequenced
|
|
steps are marked by numbered inline comments (`# 1. Reserve a loopback
|
|
alias`, `# 2. Mint per-bottle CAs`, ...) — the sequencing is
|
|
load-bearing (CA paths must be filled before the bundle spec is built;
|
|
the bundle must be running before port discovery; the VM must be created
|
|
before the allowlist is patched), but the dependencies are enforced only
|
|
by linear ordering within one function. Adding a new daemon, changing
|
|
the port-forward strategy, or debugging a bringup failure requires
|
|
reading the whole function to understand what state each step produces.
|
|
Each step is also not individually testable without mocking the entire
|
|
surrounding context.
|
|
|
|
### 2. `time.sleep(1.5)` for libkrun exec-channel race
|
|
|
|
After `machine_start`, back-to-back `machine_exec` calls occasionally
|
|
hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is
|
|
documented as "1.5s is empirically enough; provisioning already takes
|
|
seconds so the wait is amortized." The failure mode if the sleep is
|
|
insufficient: the filesystem-repair exec (`chown -R node:node /home/node`)
|
|
is SIGKILLed silently, and the agent later bails with `ENOENT`/`EPERM`
|
|
when Claude Code tries to write to `~/.claude.json`. A poll-until-ready
|
|
loop is more robust than a fixed duration: it exits as soon as the exec
|
|
channel is up, fails loudly with a timeout if the VM never becomes
|
|
responsive, and is self-documenting about what it is waiting for.
|
|
|
|
### 3. Loopback alias allocation is not concurrent-safe
|
|
|
|
`loopback_alias.allocate()` reads docker container state to determine
|
|
which aliases are already in use, then returns the lowest free alias.
|
|
There is no lock between that read and the bundle's `docker run` (which
|
|
creates the container that will appear in future `docker ps` output). Two
|
|
simultaneous bottle launches can both see the same alias as free and
|
|
claim it, causing both bundles to bind on the same loopback IP. On macOS,
|
|
where users occasionally start multiple agents in quick succession, this
|
|
is a realistic failure mode.
|
|
|
|
## Non-goals
|
|
|
|
- Removing `force_allowlist` / the `--allow-cidr` DB patch. That is a
|
|
workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm
|
|
honors the CLI flag upstream.
|
|
- Changing the ephemeral registry / crane detour in `local_registry.py`.
|
|
Required by Docker Desktop's network topology.
|
|
- Changing `_ensure_smolmachine`'s cache design. Cache invalidation by
|
|
docker image ID works; issue #111 tracks a separate stale-sidecar
|
|
concern.
|
|
|
|
## Design
|
|
|
|
### 1. Decompose `launch()` into named helpers
|
|
|
|
Extract six focused helpers. `launch()` becomes a coordinator that calls
|
|
them in order, passing the `ExitStack` for teardown registration:
|
|
|
|
```
|
|
_allocate_resources(plan, stack) → (loopback_ip, network)
|
|
```
|
|
Reserve the loopback alias, create the docker bridge network, register
|
|
teardown callbacks for both.
|
|
|
|
```
|
|
_mint_certs(plan) → plan
|
|
```
|
|
Pipelock TLS init (always). Egress TLS init when `plan.egress_plan.routes`
|
|
is non-empty. Returns the plan with CA paths filled via
|
|
`dataclasses.replace`.
|
|
|
|
```
|
|
_start_bundle(plan, network, loopback_ip, stack) → plan
|
|
```
|
|
Build the `BundleLaunchSpec`, resolve token env, start the bundle
|
|
container, register teardown. Returns the plan with `bundle_spec` updated
|
|
(or unchanged if no plan field carries it — callers consume `bundle_spec`
|
|
directly from this call's return value if needed).
|
|
|
|
```
|
|
_discover_urls(plan, loopback_ip) → plan
|
|
```
|
|
Look up host-side ports for the published container ports; assemble
|
|
`agent_proxy_url`, `agent_git_gate_host`, `agent_supervise_url`; stamp
|
|
them onto the plan and into `guest_env`.
|
|
|
|
```
|
|
_launch_vm(plan, agent_from_path, stack) → None
|
|
```
|
|
`machine_create` + `force_allowlist` + `machine_start`. Register
|
|
`machine_stop` and `machine_delete` teardown callbacks on the stack.
|
|
|
|
```
|
|
_init_vm(plan) → None
|
|
```
|
|
Filesystem-repair exec (`chown`/`chmod`) followed by
|
|
`_wait_exec_ready()`.
|
|
|
|
`launch()` reduces to:
|
|
|
|
```python
|
|
loopback_ip, network = _allocate_resources(plan, stack)
|
|
plan = _mint_certs(plan)
|
|
plan = _start_bundle(plan, network, loopback_ip, stack)
|
|
plan = _discover_urls(plan, loopback_ip)
|
|
agent_from_path = _ensure_smolmachine(plan.agent_image_ref,
|
|
dockerfile=plan.agent_dockerfile_path)
|
|
_launch_vm(plan, agent_from_path, stack)
|
|
_init_vm(plan)
|
|
prompt_path = provision(plan, plan.machine_name)
|
|
yield SmolmachinesBottle(...)
|
|
```
|
|
|
|
Each helper's inputs and outputs are explicit; each is independently
|
|
testable with a minimal set of mocks.
|
|
|
|
### 2. Replace `time.sleep(1.5)` with `_wait_exec_ready`
|
|
|
|
Add to `smolvm.py`:
|
|
|
|
```python
|
|
def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None:
|
|
"""Poll until `machine exec true` exits 0 or `timeout` elapses.
|
|
Replaces a fixed sleep after machine_start for the libkrun
|
|
exec-channel warm-up race."""
|
|
deadline = time.monotonic() + timeout
|
|
delay = 0.1
|
|
while time.monotonic() < deadline:
|
|
r = machine_exec(name, ["true"])
|
|
if r.returncode == 0:
|
|
return
|
|
remaining = deadline - time.monotonic()
|
|
if remaining <= 0:
|
|
break
|
|
time.sleep(min(delay, remaining))
|
|
delay = min(delay * 2, 0.5)
|
|
die(
|
|
f"smolvm machine {name!r}: exec channel not ready after "
|
|
f"{timeout:.0f}s — VM may have failed to boot."
|
|
)
|
|
```
|
|
|
|
`_init_vm` calls `wait_exec_ready` after the chown/chmod exec instead of
|
|
`time.sleep(1.5)`. The `time` import in `launch.py` is removed.
|
|
|
|
### 3. File-lock loopback alias allocation
|
|
|
|
Add to `loopback_alias.py`:
|
|
|
|
```python
|
|
import fcntl
|
|
|
|
_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock"
|
|
|
|
def allocate(slug: str) -> str:
|
|
if not _is_macos():
|
|
return "127.0.0.1"
|
|
_ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
|
|
with open(_ALLOC_LOCK_PATH, "w") as lf:
|
|
fcntl.flock(lf, fcntl.LOCK_EX)
|
|
return _allocate_locked(slug)
|
|
|
|
def _allocate_locked(slug: str) -> str:
|
|
in_use = _aliases_in_use()
|
|
for ip in _pool_addresses():
|
|
if ip not in in_use:
|
|
return ip
|
|
die(...)
|
|
return ""
|
|
```
|
|
|
|
The lock is held only for the duration of `_aliases_in_use()` + the
|
|
`allocate` return. The bundle's `docker run` runs after the lock is
|
|
released. This is sufficient: once `docker run` returns, the container
|
|
is visible in docker state and future `allocate()` calls will see it.
|
|
The remaining window (lock released → container appears in docker state)
|
|
is narrowed from "the entire bringup sequence" to "a single subprocess
|
|
call," making a collision between two concurrent launches effectively
|
|
impossible in practice.
|
|
|
|
The lock is a no-op on Linux (the `_is_macos()` early-return fires
|
|
before the lock path is opened).
|
|
|
|
## Test impact
|
|
|
|
- Unit tests for each extracted helper can mock one subprocess boundary
|
|
at a time (smolvm, docker, pipelock TLS init) without wiring the full
|
|
`launch()` ExitStack.
|
|
- `wait_exec_ready` needs a test with `machine_exec` stubbed to return
|
|
non-zero N times before 0 — verifies the backoff loop and the timeout
|
|
die path.
|
|
- `allocate` tests are unchanged in shape; the lock is acquired and
|
|
released within the call so tests don't need to be aware of it.
|
|
|
|
## Implementation chunks
|
|
|
|
1. **PRD (this commit).** Sets the design.
|
|
2. **Decompose `launch()`.**
|
|
3. **Replace sleep with `wait_exec_ready`.**
|
|
4. **File-lock `allocate()`.**
|
|
5. **Tests.** Unit tests for each helper; `wait_exec_ready` backoff + timeout.
|
|
|
|
## References
|
|
|
|
- Issue #122: Decompose smolmachines launch and harden bringup sequencing.
|
|
- Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding.
|
|
- Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).
|