bot-bottle/docs/prds/0032-smolmachines-launch-decomposition.md

# PRD 0032: Decompose smolmachines launch and harden bringup sequencing

- **Status:** Active
- **Author:** didericis-claude
- **Created:** 2026-06-02
- **Issue:** #122

## Summary

Split `launch()` into named per-step helpers, replace the empirical
`time.sleep(1.5)` with a readiness poll, and file-lock loopback alias
allocation. Addresses the three actionable issues from the #117 hotspot
review of `smolmachines/launch.py`.

## Problem

### 1. `launch()` step ordering

`launch()` in `smolmachines/launch.py` is 207 lines. Seven sequenced
steps are marked by numbered inline comments (`# 1. Reserve a loopback
alias`, `# 2. Mint per-bottle CAs`, ...) — the sequencing is
load-bearing (CA paths must be filled before the bundle spec is built;
the bundle must be running before port discovery; the VM must be created
before the allowlist is patched), but the dependencies are enforced only
by linear ordering within one function. Adding a new daemon, changing
the port-forward strategy, or debugging a bringup failure requires
reading the whole function to understand what state each step produces.
Each step is also not individually testable without mocking the entire
surrounding context.

### 2. `time.sleep(1.5)` for libkrun exec-channel race

After `machine_start`, back-to-back `machine_exec` calls occasionally
hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is
documented as "1.5s is empirically enough; provisioning already takes
seconds so the wait is amortized." The failure mode if the sleep is
insufficient: the filesystem-repair exec (`chown -R node:node /home/node`)
is SIGKILLed silently, and the agent later bails with `ENOENT`/`EPERM`
when Claude Code tries to write to `~/.claude.json`. A poll-until-ready
loop is more robust than a fixed duration: it exits as soon as the exec
channel is up, fails loudly with a timeout if the VM never becomes
responsive, and is self-documenting about what it is waiting for.

### 3. Loopback alias allocation is not concurrent-safe

`loopback_alias.allocate()` reads docker container state to determine
which aliases are already in use, then returns the lowest free alias.
There is no lock between that read and the bundle's `docker run` (which
creates the container that will appear in future `docker ps` output). Two
simultaneous bottle launches can both see the same alias as free and
claim it, causing both bundles to bind on the same loopback IP. On macOS,
where users occasionally start multiple agents in quick succession, this
is a realistic failure mode.

## Non-goals

- Removing `force_allowlist` / the `--allow-cidr` DB patch. That is a
  workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm
  honors the CLI flag upstream.
- Changing the ephemeral registry / crane detour in `local_registry.py`.
  Required by Docker Desktop's network topology.
- Changing `_ensure_smolmachine`'s cache design. Cache invalidation by
  docker image ID works; issue #111 tracks a separate stale-sidecar
  concern.

## Design

### 1. Decompose `launch()` into named helpers

Extract six focused helpers. `launch()` becomes a coordinator that calls
them in order, passing the `ExitStack` for teardown registration:

```
_allocate_resources(plan, stack) → (loopback_ip, network)
```
Reserve the loopback alias, create the docker bridge network, register
teardown callbacks for both.

```
_mint_certs(plan) → plan
```
Pipelock TLS init (always). Egress TLS init when `plan.egress_plan.routes`
is non-empty. Returns the plan with CA paths filled via
`dataclasses.replace`.

```
_start_bundle(plan, network, loopback_ip, stack) → plan
```
Build the `BundleLaunchSpec`, resolve token env, start the bundle
container, register teardown. Returns the plan with `bundle_spec` updated
(or unchanged if no plan field carries it — callers consume `bundle_spec`
directly from this call's return value if needed).

```
_discover_urls(plan, loopback_ip) → plan
```
Look up host-side ports for the published container ports; assemble
`agent_proxy_url`, `agent_git_gate_host`, `agent_supervise_url`; stamp
them onto the plan and into `guest_env`.

```
_launch_vm(plan, agent_from_path, stack) → None
```
`machine_create` + `force_allowlist` + `machine_start`. Register
`machine_stop` and `machine_delete` teardown callbacks on the stack.

```
_init_vm(plan) → None
```
Filesystem-repair exec (`chown`/`chmod`) followed by
`_wait_exec_ready()`.

`launch()` reduces to:

```python
loopback_ip, network = _allocate_resources(plan, stack)
plan = _mint_certs(plan)
plan = _start_bundle(plan, network, loopback_ip, stack)
plan = _discover_urls(plan, loopback_ip)
agent_from_path = _ensure_smolmachine(plan.agent_image_ref,
                                      dockerfile=plan.agent_dockerfile_path)
_launch_vm(plan, agent_from_path, stack)
_init_vm(plan)
prompt_path = provision(plan, plan.machine_name)
yield SmolmachinesBottle(...)
```

Each helper's inputs and outputs are explicit; each is independently
testable with a minimal set of mocks.

### 2. Replace `time.sleep(1.5)` with `_wait_exec_ready`

Add to `smolvm.py`:

```python
def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None:
    """Poll until `machine exec true` exits 0 or `timeout` elapses.
    Replaces a fixed sleep after machine_start for the libkrun
    exec-channel warm-up race."""
    deadline = time.monotonic() + timeout
    delay = 0.1
    while time.monotonic() < deadline:
        r = machine_exec(name, ["true"])
        if r.returncode == 0:
            return
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, 0.5)
    die(
        f"smolvm machine {name!r}: exec channel not ready after "
        f"{timeout:.0f}s — VM may have failed to boot."
    )
```

`_init_vm` calls `wait_exec_ready` after the chown/chmod exec instead of
`time.sleep(1.5)`. The `time` import in `launch.py` is removed.

### 3. File-lock loopback alias allocation

Add to `loopback_alias.py`:

```python
import fcntl

_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock"

def allocate(slug: str) -> str:
    if not _is_macos():
        return "127.0.0.1"
    _ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(_ALLOC_LOCK_PATH, "w") as lf:
        fcntl.flock(lf, fcntl.LOCK_EX)
        return _allocate_locked(slug)

def _allocate_locked(slug: str) -> str:
    in_use = _aliases_in_use()
    for ip in _pool_addresses():
        if ip not in in_use:
            return ip
    die(...)
    return ""
```

The lock is held only for the duration of `_aliases_in_use()` + the
`allocate` return. The bundle's `docker run` runs after the lock is
released. This is sufficient: once `docker run` returns, the container
is visible in docker state and future `allocate()` calls will see it.
The remaining window (lock released → container appears in docker state)
is narrowed from "the entire bringup sequence" to "a single subprocess
call," making a collision between two concurrent launches effectively
impossible in practice.

The lock is a no-op on Linux (the `_is_macos()` early-return fires
before the lock path is opened).

## Test impact

- Unit tests for each extracted helper can mock one subprocess boundary
  at a time (smolvm, docker, pipelock TLS init) without wiring the full
  `launch()` ExitStack.
- `wait_exec_ready` needs a test with `machine_exec` stubbed to return
  non-zero N times before 0 — verifies the backoff loop and the timeout
  die path.
- `allocate` tests are unchanged in shape; the lock is acquired and
  released within the call so tests don't need to be aware of it.

## Implementation chunks

1. **PRD (this commit).** Sets the design.
2. **Decompose `launch()`.**
3. **Replace sleep with `wait_exec_ready`.**
4. **File-lock `allocate()`.**
5. **Tests.** Unit tests for each helper; `wait_exec_ready` backoff + timeout.

## References

- Issue #122: Decompose smolmachines launch and harden bringup sequencing.
- Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding.
- Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).