Files
bot-bottle/docs/prds/0032-smolmachines-launch-decomposition.md
T
didericis-claude c39bbe265b
test / unit (pull_request) Successful in 39s
test / integration (pull_request) Successful in 58s
complete(prd): mark PRD 0032 active
All three issues implemented and 805 tests passing.
2026-06-02 06:23:46 +00:00

8.0 KiB

PRD 0032: Decompose smolmachines launch and harden bringup sequencing

  • Status: Active
  • Author: didericis-claude
  • Created: 2026-06-02
  • Issue: #122

Summary

Split launch() into named per-step helpers, replace the empirical time.sleep(1.5) with a readiness poll, and file-lock loopback alias allocation. Addresses the three actionable issues from the #117 hotspot review of smolmachines/launch.py.

Problem

1. launch() step ordering

launch() in smolmachines/launch.py is 207 lines. Seven sequenced steps are marked by numbered inline comments (# 1. Reserve a loopback alias, # 2. Mint per-bottle CAs, ...) — the sequencing is load-bearing (CA paths must be filled before the bundle spec is built; the bundle must be running before port discovery; the VM must be created before the allowlist is patched), but the dependencies are enforced only by linear ordering within one function. Adding a new daemon, changing the port-forward strategy, or debugging a bringup failure requires reading the whole function to understand what state each step produces. Each step is also not individually testable without mocking the entire surrounding context.

2. time.sleep(1.5) for libkrun exec-channel race

After machine_start, back-to-back machine_exec calls occasionally hit a SIGKILL in libkrun's exec channel at ~100ms. The sleep is documented as "1.5s is empirically enough; provisioning already takes seconds so the wait is amortized." The failure mode if the sleep is insufficient: the filesystem-repair exec (chown -R node:node /home/node) is SIGKILLed silently, and the agent later bails with ENOENT/EPERM when Claude Code tries to write to ~/.claude.json. A poll-until-ready loop is more robust than a fixed duration: it exits as soon as the exec channel is up, fails loudly with a timeout if the VM never becomes responsive, and is self-documenting about what it is waiting for.

3. Loopback alias allocation is not concurrent-safe

loopback_alias.allocate() reads docker container state to determine which aliases are already in use, then returns the lowest free alias. There is no lock between that read and the bundle's docker run (which creates the container that will appear in future docker ps output). Two simultaneous bottle launches can both see the same alias as free and claim it, causing both bundles to bind on the same loopback IP. On macOS, where users occasionally start multiple agents in quick succession, this is a realistic failure mode.

Non-goals

  • Removing force_allowlist / the --allow-cidr DB patch. That is a workaround for a smolvm 0.8.0 bug; removal is a one-liner when smolvm honors the CLI flag upstream.
  • Changing the ephemeral registry / crane detour in local_registry.py. Required by Docker Desktop's network topology.
  • Changing _ensure_smolmachine's cache design. Cache invalidation by docker image ID works; issue #111 tracks a separate stale-sidecar concern.

Design

1. Decompose launch() into named helpers

Extract six focused helpers. launch() becomes a coordinator that calls them in order, passing the ExitStack for teardown registration:

_allocate_resources(plan, stack) → (loopback_ip, network)

Reserve the loopback alias, create the docker bridge network, register teardown callbacks for both.

_mint_certs(plan) → plan

Pipelock TLS init (always). Egress TLS init when plan.egress_plan.routes is non-empty. Returns the plan with CA paths filled via dataclasses.replace.

_start_bundle(plan, network, loopback_ip, stack) → plan

Build the BundleLaunchSpec, resolve token env, start the bundle container, register teardown. Returns the plan with bundle_spec updated (or unchanged if no plan field carries it — callers consume bundle_spec directly from this call's return value if needed).

_discover_urls(plan, loopback_ip) → plan

Look up host-side ports for the published container ports; assemble agent_proxy_url, agent_git_gate_host, agent_supervise_url; stamp them onto the plan and into guest_env.

_launch_vm(plan, agent_from_path, stack) → None

machine_create + force_allowlist + machine_start. Register machine_stop and machine_delete teardown callbacks on the stack.

_init_vm(plan) → None

Filesystem-repair exec (chown/chmod) followed by _wait_exec_ready().

launch() reduces to:

loopback_ip, network = _allocate_resources(plan, stack)
plan = _mint_certs(plan)
plan = _start_bundle(plan, network, loopback_ip, stack)
plan = _discover_urls(plan, loopback_ip)
agent_from_path = _ensure_smolmachine(plan.agent_image_ref,
                                      dockerfile=plan.agent_dockerfile_path)
_launch_vm(plan, agent_from_path, stack)
_init_vm(plan)
prompt_path = provision(plan, plan.machine_name)
yield SmolmachinesBottle(...)

Each helper's inputs and outputs are explicit; each is independently testable with a minimal set of mocks.

2. Replace time.sleep(1.5) with _wait_exec_ready

Add to smolvm.py:

def wait_exec_ready(name: str, *, timeout: float = 5.0) -> None:
    """Poll until `machine exec true` exits 0 or `timeout` elapses.
    Replaces a fixed sleep after machine_start for the libkrun
    exec-channel warm-up race."""
    deadline = time.monotonic() + timeout
    delay = 0.1
    while time.monotonic() < deadline:
        r = machine_exec(name, ["true"])
        if r.returncode == 0:
            return
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, 0.5)
    die(
        f"smolvm machine {name!r}: exec channel not ready after "
        f"{timeout:.0f}s — VM may have failed to boot."
    )

_init_vm calls wait_exec_ready after the chown/chmod exec instead of time.sleep(1.5). The time import in launch.py is removed.

3. File-lock loopback alias allocation

Add to loopback_alias.py:

import fcntl

_ALLOC_LOCK_PATH = Path.home() / ".cache" / "bot-bottle" / "smolmachines.lock"

def allocate(slug: str) -> str:
    if not _is_macos():
        return "127.0.0.1"
    _ALLOC_LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(_ALLOC_LOCK_PATH, "w") as lf:
        fcntl.flock(lf, fcntl.LOCK_EX)
        return _allocate_locked(slug)

def _allocate_locked(slug: str) -> str:
    in_use = _aliases_in_use()
    for ip in _pool_addresses():
        if ip not in in_use:
            return ip
    die(...)
    return ""

The lock is held only for the duration of _aliases_in_use() + the allocate return. The bundle's docker run runs after the lock is released. This is sufficient: once docker run returns, the container is visible in docker state and future allocate() calls will see it. The remaining window (lock released → container appears in docker state) is narrowed from "the entire bringup sequence" to "a single subprocess call," making a collision between two concurrent launches effectively impossible in practice.

The lock is a no-op on Linux (the _is_macos() early-return fires before the lock path is opened).

Test impact

  • Unit tests for each extracted helper can mock one subprocess boundary at a time (smolvm, docker, pipelock TLS init) without wiring the full launch() ExitStack.
  • wait_exec_ready needs a test with machine_exec stubbed to return non-zero N times before 0 — verifies the backoff loop and the timeout die path.
  • allocate tests are unchanged in shape; the lock is acquired and released within the call so tests don't need to be aware of it.

Implementation chunks

  1. PRD (this commit). Sets the design.
  2. Decompose launch().
  3. Replace sleep with wait_exec_ready.
  4. File-lock allocate().
  5. Tests. Unit tests for each helper; wait_exec_ready backoff + timeout.

References

  • Issue #122: Decompose smolmachines launch and harden bringup sequencing.
  • Issue #117: Complexity hotspots — source of the smolmachines/launch.py finding.
  • Issue #111: Smolmachine sidecar doesn't reliably get refreshed (separate, not addressed here).