PRD 0032: Decompose smolmachines launch and harden bringup sequencing #123

Merged
didericis merged 4 commits from prd-0032 into main 2026-06-02 02:31:37 -04:00
Collaborator

Closes #122.

PRD: https://gitea.dideric.is/didericis/bot-bottle/src/commit/fe97b6014d44fc55cef107ed2ccd8b62c6bafd59/docs/prds/0032-smolmachines-launch-decomposition.md

Summary

  • Decompose the 207-line launch() function into six named per-step helpers with explicit inputs/outputs and individual testability.
  • Replace time.sleep(1.5) (empirical libkrun exec-channel warm-up) with a wait_exec_ready poll-until-ready loop in smolvm.py.
  • File-lock loopback_alias.allocate() to close the concurrent-launch alias-collision window.

Explicitly out of scope: force_allowlist DB patch (waiting on smolvm upstream fix), ephemeral registry dance, _ensure_smolmachine cache.

Changes (1 commit)

  • docs(prd): PRD 0032 — smolmachines launch decomposition
didericis-claude added 1 commit 2026-06-02 02:14:29 -04:00
docs(prd): PRD 0032 — smolmachines launch decomposition
test / unit (pull_request) Successful in 33s
test / integration (pull_request) Successful in 44s
fe97b6014d
Split launch() into named per-step helpers, replace time.sleep(1.5) with
a readiness poll, and file-lock loopback alias allocation. Addresses the
three actionable items from the #117 hotspot review of smolmachines/launch.py.
didericis added 2 commits 2026-06-02 02:23:54 -04:00
Decompose the 207-line launch() into six named helpers: _allocate_resources,
_mint_certs, _start_bundle, _discover_urls, _launch_vm, _init_vm. Each has
explicit inputs/outputs and is independently testable.

Replace time.sleep(1.5) with smolvm.wait_exec_ready(), which polls
`machine exec true` with exponential backoff. Exits as soon as the exec
channel is ready; dies loudly with a timeout message instead of silently
leaving the VM in an unknown state.

File-lock loopback_alias.allocate() with fcntl.flock(LOCK_EX) so concurrent
bottle launches can't race on docker state and claim the same alias.
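Below is a minimal sketch of the locking pattern. It is Unix-only because it uses fcntl. Several names are hypothetical: the lock-file path, the alias pool, and the two callables. `in_use` stands in for reading docker state, and `claim` stands in for whatever makes the chosen alias visible to the next launcher. The property that matters is that the read and the claim both happen while the exclusive flock is held.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Callable, Iterator

# Hypothetical lock path; the real module may use a different location.
LOCK_PATH = "/tmp/bot-bottle-loopback-alias.lock"

# Hypothetical pool of candidate loopback aliases.
CANDIDATES = [f"127.0.0.{n}" for n in range(2, 255)]


@contextmanager
def _exclusive_lock(path: str) -> Iterator[None]:
    """Hold an exclusive advisory flock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def allocate(
    in_use: Callable[[], set[str]],
    claim: Callable[[str], None],
    lock_path: str = LOCK_PATH,
) -> str:
    """Pick the first free alias and claim it, all under one exclusive lock.

    `in_use` and `claim` are hypothetical hooks. If the claim only became
    visible after the lock was released, a second launcher could still
    observe the alias as free and pick it.
    """
    with _exclusive_lock(lock_path):
        taken = in_use()
        for alias in CANDIDATES:
            if alias not in taken:
                claim(alias)
                return alias
    raise RuntimeError("no free loopback alias")
```

flock() is advisory. It only serializes callers that take the same lock, so a code path that picks an alias without going through allocate() is not protected.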
complete(prd): mark PRD 0032 active
test / unit (pull_request) Successful in 39s
test / integration (pull_request) Successful in 58s
c39bbe265b
All three issues implemented and 805 tests passing.
didericis reviewed 2026-06-02 02:27:04 -04:00
@@ -200,0 +218,4 @@
break
time.sleep(min(delay, remaining))
delay = min(delay * 2, 0.5)
die(
Owner

Will this crash the dashboard?

If so, this should be some kind of error we raise instead, and some other handler should decide whether or not it dies.

Author
Collaborator

RE https://gitea.dideric.is/didericis/bot-bottle/pulls/123#issuecomment-1035

die() raises Die(SystemExit), so the dashboard won't hard-crash — Die is specifically designed to be caught and re-surfaced by the curses layer. But you're right that it's the wrong signal here: a timeout deep in bringup should be a plain exception so the caller decides whether it's fatal, rather than implying a process exit from inside a helper.

Will change to raise SmolvmError instead, which is what machine_start already raises and what the launch flow propagates normally.
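One detail explains why the signal matters. `SystemExit` derives from `BaseException`, not `Exception`, so an ordinary `except Exception` handler in a caller will not intercept `die()`. A plain exception lets each caller choose its own policy. The sketch below is hypothetical and only shows the shape of that choice. `SmolvmError` is a local stand-in, and `bring_up_vm`, `cli_launch` and `dashboard_launch` are invented names.

```python
import sys


class SmolvmError(RuntimeError):
    """Stand-in for the project's SmolvmError; the real class may differ."""


def bring_up_vm(ready: bool) -> str:
    """Hypothetical bringup step: raises on failure instead of exiting the process."""
    if not ready:
        raise SmolvmError("exec channel not ready")
    return "running"


def cli_launch(ready: bool) -> int:
    """A CLI caller treats the failure as fatal and maps it to an exit code."""
    try:
        bring_up_vm(ready)
    except SmolvmError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


def dashboard_launch(ready: bool, status: list[str]) -> None:
    """A dashboard caller records the failure and keeps its UI loop alive."""
    try:
        status.append(bring_up_vm(ready))
    except SmolvmError as exc:
        status.append(f"launch failed: {exc}")
```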

didericis added 1 commit 2026-06-02 02:29:12 -04:00
fix(smolmachines): raise SmolvmError instead of die() on wait_exec_ready timeout
test / unit (pull_request) Successful in 39s
test / integration (pull_request) Successful in 58s
test / unit (push) Successful in 38s
test / integration (push) Successful in 55s
a81f0ffa49
die() raises Die(SystemExit), which implies a process exit. A timeout in
wait_exec_ready is a bringup failure — raising SmolvmError lets the caller
decide whether it's fatal, consistent with how machine_start failures propagate.
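The resulting wait_exec_ready loop can be sketched as follows. The sketch is reconstructed from the diff fragment in the review above (`time.sleep(min(delay, remaining))`, `delay = min(delay * 2, 0.5)`) and from the switch to SmolvmError. It rests on several assumptions:

- The probe is injected as a callable. In the PR it runs `machine exec true`, but the exact smolvm invocation is not shown, so the sketch leaves it abstract.
- The `timeout` and `initial_delay` defaults are invented.
- `SmolvmError` is a local stand-in.
- `clock` and `sleep` are injectable only so the loop can be tested without real waiting.

```python
import time
from typing import Callable


class SmolvmError(RuntimeError):
    """Stand-in for smolvm.SmolvmError; the real class may differ."""


def wait_exec_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,  # hypothetical default
    initial_delay: float = 0.05,  # hypothetical default
    max_delay: float = 0.5,  # cap taken from the reviewed diff
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `probe` until it returns True, backing off exponentially.

    Returns as soon as the exec channel answers; raises SmolvmError once
    `timeout` seconds have passed, so the caller decides whether it's fatal.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        if probe():
            return
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Never sleep past the deadline, and double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
    raise SmolvmError(f"exec channel not ready after {timeout:.1f}s")
```

Compared with a fixed `time.sleep(1.5)`, a fast VM returns after the first successful probe. A slow one keeps being polled until the deadline instead of being assumed ready.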
didericis approved these changes 2026-06-02 02:31:30 -04:00
didericis merged commit a81f0ffa49 into main 2026-06-02 02:31:37 -04:00
didericis deleted branch prd-0032 2026-06-02 02:31:38 -04:00