docs: add Fly Machines case study to remote-docker-vm-isolation note

Concrete worked example covering image strategy (with the bake-the-claude-bottle-image-in optimization that elides 30-90s of in-VM build), cold/warm/hot boot-to-prompt timing, standby vs ephemeral cost breakdown, three workflow patterns, and Fly-specific gotchas (DinD kernel requirements, the y/N preflight blocking automated launch, pricing-may-have-moved hedge).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 01:18:08 -04:00
parent 43453c66ea
commit ec6261cd77
1 changed files with 106 additions and 0 deletions
@@ -139,6 +139,112 @@ For the "VPN pivot" failure mode, see
 remote VM back to your LAN. If the agent needs LAN resources, expose
 those through a narrow API instead.

+## Case study: Fly Machines
+
+Fly.io's Machines product is a useful concrete worked example because
+it satisfies all the provider requirements (root, Firecracker-backed
+isolation, scriptable lifecycle, per-second billing) and surfaces the
+gotchas the abstract pattern leaves implicit.
+
+### Image strategy
+
+Build a custom OCI image `FROM docker:dind` that bakes in:
+
+- The claude-bottle repository checkout.
+- A pre-built `claude-bottle:latest` image, saved via `docker save` on
+  your laptop and loaded in at image-build time
+  (`RUN docker load < claude-bottle.tar`, which only works if a
+  dockerd is running inside that build step) or pushed as a layer into
+  the dind storage. Without this step, the first in-VM `docker build`
+  runs `apt-get` and a global
+  `npm install -g @anthropic-ai/claude-code`, which adds 30–90 s to
+  every cold start.
+- A `CLAUDE_BOTTLE_OAUTH_TOKEN` that is injected at runtime via
+  `flyctl secrets` (not baked into the image layers) and exposed to
+  the VM's PID 1 as an env var.
+- An entrypoint that starts dockerd, waits for it to be healthy, then
+  either drops into a shell or directly runs `cli.py start <agent>`.
+
+Deploy with `flyctl deploy` or `flyctl machine run <image> …`.
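
The entrypoint's "start dockerd, wait for it to be healthy" step can be sketched as a small polling loop. This is a minimal sketch in Python rather than the note's actual entrypoint (which may well be a shell script); `dockerd_is_up` and `wait_until` are hypothetical helper names, and the 30 s timeout is an assumed default. It relies only on `docker info` exiting non-zero while the daemon is unreachable.

```python
import subprocess
import time


def dockerd_is_up() -> bool:
    """Return True once `docker info` can reach the daemon."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # docker CLI is not on PATH
    return result.returncode == 0


def wait_until(check, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

An entrypoint would start dockerd in the background, call `wait_until(dockerd_is_up)`, and exit with an error if it returns `False` before going on to the shell or `cli.py start <agent>`.
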
+
+### Boot-to-first-prompt timing
+
+Three scenarios, all assuming the custom image above (claude-bottle
+image baked in, token injected, no in-VM rebuild):
+
+| Phase | Cold (image not cached on Fly host) | Warm (image cached, `machine run` fresh) | Hot (`machine stop`ped, `machine start`) |
+| --- | --- | --- | --- |
+| Fly schedule + image fetch | 10–30 s | 2–3 s | ~1 s |
+| Firecracker kernel boot | ~1 s | ~1 s | ~1 s (resume) |
+| dockerd-in-VM startup | 2–4 s | 2–4 s | 0 s (already running) |
+| `cli.py start <agent>` housekeeping (network creates, pipelock sidecar, agent container, skill copy) | 4–6 s | 4–6 s | 4–6 s |
+| Claude prints first prompt | 1–3 s | 1–3 s | 1–3 s |
+| **End-to-end** | **~20–45 s** | **~10–17 s** | **~7–11 s** |
+
+For interactive sessions the warm path is the realistic baseline once
+the custom image is registered. The hot path trims only a few extra
+seconds — the question of whether to keep stopped Machines on standby
+is mostly about cost, not speed.
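
As a cross-check, the end-to-end row can be recomputed by summing the per-phase ranges. The sketch below treats each "~1 s" entry as exactly 1 s, which is an assumption; under it the cold path sums to 18–44 s, which the table rounds to ~20–45 s.

```python
# (low, high) seconds per phase, copied from the timing table.
# Order: schedule + fetch, kernel boot, dockerd startup,
# cli.py housekeeping, first prompt.
PHASES = {
    "cold": [(10, 30), (1, 1), (2, 4), (4, 6), (1, 3)],
    "warm": [(2, 3), (1, 1), (2, 4), (4, 6), (1, 3)],
    "hot": [(1, 1), (1, 1), (0, 0), (4, 6), (1, 3)],
}


def end_to_end(phases):
    """Sum the lower and upper bounds of every phase."""
    low = sum(lo for lo, _ in phases)
    high = sum(hi for _, hi in phases)
    return low, high


TOTALS = {name: end_to_end(phases) for name, phases in PHASES.items()}

for name, (low, high) in TOTALS.items():
    print(f"{name}: {low}-{high} s")
```
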
+
+### Cost of standby vs. create-per-session
+
+Stopped Fly Machines stop billing CPU/RAM but continue to bill for
+storage and any allocated IPv4. A reasonable claude-bottle Machine
+size (2 vCPU / 2 GB / ~3 GB rootfs) costs roughly:
+
+| Item | While stopped | Monthly |
+| --- | --- | --- |
+| CPU + RAM | not billed | $0 |
+| Rootfs storage | ~$0.15/GB-month | ~$0.45 |
+| Dedicated IPv4 (if allocated) | $2/month flat | $2.00 |
+| Dedicated IPv6 | free | $0 |
+| Bandwidth | usage-based | $0 |
+
+So **roughly $0.50–$2.50/month per standby Machine**, with the IPv4
+line dominating. Drop the dedicated v4 (use IPv6 or Fly's shared v4
+via WireGuard) and standby falls under $1/month.
+
+For comparison, running the same Machine 24/7 lands in the
+$15–$40/month range depending on size, and the create-and-destroy
+pattern (one Machine per session, destroyed on exit) is effectively
+$0 since you only pay for the seconds it ran.
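
The standby arithmetic is simple enough to sketch. The rates below are the ones quoted in the table and may no longer be current; `standby_monthly_usd` is a hypothetical helper, not part of any Fly tooling.

```python
def standby_monthly_usd(
    rootfs_gb: float = 3.0,
    storage_usd_per_gb_month: float = 0.15,  # rate quoted in the table
    dedicated_ipv4: bool = True,
    ipv4_usd_per_month: float = 2.0,  # flat rate quoted in the table
) -> float:
    """Monthly cost of a stopped Machine: storage plus optional IPv4."""
    storage = rootfs_gb * storage_usd_per_gb_month
    ipv4 = ipv4_usd_per_month if dedicated_ipv4 else 0.0
    return storage + ipv4


with_v4 = standby_monthly_usd()
without_v4 = standby_monthly_usd(dedicated_ipv4=False)
print(f"with dedicated IPv4: ${with_v4:.2f}/month")
print(f"without dedicated IPv4: ${without_v4:.2f}/month")
```

With the quoted rates this gives about $2.45 and $0.45, the two ends of the "roughly $0.50–$2.50" range.
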
+
+### Practical pattern
+
+Two reasonable workflows, plus one that's tempting but worse:
+
+1. **Pure ephemeral.** `flyctl machine run` at session start,
+   `flyctl machine destroy` on exit. ~20–45 s cold start, $0 idle.
+   Maximally isolated; nothing persists between sessions. Best fit
+   when sessions are infrequent or when state continuity across
+   sessions is itself a concern.
+2. **Standby pool.** A small fleet of pre-built Machines that get
+   `start`ed fresh and `destroy`ed (or wiped) per session. The
+   *Machine identity* is short-lived but the image is pre-cached on
+   Fly's hosts, keeping warm-path latency at ~10–17 s.
+   ~$0.50–$1/month per Machine in the pool without dedicated v4.
+3. **Long-lived Machine, just `start`/`stop`.** Saves a few
+   extra seconds (~7–11 s hot) but is the weakest of the three on
+   the isolation axis: the rootfs persists across sessions, so
+   anything a previous session wrote is still there. Avoid unless
+   the saved seconds matter more than the state-continuity concern.
+
+### Fly-specific caveats
+
+- **DinD requires kernel features.** Fly Machines historically had
+  some namespacing quirks for nested Docker; verify on a smoke-test
+  Machine before committing. The pattern is supported (Fly's own
+  Remote Builders use it), but kernel/runtime updates have shifted
+  the requirements over time.
+- **The launcher's interactive y/N preflight blocks automated remote
+  start.** `cli.py start` waits on `/dev/tty`. For an automated entry
+  point you need to drive it from a pty, pipe `y\n` into stdin (which
+  only helps if the prompt falls back to stdin when no tty is
+  present), or add a `--yes`/`--non-interactive` flag (a small patch).
+  The `--remote=user@host` ergonomics direction below would handle
+  this in passing.
+- **Pricing has been re-tariffed multiple times.** The structure
+  (per-second compute, GB-month storage, $2/v4) has been stable;
+  specific rates may have moved. Verify against
+  [fly.io/docs/about/pricing](https://fly.io/docs/about/pricing)
+  before committing numbers to any planning doc.
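
For the y/N preflight caveat, the "drive it from a pty" option can be sketched with Python's standard `pty` module (Unix only). A prompt that opens `/dev/tty` ignores piped stdin, so the child needs a controlling terminal, which `pty.fork()` provides. `run_with_tty_answer` is a hypothetical helper, and the `[y/N]` marker is an assumption about the prompt text, not something confirmed from `cli.py`.

```python
import os
import pty


def run_with_tty_answer(argv, answer=b"y\n", prompt_marker=b"[y/N]"):
    """Run argv under a pseudo-terminal and send `answer` once
    `prompt_marker` shows up in its output.

    Returns (exit_code, captured_output).
    """
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: the new pty is our controlling terminal, so a program
        # that opens /dev/tty talks to the parent through master_fd.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    output = b""
    answered = False
    while True:
        try:
            chunk = os.read(master_fd, 1024)
        except OSError:
            break  # Linux reports EIO once the child side is closed
        if not chunk:
            break  # other platforms report EOF instead
        output += chunk
        if not answered and prompt_marker in output:
            os.write(master_fd, answer)
            answered = True

    os.close(master_fd)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output
```

A call such as `run_with_tty_answer(["python3", "cli.py", "start", "<agent>"])` would then answer the preflight once. A `--yes` flag remains the cleaner fix, since this approach depends on matching the prompt text.
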
+
 ## Optional ergonomics direction

 A future addon — not architecturally necessary, just nicer: