docs: add Fly Machines case study to remote-docker-vm-isolation note
test / run tests/run_tests.py (push) Successful in 13s
Concrete worked example covering image strategy (with the
bake-the-claude-bottle-image-in optimization that elides 30–90 s of
in-VM build), cold/warm/hot boot-to-prompt timing, standby vs ephemeral
cost breakdown, three workflow patterns, and Fly-specific gotchas (DinD
kernel requirements, the y/N preflight blocking automated launch,
pricing-may-have-moved hedge).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -139,6 +139,112 @@ For the "VPN pivot" failure mode, see
remote VM back to your LAN. If the agent needs LAN resources, expose
those through a narrow API instead.

## Case study: Fly Machines

Fly.io's Machines product is a useful concrete worked example because
it satisfies all the provider requirements (root, Firecracker-backed
isolation, scriptable lifecycle, per-second billing) and surfaces the
gotchas the abstract pattern leaves implicit.

### Image strategy

Build a custom OCI image `FROM docker:dind` that bakes in:

- The claude-bottle repository checkout.
- A pre-built `claude-bottle:latest` image, saved via `docker save` on
  your laptop and loaded in at image-build time
  (`RUN docker load < claude-bottle.tar`, which needs dockerd running
  during that build step) or pushed as a layer into the dind storage.
  Without this step, the first in-VM `docker build` runs `apt-get` and
  a global `npm install -g @anthropic-ai/claude-code`, which adds
  30–90 s to every cold start.
- A `flyctl secrets`-injected `CLAUDE_BOTTLE_OAUTH_TOKEN`, exposed to
  the VM's PID 1 as an env var.
- An entrypoint that starts dockerd, waits for it to be healthy, then
  either drops into a shell or directly runs `cli.py start <agent>`.

Deploy with `flyctl deploy` or `flyctl machine run --image …`.

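The "waits for it to be healthy" step of the entrypoint can be sketched as a small polling loop. This is a minimal sketch, not the project's actual entrypoint: the function names `wait_until_healthy` and `docker_info_ok`, the 30 s timeout, and the choice of `docker info` as the health probe are all assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def docker_info_ok() -> bool:
    """Assumed probe: `docker info` exits 0 once dockerd answers on its socket."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI is not installed in this environment.
        return False
    return result.returncode == 0
```

An entrypoint built on this would start dockerd in the background, call `wait_until_healthy(docker_info_ok)`, and only then hand off to a shell or `cli.py start <agent>`. The clock and sleep functions are injectable so the loop can be exercised without a real daemon.
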
### Boot-to-first-prompt timing

Three scenarios, all assuming the custom image above (claude-bottle
image baked in, token injected, no in-VM rebuild):

| Phase | Cold (image not cached on Fly host) | Warm (image cached, `machine run` fresh) | Hot (`machine stop`ped, `machine start`) |
| --- | --- | --- | --- |
| Fly schedule + image fetch | 10–30 s | 2–3 s | ~1 s |
| Firecracker kernel boot | ~1 s | ~1 s | ~1 s (resume) |
| dockerd-in-VM startup | 2–4 s | 2–4 s | 0 s (already running) |
| `cli.py start <agent>` housekeeping (network creates, pipelock sidecar, agent container, skill copy) | 4–6 s | 4–6 s | 4–6 s |
| Claude prints first prompt | 1–3 s | 1–3 s | 1–3 s |
| **End-to-end** | **~20–45 s** | **~10–17 s** | **~7–11 s** |

For interactive sessions the warm path is the realistic baseline once
the custom image is registered. The hot path trims only a few extra
seconds, so the question of whether to keep stopped Machines on standby
is mostly about cost, not speed.

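The end-to-end row is just the per-phase ranges added up, which the following sketch reproduces. The phase values are copied from the timing table, with each "~1 s" entry treated as exactly 1 s; the names `PHASES`, `end_to_end`, and `TOTALS` are illustrative only.

```python
# (low, high) seconds per phase, in table order:
# schedule + fetch, kernel boot, dockerd startup, cli.py housekeeping, first prompt.
PHASES = {
    "cold": [(10, 30), (1, 1), (2, 4), (4, 6), (1, 3)],
    "warm": [(2, 3), (1, 1), (2, 4), (4, 6), (1, 3)],
    "hot": [(1, 1), (1, 1), (0, 0), (4, 6), (1, 3)],
}


def end_to_end(phases):
    """Sum the low and high bounds separately to get the total range."""
    low = sum(lo for lo, _ in phases)
    high = sum(hi for _, hi in phases)
    return (low, high)


TOTALS = {name: end_to_end(phases) for name, phases in PHASES.items()}
```

The warm and hot sums match the table exactly (10–17 s and 7–11 s). The cold sum comes out at 18–44 s, which the table rounds to ~20–45 s.
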
### Cost of standby vs. create-per-session

Stopped Fly Machines stop billing CPU/RAM but continue to bill for
storage and any allocated IPv4. A reasonable claude-bottle Machine
size (2 vCPU / 2 GB / ~3 GB rootfs) costs roughly:

| Item | Rate while stopped | Monthly cost |
| --- | --- | --- |
| CPU + RAM | not billed | $0 |
| Rootfs storage (~3 GB) | ~$0.15/GB-month | ~$0.45 |
| Dedicated IPv4 (if allocated) | $2/month flat | $2.00 |
| Dedicated IPv6 | free | $0 |
| Bandwidth | usage-based | $0 |

So **roughly $0.50–$2.50/month per standby Machine**, with the IPv4
line dominating. Drop the dedicated v4 (use IPv6 or Fly's shared v4
via WireGuard) and standby falls under $1/month.

For comparison, running the same Machine 24/7 lands in the
$15–$40/month range depending on size, and the create-and-destroy
pattern (one Machine per session, destroyed on exit) has effectively
$0 idle cost, since you pay only for the seconds it ran.

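The standby figure can be reproduced with a few lines of arithmetic. This is a sketch using the rates quoted in the cost table as defaults; those rates may have changed, and the function name `standby_monthly_cost` and its parameters are assumptions for illustration.

```python
def standby_monthly_cost(
    rootfs_gb: float = 3.0,
    storage_rate_per_gb: float = 0.15,  # rate quoted in the table; verify before use
    dedicated_ipv4: bool = True,
    ipv4_flat: float = 2.00,  # rate quoted in the table; verify before use
) -> float:
    """Monthly cost in USD of one stopped Machine (CPU/RAM are not billed)."""
    storage = rootfs_gb * storage_rate_per_gb
    ipv4 = ipv4_flat if dedicated_ipv4 else 0.0
    return round(storage + ipv4, 2)


with_v4 = standby_monthly_cost(dedicated_ipv4=True)
without_v4 = standby_monthly_cost(dedicated_ipv4=False)
```

With the defaults, storage alone is about $0.45/month and the dedicated IPv4 adds the flat $2, which is where the rough $0.50–$2.50 range comes from.
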
### Practical pattern

Two reasonable workflows, plus one that's tempting but worse:

1. **Pure ephemeral.** `flyctl machine run` at session start,
   `flyctl machine destroy` on exit. ~20–45 s cold start, $0 idle.
   Maximally isolated; nothing persists between sessions. Best fit
   when sessions are infrequent or when state continuity across
   sessions is itself a concern.
2. **Standby pool.** A small fleet of pre-built Machines that get
   `start`ed fresh and `destroy`ed (or wiped) per session. The
   *Machine identity* is short-lived but the image is pre-cached on
   Fly's hosts, keeping warm-path latency at ~10–17 s.
   ~$0.50–$1/month per Machine in the pool without dedicated v4.
3. **Permanently stopped Machine, just `start`/`stop`.** Saves a few
   extra seconds (~7–11 s hot) but is the weakest of the three on
   the isolation axis: the rootfs persists across sessions, so
   anything a previous session wrote is still there. Avoid unless
   the saved seconds matter more than the state-continuity concern.

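The pure-ephemeral workflow depends on the destroy step running even when a session crashes. One way to get that guarantee is a context manager, sketched below. The `create` and `destroy` callables are hypothetical hooks that a caller would wire to `flyctl machine run` and `flyctl machine destroy`; how the Machine ID is obtained from flyctl is deliberately left out, since the source does not specify it.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_machine(
    create: Callable[[], str],
    destroy: Callable[[str], None],
) -> Iterator[str]:
    """Create a Machine for one session and always destroy it afterwards."""
    machine_id = create()
    try:
        yield machine_id
    finally:
        # Runs on normal exit and on exceptions, so no Machine is left behind.
        destroy(machine_id)
```

A session would then run inside `with ephemeral_machine(create, destroy) as machine_id:`, and nothing survives the block, which matches the "nothing persists between sessions" property of pattern 1.
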
### Fly-specific caveats

- **DinD requires kernel features.** Fly Machines historically had
  some namespacing quirks for nested Docker; verify on a smoke-test
  Machine before committing. The pattern is supported (Fly's own
  Remote Builders use it), but kernel/runtime updates have shifted
  the requirements over time.
- **The launcher's interactive y/N preflight blocks automated remote
  start.** `cli.py start` waits on `/dev/tty`. For an automated entry
  point you need to pipe `y\n` into stdin, drive it from a pty, or
  add a `--yes`/`--non-interactive` flag (a small patch). The
  `--remote=user@host` ergonomics direction below would handle this
  in passing.
- **Pricing has been re-tariffed multiple times.** The structure
  (per-second compute, GB-month storage, $2/v4) has been stable;
  specific rates may have moved. Verify against
  [fly.io/docs/about/pricing](https://fly.io/docs/about/pricing)
  before committing numbers to any planning doc.

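The first workaround for the y/N preflight, piping `y\n` into stdin, might look like the sketch below. It is demonstrated against a stand-in child process rather than the real launcher, and `run_with_auto_yes` and `STAND_IN` are hypothetical names. One caveat to keep in mind: if the prompt reads `/dev/tty` directly rather than stdin, piped input never reaches it, and the pty or flag-based approaches are needed instead.

```python
import subprocess
import sys
from typing import Sequence


def run_with_auto_yes(
    cmd: Sequence[str], timeout_s: float = 60.0
) -> subprocess.CompletedProcess:
    """Run `cmd`, answering a single y/N prompt on stdin with "y"."""
    return subprocess.run(
        list(cmd),
        input="y\n",  # the canned confirmation
        text=True,
        capture_output=True,  # drop this to let the child's output stream through
        timeout=timeout_s,
        check=False,
    )


# Stand-in for the launcher: reads one line from stdin and echoes it back.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; print('got', sys.stdin.readline().strip())",
]
```

For the real launcher the command would be something like the `cli.py start <agent>` invocation, with the agent name supplied by the caller.
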
## Optional ergonomics direction

A future addon, not architecturally necessary, just nicer: