bot-bottle/docs/research/remote-docker-vm-isolation.md

# Remote Docker VM as an isolation upgrade for claude-bottle

Note on the cheapest practical path to stronger isolation than local
Docker: run claude-bottle unchanged on a remote Linux VM that has
dockerd. Complements `stronger-isolation-alternatives.md` (which
surveys runtime swaps like gVisor, Kata, Firecracker, Apple Container)
and `local-vs-remote-agent-execution.md` (which surveys the
local-vs-remote decision broadly).

## Summary

If the goal is "stronger isolation than Docker-on-my-laptop without
rewriting the runtime," the cleanest answer is to keep claude-bottle
exactly as it is and run it on a remote Linux VM where you can install
dockerd. The v1 design — pipelock as a separate container on a
`--internal` network, ephemeral agent containers, OAuth-token
forwarding — works as-is. The only thing that changes is that the
"host" is now a disposable VM you provisioned for the session, not your
laptop.

This is structurally equivalent to a Firecracker rewrite (Rung 3 in
`stronger-isolation-alternatives.md`), but the cloud provider operates
the runtime for you. It is also strictly cheaper than adopting a cloud
sandbox SDK (Vercel Sandbox, E2B, Cloudflare Sandbox SDK) because you
keep the existing Docker-shaped abstractions instead of swapping them
for a vendor API.

## The argument

### What changes in the threat model

The agent's blast radius shrinks from "developer laptop + everything
on the LAN" to "this disposable VM." Concretely, what's no longer
reachable on container escape:

- `~/.ssh/`, `~/.aws/credentials`, `~/.config/gh`, the macOS Keychain
- Browser cookies and session state
- Other dev machines on the home/office LAN
- NAS, printers, smart-home devices, anything else on the local network

What replaces it on the remote side: only what the operator chose to
ship to the VM for the session. Typically that is the OAuth token,
optional SSH keys for the bottle, the manifest, and the workspace if
the agent needs one. The VM's copies of these are gone once the VM is
destroyed.

### Why the boundary is equivalent to v1, not weaker

A natural objection — raised in the design discussion that produced
this note — is that running pipelock and the agent on the same VM
collapses a network boundary into a kernel-namespace boundary, which
sounds weaker. It is not, *if you reuse Docker for the inner topology.*

Docker on the remote VM gives the agent and pipelock their own network
namespaces by default, with the agent attached to a `--internal`
network and pipelock straddling it and an egress bridge. That is the
same v1 topology. Bypassing pipelock from the agent requires the same
class of attack as bypassing it on a laptop: a kernel-level netns
escape inside the VM. The only difference is that the kernel under
attack belongs to a disposable VM, not the developer's machine.

In other words: the "weaker because colocated" framing only applies if
you naively run agent and pipelock as two processes in the same
namespace. With Docker on the VM, you don't.
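
The topology described above can be sketched as the sequence of docker CLI calls that would build it. This is a minimal illustration, not the launcher's actual code: the network names, container names, and the `pipelock:latest` image tag are assumptions made for the example. Only the docker subcommands and flags (`network create --internal`, `network connect`, `run --network`) are standard.

```python
"""Sketch: docker commands for an internal agent network plus an egress sidecar.

All names (bottle-internal, bottle-egress, pipelock, agent) are hypothetical.
"""


def topology_commands(agent_image, pipelock_image,
                      internal_net="bottle-internal",
                      egress_net="bottle-egress"):
    """Return docker CLI argv lists in the order they would be run."""
    return [
        # Internal network: containers attached only here have no outside route.
        ["docker", "network", "create", "--internal", internal_net],
        # Ordinary bridge network with outbound connectivity.
        ["docker", "network", "create", egress_net],
        # pipelock starts on the egress side ...
        ["docker", "run", "-d", "--name", "pipelock",
         "--network", egress_net, pipelock_image],
        # ... and is then attached to the internal network as well.
        ["docker", "network", "connect", internal_net, "pipelock"],
        # The agent is attached to the internal network only.
        ["docker", "run", "-d", "--name", "agent",
         "--network", internal_net, agent_image],
    ]


if __name__ == "__main__":
    for argv in topology_commands("claude-bottle:latest", "pipelock:latest"):
        print(" ".join(argv))
```

The point of the sketch is that nothing in this list depends on whether the Docker daemon runs on a laptop or on a remote VM; the agent container never joins the egress network in either case.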

### Why this is cheaper than the alternatives

| Path | Effort | Where the VM-grade boundary comes from |
| --- | --- | --- |
| gVisor (`runsc`) per bottle | ~1–2 days | Userspace syscall barrier; not a full VM |
| Kata Containers per bottle | ~1–2 days, Linux-only | Kata's microVM-per-container |
| Firecracker rewrite | 2–4 weeks | Self-operated Firecracker |
| Apple Container (macOS) | ~1 week spike + integration | Apple's Virtualization.framework, per-container |
| Cloud sandbox SDK (Vercel, E2B, …) | Days–weeks of API rewrite + lock-in | Provider-operated Firecracker / equivalent |
| **Remote Docker VM (this note)** | **0 lines of code** | **Cloud-provider hypervisor under the VM** |

`stronger-isolation-alternatives.md` concludes that gVisor is the
right step for today and Apple Container is probably the right v2.
This note adds a third option that sits orthogonal to both: don't
change the runtime, change the host. Use it when the failure mode you
care about is "agent compromises my laptop" specifically, rather than
"agent escapes Docker into a kernel I share with other workloads."

## What the provider has to give you

Not every cloud sandbox is suitable. The minimum for this approach to
work:

- Root or rootless-Docker capability inside the VM. Rules out
  Fargate-style locked-down container hosts and most "function" tier
  FaaS. Verify before committing — Vercel Sandbox specifically may or
  may not allow installing dockerd depending on tier; Fly Machines,
  EC2, GCE, Hetzner, Linode, and self-hosted hypervisors give you full
  control.
- Enough disk + RAM to host the claude-bottle image, the agent
  container, and the pipelock sidecar. Headroom of ~2–4 GB RAM and
  ~5 GB disk is comfortable; less works for short sessions.
- An interactive way to reach the VM. SSH is fine. The launcher uses
  `docker exec -it`, so any TTY-capable session works.
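
The requirements above can be checked mechanically before a session starts. The following is a rough preflight sketch; the function names and the thresholds (2 GB RAM, 5 GB free disk, taken from the headroom figures above) are assumptions for illustration, not part of claude-bottle.

```python
"""Sketch: preflight check for a candidate VM (hypothetical helper names)."""

GIB = 1024 ** 3


def mem_total_gib(meminfo_text):
    """Parse MemTotal (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024 / GIB
    raise ValueError("MemTotal not found")


def preflight(meminfo_text, free_disk_bytes, docker_path,
              min_ram_gib=2.0, min_disk_gib=5.0):
    """Return a list of problems; an empty list means the VM looks usable."""
    problems = []
    if docker_path is None:
        problems.append("docker CLI not found on PATH")
    ram = mem_total_gib(meminfo_text)
    if ram < min_ram_gib:
        problems.append(f"only {ram:.1f} GiB RAM, want {min_ram_gib}")
    disk = free_disk_bytes / GIB
    if disk < min_disk_gib:
        problems.append(f"only {disk:.1f} GiB free disk, want {min_disk_gib}")
    return problems


if __name__ == "__main__":
    sample = "MemTotal:        4194304 kB\nMemFree:         1000000 kB\n"
    print(preflight(sample, 6 * GIB, "/usr/bin/docker") or "ok")
```

On a real VM the inputs would come from the contents of `/proc/meminfo`, `shutil.disk_usage("/").free`, and `shutil.which("docker")`. The check does not prove that dockerd can actually start; that still needs a smoke test.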

## What you give up

- **Typing latency.** Interactive Claude sessions over SSH have visible
  per-keystroke latency; usually fine on wired/fiber, less fine on
  Wi-Fi-to-cloud. Mosh helps if it's bothersome.
- **Token shipping.** `CLAUDE_BOTTLE_OAUTH_TOKEN` has to live on the
  remote box for the launcher to forward it into containers. Use the
  provider's secret-injection path (cloud-init user-data,
  `flyctl secrets`, Tailscale-served local file, etc.). Never put the
  token on the SSH command line; it ends up in the local shell
  history, is visible in process listings (`ps`) while the command
  runs, and may land in server-side logs.
- **Idle cost.** Unless the VM is torn down between sessions, you pay
  for it sitting idle. Ephemeral provisioning (one VM per session,
  destroyed on exit) is the cheaper and more secure pattern; see
  `local-vs-remote-agent-execution.md` on why ephemeral is also
  recommended for credential-concentration reasons.
- **Source code goes to the VM.** Same as any remote-execution
  topology. If the project is under NDA, the VM provider matters.
- **Provider trust.** Multi-tenancy side channels, supply-chain
  compromise of the provider, insider risk. Generally smaller than
  laptop-kernel-CVE risk, but the failure mode (provider-wide breach)
  is correlated across all your sandboxes.
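
For the token-shipping point above, one way to keep the token off any command line is to read it from a permission-restricted file and hand it to the launcher through the process environment only. This is a sketch under assumptions: `read_token` and `launch_spec` are hypothetical helpers, and the `./cli.py start <agent>` invocation is taken from the workflow in this note.

```python
"""Sketch: pass the OAuth token via env, never via argv (hypothetical helpers)."""
import os
import stat


def read_token(path):
    """Read the token from a file, refusing files accessible by group/others."""
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must not be accessible by group/others")
    with open(path) as f:
        return f.read().strip()


def launch_spec(token, agent, base_env=None):
    """Build (argv, env) for the launcher; the token appears only in env."""
    env = dict(os.environ if base_env is None else base_env)
    env["CLAUDE_BOTTLE_OAUTH_TOKEN"] = token
    argv = ["./cli.py", "start", agent]
    return argv, env


if __name__ == "__main__":
    argv, env = launch_spec("dummy-token", "my-agent", base_env={})
    print(argv)  # the token is absent from the command line
```

The pair would then be passed to `subprocess.run(argv, env=env)`. Environment variables are still readable by the same user and by root on the VM, so this narrows exposure rather than eliminating it.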

## Operational shape

The minimum-viable workflow, no claude-bottle code changes:

1. `terraform apply` / `flyctl machine run` / `gcloud compute
   instances create` — provision a fresh Linux VM.
2. Install dockerd via the provider's image or a one-liner
   (`curl -fsSL https://get.docker.com | sh`).
3. SSH in.
4. `git clone` claude-bottle on the VM, drop a manifest in place,
   inject `CLAUDE_BOTTLE_OAUTH_TOKEN` via the provider's secrets path.
5. `./cli.py start <agent>` — the existing launcher handles the rest.
6. On exit: destroy the VM. No host artifacts persist.
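
The steps above hinge on step 6 always running, even when the session fails partway. A minimal sketch of that guarantee uses a context manager with provider-specific `provision` and `destroy` callables injected by the operator; both callables are hypothetical stand-ins for whatever wraps `terraform`, `flyctl`, or `gcloud`.

```python
"""Sketch: one VM per session, destroyed on exit (provider hooks are hypothetical)."""
from contextlib import contextmanager


@contextmanager
def ephemeral_vm(provision, destroy):
    """Provision a VM, yield its handle, and always destroy it afterwards."""
    vm = provision()
    try:
        yield vm
    finally:
        # Runs on normal exit and on exceptions alike.
        destroy(vm)


if __name__ == "__main__":
    log = []

    def fake_provision():
        log.append("provision")
        return "vm-1"

    def fake_destroy(vm):
        log.append(f"destroy {vm}")

    with ephemeral_vm(fake_provision, fake_destroy) as vm:
        log.append(f"session on {vm}")
    print(log)
```

A `finally` block does not cover a killed laptop or a dropped connection, so a provider-side auto-stop or TTL is a reasonable second line of defence.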

For the "VPN pivot" failure mode, see
`local-vs-remote-agent-execution.md`. Short version: never VPN the
remote VM back to your LAN. If the agent needs LAN resources, expose
those through a narrow API instead.

## Case study: Fly Machines

Fly.io's Machines product is a useful concrete worked example because
it satisfies all the provider requirements (root, Firecracker-backed
isolation, scriptable lifecycle, per-second billing) and surfaces the
gotchas the abstract pattern leaves implicit.

### Image strategy

Build a custom OCI image `FROM docker:dind` that contains:

- The claude-bottle repository checkout.
- A pre-built `claude-bottle:latest` image, saved via `docker save` on
  your laptop and copied into the image as a tarball. `docker load`
  needs a running daemon, so a bare `RUN docker load` fails at
  image-build time; have the entrypoint run
  `docker load -i claude-bottle.tar` once dockerd is up, or otherwise
  pre-populate the dind storage. Without a pre-built image, the first
  in-VM `docker build` runs `apt-get` and a global
  `npm install -g @anthropic-ai/claude-code`, which adds 30–90 s to
  every cold start.
- An entrypoint that starts dockerd, waits for it to be healthy, then
  either drops into a shell or directly runs `cli.py start <agent>`.

The token is not baked into the image: set `CLAUDE_BOTTLE_OAUTH_TOKEN`
with `flyctl secrets`, which exposes it to the VM's PID 1 as an env
var at boot.

Deploy with `flyctl deploy` or `flyctl machine run <image>`.
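
The "waits for it to be healthy" part of the entrypoint can be sketched as a polling loop around `docker info`, which exits non-zero until the daemon answers. This is an illustration only: `docker_ready` and `wait_until` are hypothetical helper names, and the timeout values are arbitrary.

```python
"""Sketch: wait for dockerd to come up before starting the launcher."""
import subprocess
import time


def docker_ready():
    """True once `docker info` succeeds, i.e. the daemon is answering."""
    try:
        result = subprocess.run(["docker", "info"],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        return False  # docker CLI is not installed
    return result.returncode == 0


def wait_until(check, timeout=30.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

An entrypoint would call `wait_until(docker_ready)` after launching dockerd in the background and abort if it returns `False`, rather than letting `cli.py start` fail with a less obvious error.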

### Boot-to-first-prompt timing

Three scenarios, all assuming the custom image above (claude-bottle
image baked in, token injected, no in-VM rebuild):

| Phase | Cold (image not cached on Fly host) | Warm (image cached, `machine run` fresh) | Hot (`machine stop`ped, `machine start`) |
| --- | --- | --- | --- |
| Fly schedule + image fetch | 10–30 s | 2–3 s | ~1 s |
| Firecracker kernel boot | ~1 s | ~1 s | ~1 s (resume) |
| dockerd-in-VM startup | 2–4 s | 2–4 s | 0 s (already running) |
| `cli.py start <agent>` housekeeping (network creates, pipelock sidecar, agent container, skill copy) | 4–6 s | 4–6 s | 4–6 s |
| Claude prints first prompt | 1–3 s | 1–3 s | 1–3 s |
| **End-to-end** | **~20–45 s** | **~10–17 s** | **~7–11 s** |

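For interactive sessions the warm path is the realistic baseline once
the custom image is registered. The hot path trims only a few extra
seconds, so the question of whether to keep stopped Machines on
standby is mostly about cost, not speed. The 0 s dockerd figure in the
hot column assumes the Machine comes back with its processes intact (a
suspend-style resume); a Machine that was fully stopped boots again
and likely pays the 2–4 s dockerd startup as well.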
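
The end-to-end row can be reproduced by summing the per-phase ranges from the table. The sketch below only restates the table's own estimates as (low, high) pairs in seconds; it adds no new measurements.

```python
"""Sketch: sum the per-phase timing ranges from the table above."""

# (low, high) seconds per phase, in table order:
# schedule + fetch, kernel boot, dockerd startup, launcher housekeeping, first prompt
PHASES = {
    "cold": [(10, 30), (1, 1), (2, 4), (4, 6), (1, 3)],
    "warm": [(2, 3), (1, 1), (2, 4), (4, 6), (1, 3)],
    "hot": [(1, 1), (1, 1), (0, 0), (4, 6), (1, 3)],
}


def end_to_end(phases):
    """Return the (low, high) total for a list of phase ranges."""
    return (sum(lo for lo, _ in phases), sum(hi for _, hi in phases))


if __name__ == "__main__":
    for name, phases in PHASES.items():
        lo, hi = end_to_end(phases)
        print(f"{name}: {lo}-{hi} s")
```

The cold sum comes out at 18–44 s, which the table rounds to roughly 20–45 s; the warm and hot sums match the table exactly.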

### Cost of standby vs. create-per-session

Stopped Fly Machines stop billing CPU/RAM but continue to bill for
storage and any allocated IPv4. A reasonable claude-bottle Machine
size (2 vCPU / 2 GB / ~3 GB rootfs) costs roughly:

| Item | Billing while stopped | Monthly cost (3 GB rootfs) |
| --- | --- | --- |
| CPU + RAM | not billed | $0 |
| Rootfs storage | ~$0.15/GB-month | ~$0.45 |
| Dedicated IPv4 (if allocated) | $2/month flat | $2.00 |
| Dedicated IPv6 | free | $0 |
| Bandwidth | usage-based | $0 |

So **roughly $0.50–$2.50/month per standby Machine**, with the IPv4
line dominating. Drop the dedicated v4 (use IPv6 or Fly's shared v4
via WireGuard) and standby falls under $1/month.

For comparison, running the same Machine 24/7 lands in the
$15–$40/month range depending on size, and the create-and-destroy
pattern (one Machine per session, destroyed on exit) has effectively
$0 idle cost since you only pay for the seconds it ran.
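
The standby figures above follow from simple arithmetic on the table's rates. The sketch below uses those same rates as defaults; they are the note's estimates, not authoritative pricing, and the function name is made up for the example.

```python
"""Sketch: monthly cost of one stopped Machine, using the rates quoted above."""


def standby_monthly_usd(rootfs_gb, storage_rate=0.15,
                        dedicated_ipv4=False, ipv4_flat=2.00):
    """Storage is billed per GB-month; a dedicated IPv4 adds a flat fee."""
    total = rootfs_gb * storage_rate
    if dedicated_ipv4:
        total += ipv4_flat
    return round(total, 2)


if __name__ == "__main__":
    print(standby_monthly_usd(3))                       # storage only
    print(standby_monthly_usd(3, dedicated_ipv4=True))  # storage + IPv4
```

With a 3 GB rootfs this gives about $0.45 without a dedicated IPv4 and about $2.45 with one, which is where the $0.50–$2.50 range comes from.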

### Practical pattern

Two reasonable workflows, plus one that's tempting but worse:

1. **Pure ephemeral.** `flyctl machine run` at session start,
   `flyctl machine destroy` on exit. ~20–45 s cold start, $0 idle.
   Maximally isolated; nothing persists between sessions. Best fit
   when sessions are infrequent or when state continuity across
   sessions is itself a concern.
2. **Standby pool.** A small fleet of pre-built Machines that get
   `start`ed fresh and `destroy`ed (or wiped) per session. The
   *Machine identity* is short-lived but the image is pre-cached on
   Fly's hosts, keeping warm-path latency at ~10–17 s.
   ~$0.50–$1/month per Machine in the pool without dedicated v4.
3. **Permanently stopped Machine, just `start`/`stop`.** Saves a few
   extra seconds (~7–11 s hot) but is the weakest of the three on
   the isolation axis — the rootfs persists across sessions, so
   anything a previous session wrote is still there. Avoid unless
   the saved seconds matter more than the state-continuity concern.

### Fly-specific caveats

- **DinD requires kernel features.** Fly Machines historically had
  some namespacing quirks for nested Docker; verify on a smoke-test
  Machine before committing. The pattern is supported (Fly's own
  Remote Builders use it), but kernel/runtime updates have shifted
  the requirements over time.
- **The launcher's interactive y/N preflight blocks automated remote
  start.** `cli.py start` waits on `/dev/tty`. Because that prompt
  reads the terminal rather than stdin, piping `y\n` into stdin is
  unlikely to satisfy it; for an automated entry point, drive the
  launcher from a pty or add a `--yes`/`--non-interactive` flag (a
  small patch). The `--remote=user@host` ergonomics direction below
  would handle this in passing.
- **Pricing has been re-tariffed multiple times.** The structure
  (per-second compute, GB-month storage, $2/v4) has been stable;
  specific rates may have moved. Verify against
  [fly.io/docs/about/pricing](https://fly.io/docs/about/pricing)
  before committing numbers to any planning doc.

## Optional ergonomics direction

A future addon — not architecturally necessary, just nicer:

- `cli.py start --remote=user@host <agent>` that:
  - rsyncs the manifest and (optionally) cwd to the remote
  - SSHes in with the OAuth token forwarded via `SendEnv` (the remote
    sshd must list the variable in `AcceptEnv` for this to work)
  - runs `cli.py start <agent>` on the remote
  - forwards the TTY for the interactive session
  - on exit, optionally tears down the remote VM via a provider hook
    (`flyctl machine destroy`, `terraform destroy`, etc.)

This is roughly a day of work and would make the remote pattern feel
like a single launcher invocation. It is the only piece of remote
support that would benefit from being upstreamed; everything else is
operator workflow.
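
A rough sketch of how the flag parsing and the remote commands might fit together is below. It is not a proposed implementation: the manifest filename `manifest.yaml`, the remote directory `claude-bottle`, and the helper names are assumptions, and `SendEnv` only works if the remote sshd accepts the variable via `AcceptEnv`.

```python
"""Sketch: parse --remote and build the rsync/ssh commands (paths are hypothetical)."""
import argparse
import shlex


def parse_args(argv):
    """Parse `start [--remote=USER@HOST] <agent>`."""
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)
    start = sub.add_parser("start")
    start.add_argument("--remote", metavar="USER@HOST", default=None)
    start.add_argument("agent")
    return parser.parse_args(argv)


def remote_commands(remote, agent,
                    manifest="manifest.yaml", remote_dir="claude-bottle"):
    """Return argv lists: copy the manifest, then start the agent over SSH."""
    run = f"cd {shlex.quote(remote_dir)} && ./cli.py start {shlex.quote(agent)}"
    return [
        ["rsync", "-az", manifest, f"{remote}:{remote_dir}/"],
        # -t allocates a TTY for the interactive session; SendEnv forwards
        # the token from the local environment instead of the command line.
        ["ssh", "-t", "-o", "SendEnv=CLAUDE_BOTTLE_OAUTH_TOKEN", remote, run],
    ]


if __name__ == "__main__":
    args = parse_args(["start", "--remote=dev@vm.example", "my-agent"])
    for cmd in remote_commands(args.remote, args.agent):
        print(cmd)
```

Teardown via a provider hook is left out here; it would wrap these two commands in a try/finally so the VM is destroyed even when the session exits abnormally.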

## Recommendation

For users who want stronger isolation than local Docker without
rewriting the runtime, this is probably the right answer. Cleaner than
gVisor (which only adds a syscall barrier on the same kernel), cleaner
than a Firecracker rewrite (which is weeks of work), cleaner than
adopting a cloud-sandbox SDK (which trades the v1 design for a vendor
API). The pre-existing `local-vs-remote-agent-execution.md` decision
heuristics still apply for *whether* this is worth the operational
overhead in any given setting.

If we wanted to land this as a real project direction:

1. Add a short "Running claude-bottle on a remote Docker VM" section
   to the README pointing at this doc.
2. Optionally: prototype the `--remote=user@host` launcher flag.
3. Update `stronger-isolation-alternatives.md` to mention the remote
   Docker VM as an additional path, since the survey is otherwise
   incomplete.

## Caveats

- "Just install Docker" isn't free on every provider; some lock down
  what kernel modules and caps the VM has. Spike-test before committing.
- Multi-tenant cloud hypervisors (EC2, GCE, Vercel) have their own
  side-channel and supply-chain risk surfaces, separately bounded from
  the laptop-kernel risk this approach addresses.
- The remote-VM topology still does not protect source code or secrets
  from the cloud provider — it protects them from a kernel exploit
  reaching the developer's laptop. Different fear, different fix.
- Research conducted 2026-05-10.