docs: add research note on stronger isolation alternatives

Surveys gVisor, Kata, Firecracker, and Apple Container as replacements or
complements to Docker+runc, with concrete file-level migration notes for
this codebase and a recommended rung-by-rung path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@@ -0,0 +1,243 @@
# Stronger isolation alternatives: gVisor, Kata, Firecracker, Apple Container

Research into what it would take to replace or augment Docker (with `runc`)
as the agent runtime in claude-bottle, and what each option would actually
buy in security terms vs. cost in launcher rewrite.

## Summary

There is a ladder, not a menu. Three realistic rungs, ordered by effort:

1. **gVisor (`runsc`)**: flip a runtime flag per bottle. ~1–2 days. Adds a
   userspace syscall boundary; blocks most kernel-CVE escape classes.
2. **Kata Containers**: flip a runtime flag per bottle. Same Docker UX,
   real microVM underneath. Linux-host only.
3. **Firecracker direct**: replace Docker as the runtime entirely. Weeks
   of work. Strongest boundary, no macOS support.

A fourth option, **Apple Container**, is the right macOS-native answer to
"I want Kata's isolation model without giving up MacBooks as the dev
target." Probably the right v2 if claude-bottle keeps macOS in scope.

The pipelock egress design is portable across all four: every option can
provide a network primitive that means "no default route except through
the proxy" (Docker `--internal` networks, Kata's virtualized bridge,
TAP-only Firecracker, and Apple Container's per-VM networking, the last
of which still needs to be verified). Whichever rung is chosen, the
security-load-bearing part of the v1 design survives.

## Threat model recap

The current v1 boundary is a single `node:22-slim` container running as
uid 1000 under `runc`, sharing a kernel with the host. This protects
against:

- accidental host-filesystem access by Claude Code,
- network egress not approved by the pipelock allowlist,
- a misbehaving but uncoordinated agent.

It does not protect against:

- a kernel-level container escape (the Dirty Pipe or runc-CVE class),
- a coordinated attacker with code execution inside the container who
  targets the host kernel,
- side channels accessible from the shared kernel.

Stronger isolation closes the gaps in the second list. Whether that's
worth the effort depends on whether you trust the agent's code-execution
surface more or less than you trust the host kernel.

## Rung 1: gVisor (`runsc`)

gVisor is a userspace kernel that registers as a Docker runtime. The
agent's syscalls are intercepted and re-implemented in Go rather than
forwarded to the host kernel.

### What changes in this codebase

- `claude_bottle/cli/start.py` (where `docker run` is assembled): add
  `--runtime=runsc` to the container args when the bottle requests it.
  Make it configurable: `bottles.<name>.runtime: "runsc" | "runc"`,
  default `runc`.
- `claude_bottle/docker.py`: add a `require_runsc()` check that runs
  `docker info --format '{{.Runtimes}}'` once and dies with an install
  pointer if `runsc` isn't registered.
- `network.py`, `pipelock.py`, `skills.py`, `ssh.py`: **no changes**.
  Docker networks, `docker exec`, `docker cp`, volume mounts, and the
  pipelock sidecar all still work because gVisor is invisible at the
  Docker API layer.
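
The first two bullets above can be sketched roughly as follows. This is a
minimal sketch, not the repo's actual code: `ALLOWED_RUNTIMES`,
`runtime_for`, `runtime_args`, and `registered_runtimes` are hypothetical
names, and the bottle config is assumed to be a plain dict. It uses
`{{json .Runtimes}}` instead of `{{.Runtimes}}` only so the output can be
parsed as JSON.

```python
import json
import subprocess

# Assumption: only the two values named in the note are accepted.
ALLOWED_RUNTIMES = ("runc", "runsc")


def runtime_for(bottle_cfg: dict) -> str:
    """Return the bottle's configured runtime, defaulting to runc."""
    runtime = bottle_cfg.get("runtime", "runc")
    if runtime not in ALLOWED_RUNTIMES:
        raise ValueError(
            f"unsupported runtime {runtime!r}; expected one of {ALLOWED_RUNTIMES}"
        )
    return runtime


def runtime_args(runtime: str) -> list[str]:
    """Extra `docker run` arguments for the chosen runtime."""
    # runc is Docker's default runtime, so it needs no flag.
    return [] if runtime == "runc" else [f"--runtime={runtime}"]


def registered_runtimes(info_json: str) -> set[str]:
    """Parse the output of `docker info --format '{{json .Runtimes}}'`."""
    # The JSON is an object keyed by runtime name.
    return set(json.loads(info_json))


def require_runsc() -> None:
    """Exit with an install pointer if Docker has no `runsc` runtime."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    if "runsc" not in registered_runtimes(out):
        raise SystemExit(
            "runsc is not registered with Docker; install gVisor and "
            "register it as a runtime before using runtime: runsc"
        )
```

Keeping the parsing in `registered_runtimes` separate from the subprocess
call means the check can be unit-tested without a Docker daemon.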

### What you get

- A second syscall boundary between the agent and the host kernel.
  Most container-escape CVEs (the Dirty Pipe or runc-escape class) stop
  at `runsc`.
- A performance cost that scales with syscall and filesystem activity.
  The rough estimate here is 2–10%, but syscall-heavy workloads can land
  well above that, so measure before relying on the figure. `npm install`
  will feel it; interactive `claude` typing will not.

### Caveats

- **macOS does not run `runsc` natively.** It needs a Linux kernel. On
  Mac, gVisor would run inside Docker Desktop's Linux VM, so the
  effective boundary becomes "agent ↔ runsc ↔ Docker Desktop's Linux VM
  ↔ hypervisor ↔ macOS". The hypervisor was already doing the heavy
  lifting; on Mac, runsc is mostly defense-in-depth. On a Linux host
  it's a real win.
- Some syscalls are unsupported (a small list: `io_uring` historically,
  some `ptrace` shapes). For Claude Code + git + npm I expect zero
  issues, but a smoke test (`claude --version && git status && npm
  install`) inside a container started with `--runtime=runsc` is worth
  it.
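
The smoke test above could be wired up along these lines. A minimal
sketch: the function names, the `/workspace` working directory, and the
image tag passed in are all assumptions, and the working directory has to
contain a git repository and a `package.json` for `git status` and
`npm install` to mean anything.

```python
import subprocess

# The three commands named in the caveat above, run in order.
SMOKE_STEPS = ["claude --version", "git status", "npm install"]


def smoke_test_argv(image: str, workdir: str = "/workspace") -> list[str]:
    """Build a `docker run` command that runs the smoke steps under runsc."""
    script = " && ".join(SMOKE_STEPS)
    return [
        "docker", "run", "--rm",
        "--runtime=runsc",
        "-w", workdir,
        image,
        "sh", "-c", script,
    ]


def run_smoke_test(image: str) -> bool:
    """Return True if every smoke step exits 0 inside the sandbox."""
    return subprocess.run(smoke_test_argv(image)).returncode == 0
```

Because the steps are chained with `&&`, the first failing command
determines the exit status, which is enough for a pass/fail check.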

### Effort

~1–2 days, plus a paragraph in the README. Cleanest first step.

## Rung 2: Kata Containers

Kata also registers as a Docker/containerd runtime
(`--runtime=kata-runtime`; the exact runtime name depends on how Kata is
registered with the daemon), but each container actually runs inside its
own lightweight VM. The VMM under the hood is configurable: Firecracker,
Cloud Hypervisor, or QEMU.

### What changes in this codebase

Essentially the same as the gVisor path: flip a runtime flag, add a
require-check. **Pipelock keeps working unchanged**, because Kata
virtualizes the network at the VM level but exposes it as a normal
Docker network.

### Tradeoffs vs. gVisor

- Stronger boundary (a real VM with its own guest kernel, not a
  userspace kernel running on the host).
- Slower cold start (hundreds of ms vs. tens). For interactive Claude
  this is fine; for ephemeral batch agents you would notice.
- Not natively supported on macOS at all: it needs a Linux host or a
  Linux VM you control. **This is the moment claude-bottle stops being
  "works on a Mac dev laptop with Docker Desktop."**
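
The cold-start figures above are worth measuring on the actual host
rather than taking on faith. A minimal timing sketch follows; the helper
names are hypothetical, and it measures the whole `docker run` round trip
(CLI, daemon, runtime, and process exit), not just sandbox or VM boot. It
assumes the image contains a `true` binary.

```python
import statistics
import subprocess
import time


def time_command(argv: list[str], runs: int = 5) -> float:
    """Median wall-clock seconds to run argv to completion."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def cold_start_argv(runtime: str, image: str) -> list[str]:
    """A throwaway container that starts, runs `true`, and exits."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, "true"]
```

Running `time_command(cold_start_argv(rt, image))` for each registered
runtime gives a like-for-like comparison on one machine; the median is
used so a single slow first pull or cache miss does not dominate.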

### When this is the right rung

If the deployment target is "agents run on a small Linux server I
administer," Kata is the sweet spot. If the target stays "users run this
on their MacBook," skip to the Apple Container option.

## Rung 3: Firecracker directly

Firecracker is a VMM, not a container runtime. Adopting it means
replacing Docker, not adding to it.

### What you would lose / have to rebuild

| Today | With Firecracker |
| --- | --- |
| `Dockerfile` → `node:22-slim` image | A rootfs (ext4 image) + a kernel (vmlinux) you build and pin |
| `docker run --network …` | TAP devices on the host, connected to a Linux bridge or routed manually |
| `docker exec -it` for the interactive TTY | vsock + a small in-guest agent, or SSH into the microVM |
| `docker cp` for skills + pipelock YAML | Bake into the rootfs or attach a read-only virtio-blk drive (Firecracker has not historically offered 9p or virtiofs shares; verify against the pinned release) |
| Pipelock as a sidecar on a `--internal` network | Pipelock as a separate microVM (or on the host) with a TAP-only path between the two; the agent VM gets no host route |
| `docker rm -f` on exit | A SIGTERM to firecracker + cleanup of TAPs, sockets, overlay disks |
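
To make the networking and cleanup rows concrete, here is a rough sketch
of the host-side plumbing expressed as `ip` (iproute2) commands. The
naming scheme and the `bridge_plan` helper are hypothetical. The bridge
is deliberately given no IP address, on the assumption that this keeps
the host from routing to it; firewall rules may still be wanted on top.

```python
def bridge_plan(agent_id: str) -> tuple[list[list[str]], list[list[str]]]:
    """Return (setup, teardown) command lists for one agent's network."""
    bridge = f"br-{agent_id}"
    # One TAP for the agent microVM, one for pipelock.
    taps = [f"tap-{agent_id}-a", f"tap-{agent_id}-p"]
    for name in [bridge, *taps]:
        # Linux interface names are limited to 15 characters.
        if len(name) > 15:
            raise ValueError(f"interface name too long: {name}")

    setup = [["ip", "link", "add", "name", bridge, "type", "bridge"]]
    for tap in taps:
        setup += [
            ["ip", "tuntap", "add", "dev", tap, "mode", "tap"],
            ["ip", "link", "set", "dev", tap, "master", bridge],
            ["ip", "link", "set", "dev", tap, "up"],
        ]
    setup.append(["ip", "link", "set", "dev", bridge, "up"])

    # Teardown mirrors setup: remove the TAPs first, then the bridge.
    teardown = [["ip", "tuntap", "del", "dev", tap, "mode", "tap"] for tap in taps]
    teardown.append(["ip", "link", "del", "dev", bridge])
    return setup, teardown
```

Returning the commands instead of executing them keeps the plan
inspectable and testable; a caller would run each list with
`subprocess.run(cmd, check=True)` as root, and run `teardown` in a
`finally` block to cover the cleanup that `docker rm -f` does today.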

### Files in this repo that would change

- `claude_bottle/docker.py` → replaced by a new `claude_bottle/firecracker.py`
  that issues `PUT` requests to the Firecracker API socket per microVM
  (`/boot-source`, `/drives`, `/network-interfaces`, `/actions`).
- `claude_bottle/network.py` → a host-side networking module that creates
  a Linux bridge per agent, two TAPs (agent-side, pipelock-side), and
  either iptables rules or no host route at all so the agent VM
  literally cannot reach anything except pipelock.
- `claude_bottle/pipelock.py` → instead of a sidecar container, run
  pipelock as its own microVM (or on the host pinned to the bridge).
  The hostname-allowlist semantics carry over; the implementation is
  different.
- `claude_bottle/skills.py`, `ssh.py` → can no longer use `docker cp`.
  Bake skills into the rootfs at build time, or attach them as a
  read-only virtio-blk drive. A virtiofs share would be more convenient,
  but Firecracker has not historically supported one, so verify before
  planning around it.
- `Dockerfile` → replaced by a rootfs builder. Realistically this means
  using something like `firecracker-containerd` or building the rootfs
  with `debootstrap` / `mkosi` and a kernel from upstream.
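
As a rough illustration of what the hypothetical
`claude_bottle/firecracker.py` would do, the sketch below builds the
request sequence for one microVM and sends it over the API's Unix
socket. The helper names, the `rootfs` and `eth0` identifiers, and the
default boot arguments are assumptions; the field names follow the
Firecracker API as I understand it and should be checked against the
pinned release.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, which is how Firecracker listens."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def boot_requests(kernel: str, rootfs: str, tap: str,
                  boot_args: str = "console=ttyS0 reboot=k panic=1 pci=off"):
    """Ordered (path, body) pairs: configure the VM, then start it."""
    return [
        ("/boot-source", {"kernel_image_path": kernel, "boot_args": boot_args}),
        ("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                            "is_root_device": True, "is_read_only": False}),
        ("/network-interfaces/eth0", {"iface_id": "eth0", "host_dev_name": tap}),
        ("/actions", {"action_type": "InstanceStart"}),
    ]


def start_microvm(socket_path: str, requests) -> None:
    """PUT each request in order, failing fast on any error response."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for path, body in requests:
            conn.request("PUT", path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            resp = conn.getresponse()
            payload = resp.read()  # drain before reusing the connection
            if resp.status >= 300:
                raise RuntimeError(f"PUT {path} failed: {resp.status} {payload!r}")
    finally:
        conn.close()
```

The `InstanceStart` action has to come last, since the boot source,
drives, and network interfaces are pre-boot configuration.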

### What you would gain

- A real KVM boundary. The strongest isolation realistically achievable
  on commodity hardware.
- Sub-second cold starts (Firecracker boots in ~125 ms; rootfs prep
  dominates).

### What you would give up

- macOS support. Firecracker is KVM-only. The only way back to Mac is to
  nest a Linux VM hosting Firecracker, at which point the security
  argument gets thin again.
- Ecosystem ergonomics. No `docker logs`, no `docker exec`, no `docker
  network inspect`. You write all of that yourself or adopt
  `firecracker-containerd` or Ignite (which is unmaintained; verify
  before committing).

### Effort

Realistically 2–4 weeks of focused work on the runtime layer. Forces
dropping "v1 works on Mac" as a goal. PRD-worthy, not a side quest.

## Rung 3.5: Apple Container (macOS-native VM-per-container)

Apple Container is Apple's `container` CLI, native on Apple Silicon.
Each container runs in its own Virtualization.framework VM. It is the
macOS-native answer to "I want Kata's isolation model on my MacBook."

### Why it matters here

The CLI surface mirrors Docker closely (`container run`, `container
network create`, etc.), so the launcher rewrite is far smaller than
Firecracker's. On Linux hosts you would still take the gVisor or Kata
path. The result is:

- macOS: Apple Container (per-container VM via Virtualization.framework),
- Linux: gVisor or Kata,
- one Python launcher that switches on host OS.
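
The host-OS switch in the last bullet might look something like this. A
minimal sketch: `pick_backend` and the returned dict shape are
hypothetical, and falling back to Docker with `runc` on Intel Macs is an
assumption of this example rather than anything the note decides.

```python
import platform


def pick_backend(system: str, machine: str, linux_runtime: str = "runsc") -> dict:
    """Choose the container CLI and runtime for a given host."""
    if system == "Darwin":
        if machine == "arm64":
            # Apple Silicon: one VM per container via Apple's `container` CLI.
            return {"cli": "container", "runtime": None}
        # Assumption: Intel Macs stay on Docker Desktop with the default runtime.
        return {"cli": "docker", "runtime": "runc"}
    if system == "Linux":
        # gVisor by default; pass a Kata runtime name to use Kata instead.
        return {"cli": "docker", "runtime": linux_runtime}
    raise RuntimeError(f"unsupported host OS: {system}")


def detect_backend() -> dict:
    """Resolve the backend for the machine this launcher is running on."""
    return pick_backend(platform.system(), platform.machine())
```

Passing `system` and `machine` in as arguments keeps the decision table
testable on any host; only `detect_backend` touches the real platform.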

### Open questions before committing

- Does Apple Container support a `--internal`-equivalent network with
  no default gateway, so the pipelock topology is reproducible?
- Image format: Apple Container uses OCI images, so the existing
  `Dockerfile` should be reusable, but this needs verification.
- `exec`-equivalent semantics: the launcher relies on `docker exec` to
  attach a TTY after the container is up. Confirm `container exec`
  behaves equivalently for interactive use.

A short spike (~1 day) answering those three questions would unblock a
PRD-level decision.

## Recommendation

If this were my project today, given the README still names macOS as in
scope and the manifest example carries `/Users/didericis` paths:

1. **Today.** Add `bottles.<name>.runtime` with `runc` / `runsc` options.
   Land it as a one-day PR. README gets a small "Linux hosts can opt
   into gVisor for stronger isolation" note. Mac users get nothing new
   but lose nothing.
2. **If VM-grade isolation on macOS becomes the goal.** Skip Firecracker
   and look at Apple Container. Smaller launcher rewrite than
   Firecracker; Linux stays on the gVisor / Kata path. Probably the
   right v2.
3. **Firecracker only if** claude-bottle's deployment target settles on
   self-hosted Linux, not laptops, at which point the "non-goal:
   self-hosted VMs" line in `CLAUDE.md` flips and the project's
   identity changes.

The pipelock egress design ports across all of these, so none of this
work threatens the existing security-load-bearing piece of v1.

## Caveats

- gVisor's unsupported-syscall list shifts release-to-release; verify
  against the version pinned in any future image.
- Kata's default VMM is configurable; performance and CVE surface vary
  by VMM choice.
- Firecracker tooling has churned (Ignite is effectively unmaintained;
  `firecracker-containerd` is the active path). Re-survey before
  committing.
- Apple Container is young; behavior around `--internal`-style networks
  and `exec` semantics needs to be verified directly, not assumed.
- Research conducted 2026-05-10.