diff --git a/docs/research/stronger-isolation-alternatives.md b/docs/research/stronger-isolation-alternatives.md
new file mode 100644
index 0000000..aabb09e
--- /dev/null
+++ b/docs/research/stronger-isolation-alternatives.md
@@ -0,0 +1,243 @@
+# Stronger isolation alternatives: gVisor, Kata, Firecracker, Apple Container
+
+Research into what it would take to replace or augment Docker (with `runc`)
+as the agent runtime in claude-bottle, and what each option would actually
+buy in security terms versus what it would cost in launcher rewrite.
+
+## Summary
+
+There is a ladder, not a menu. Three realistic rungs, ordered by effort:
+
+1. **gVisor (`runsc`)**: flip a runtime flag per bottle. ~1–2 days. Adds a
+   userspace syscall boundary; blocks most kernel-CVE escape classes.
+2. **Kata Containers**: flip a runtime flag per bottle. Same Docker UX,
+   real microVM underneath. Linux-host only.
+3. **Firecracker direct**: replace Docker as the runtime entirely. Weeks
+   of work. Strongest boundary, no macOS support.
+
+A fourth option, **Apple Container**, is the right macOS-native answer to
+"I want Kata's isolation model without giving up MacBooks as the dev
+target." Probably the right v2 if claude-bottle keeps macOS in scope.
+
+The pipelock egress design should port across all four: each option offers
+(or, for Apple Container, is expected to offer) a network primitive meaning
+"no default route except through the proxy" (Docker `--internal`, Kata's
+virtualized bridge, TAP-only Firecracker, Apple Container's per-VM networking).
+Whichever rung is chosen, the security-load-bearing part of the v1 design survives.
+
+## Threat model recap
+
+The current v1 boundary is a single `node:22-slim` container running as
+uid 1000 under `runc`, sharing a kernel with the host. This protects
+against:
+
+- accidental host-filesystem access by Claude Code,
+- network egress not approved by the pipelock allowlist,
+- a misbehaving but uncoordinated agent.
+
+It does not protect against:
+
+- a kernel-level container escape (Dirty Pipe / runc CVE class),
+- a coordinated attacker with code execution inside the container who
+  targets the host kernel,
+- side channels accessible from the shared kernel.
+
+Stronger isolation closes the second list. Whether that's worth the
+effort depends on whether you trust the agent's code-execution surface
+more or less than you trust the host kernel.
+
+## Rung 1: gVisor (`runsc`)
+
+gVisor is a userspace kernel that registers as a Docker runtime. The
+agent's syscalls are intercepted and re-implemented in Go rather than
+forwarded directly to the host kernel.
+
+### What changes in this codebase
+
+- `claude_bottle/cli/start.py` (where `docker run` is assembled): add
+  `--runtime=runsc` to the container args when the bottle requests it.
+  Make it configurable: `bottles..runtime: "runsc" | "runc"`,
+  default `runc`.
+- `claude_bottle/docker.py`: add a `require_runsc()` check that runs
+  `docker info --format '{{.Runtimes}}'` once and dies with an install
+  pointer if `runsc` isn't registered.
+- `network.py`, `pipelock.py`, `skills.py`, `ssh.py`: **no changes**.
+  Docker networks, `docker exec`, `docker cp`, volume mounts, and the
+  pipelock sidecar all still work because gVisor is invisible
+  at the Docker API layer.
+
+### What you get
+
+- A second syscall boundary between the agent and the host kernel.
+  Most container-escape CVEs (Dirty Pipe / runc-escape class) stop at
+  `runsc`.
+- Roughly 2–10% perf hit on syscall-heavy workloads. `npm install` will
+  feel it; interactive `claude` typing will not.
+
+### Caveats
+
+- **macOS does not run `runsc` natively.** It needs a Linux kernel. On
+  Mac, gVisor would run inside Docker Desktop's Linux VM, so the
+  effective boundary becomes "agent ↔ runsc ↔ Docker Desktop's Linux VM
+  ↔ hypervisor ↔ macOS". The hypervisor was already doing the heavy
+  lifting; on Mac, runsc is mostly defense-in-depth. On a Linux host
+  it's a real win.
+- Some syscalls are unsupported (a small list: `io_uring` historically,
+  some `ptrace` shapes). For Claude Code + git + npm I expect zero
+  issues, but a smoke test (`claude --version && git status && npm
+  install`) in a container started under `runsc` is worth running.
+
+### Effort
+
+~1–2 days, plus a paragraph in the README. Cleanest first step.
+
+## Rung 2: Kata Containers
+
+Kata also registers as a Docker/containerd runtime
+(`--runtime=kata-runtime`), but each container actually runs inside its
+own lightweight VM. The VMM under the hood is configurable: Firecracker,
+Cloud Hypervisor, or QEMU.
+
+### What changes in this codebase
+
+Essentially the same as the gVisor path: flip a runtime flag, add a
+require-check. **Pipelock keeps working unchanged**, because Kata
+virtualizes the network at the VM level but exposes it as a normal
+Docker network.
+
+### Tradeoffs vs. gVisor
+
+- Stronger boundary (a real VM with its own guest kernel, not a userspace kernel).
+- Slower cold start (hundreds of ms vs. tens). For interactive Claude
+  this is fine; for ephemeral batch agents you would notice.
+- Not natively supported on macOS at all: it needs a Linux host or a Linux
+  VM you control. **This is the moment claude-bottle stops being "works
+  on a Mac dev laptop with Docker Desktop."**
+
+### When this is the right rung
+
+If the deployment target is "agents run on a small Linux server I
+administer," Kata is the sweet spot. If the target stays "users run this
+on their MacBook," skip to the Apple Container option.
+
+## Rung 3: Firecracker directly
+
+Firecracker is a VMM, not a container runtime. Adopting it means
+replacing Docker, not adding to it.
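To make "replacing Docker" concrete: instead of assembling a `docker run` command, the launcher would configure each microVM over Firecracker's HTTP API, which listens on a Unix socket. The following is a minimal sketch, not the project's implementation. The names `UnixHTTPConnection`, `boot_requests`, and `boot_microvm`, the drive ID `rootfs`, the interface ID `eth0`, and the boot arguments are all illustrative assumptions. The endpoint paths and JSON fields follow the Firecracker API as I understand it and should be checked against whichever Firecracker version gets pinned.

```python
import http.client
import json
import socket
from typing import Dict, List, Tuple


class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that talks over a Unix domain socket."""

    def __init__(self, socket_path: str) -> None:
        super().__init__("localhost")  # host is only used for the Host header
        self._socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self._socket_path)
        self.sock = sock


def boot_requests(kernel: str, rootfs: str, tap: str) -> List[Tuple[str, Dict]]:
    """Return the (path, JSON body) pairs needed to boot one microVM, in order."""
    return [
        ("/boot-source", {
            "kernel_image_path": kernel,
            "boot_args": "console=ttyS0 reboot=k panic=1",  # assumed defaults
        }),
        ("/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs,
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("/network-interfaces/eth0", {
            "iface_id": "eth0",
            "host_dev_name": tap,  # TAP device created beforehand on the host
        }),
        ("/actions", {"action_type": "InstanceStart"}),
    ]


def boot_microvm(socket_path: str, requests: List[Tuple[str, Dict]]) -> None:
    """PUT each request to the API socket, stopping on the first error."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for path, body in requests:
            conn.request("PUT", path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            response = conn.getresponse()
            detail = response.read().decode()
            if response.status >= 300:
                raise RuntimeError(f"PUT {path} failed: {response.status} {detail}")
    finally:
        conn.close()
```

Keeping `boot_requests` pure makes the request plan unit-testable without a running Firecracker process; only `boot_microvm` touches the socket. A call might look like `boot_microvm("/tmp/firecracker.socket", boot_requests("vmlinux", "rootfs.ext4", "tap0"))`, where all three paths and the TAP name are placeholders.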
+
+### What you would lose / have to rebuild
+
+| Today | With Firecracker |
+| --- | --- |
+| `Dockerfile` → `node:22-slim` image | A rootfs (ext4 image) + a kernel (vmlinux) you build and pin |
+| `docker run --network …` | TAP devices on the host, connected to a Linux bridge or routed manually |
+| `docker exec -it` for the interactive TTY | vsock + a small in-guest agent, or SSH into the microVM |
+| `docker cp` for skills + pipelock YAML | Bake into the rootfs or attach a read-only virtio-blk drive (Firecracker exposes no 9p / virtiofs device) |
+| Pipelock as a sidecar on a `--internal` network | Pipelock as a separate microVM (or on the host) with a TAP-only path between the two; the agent VM gets no host route |
+| `docker rm -f` on exit | A SIGTERM to firecracker + cleanup of TAPs, sockets, overlay disks |
+
+### Files in this repo that would change
+
+- `claude_bottle/docker.py` → replaced by a new `claude_bottle/firecracker.py`
+  that sends PUT requests to the Firecracker API socket per microVM
+  (`/boot-source`, `/drives`, `/network-interfaces`, `/actions`).
+- `claude_bottle/network.py` → a host-side networking module that creates
+  a Linux bridge per agent, two TAPs (agent-side, pipelock-side), and
+  either iptables rules or no host route at all so the agent VM
+  literally cannot reach anything except pipelock.
+- `claude_bottle/pipelock.py` → instead of a sidecar container, run
+  pipelock as its own microVM (or on the host pinned to the bridge).
+  The hostname-allowlist semantics carry over; the implementation is
+  different.
+- `claude_bottle/skills.py`, `ssh.py` → can no longer use `docker cp`.
+  Bake skills into the rootfs at build time, or attach them as a
+  read-only virtio-blk drive (Firecracker has no virtiofs device).
+- `Dockerfile` → replaced by a rootfs builder. Realistically this means
+  using something like `firecracker-containerd` or building the rootfs
+  with `debootstrap` / `mkosi` and a kernel from upstream.
+
+### What you would gain
+
+- A real KVM boundary.
The strongest isolation realistically achievable
+  on commodity hardware.
+- Sub-second cold starts (Firecracker boots in ~125 ms; rootfs prep
+  dominates).
+
+### What you would give up
+
+- macOS support. Firecracker is KVM-only. The only way back to Mac is to
+  nest a Linux VM hosting Firecracker, at which point the security
+  argument gets thin again.
+- Ecosystem ergonomics. No `docker logs`, no `docker exec`, no `docker
+  network inspect`. You write all of that yourself or adopt
+  `firecracker-containerd` or Ignite (which is unmaintained; verify
+  before committing).
+
+### Effort
+
+Realistically 2–4 weeks of focused work on the runtime layer. Forces
+dropping "v1 works on Mac" as a goal. PRD-worthy, not a side quest.
+
+## Rung 3.5: Apple Container (macOS-native VM-per-container)
+
+Apple Container is Apple's `container` CLI, native on Apple Silicon.
+Each container runs in its own Virtualization.framework VM. It is the
+macOS-native answer to "I want Kata's isolation model on my MacBook."
+
+### Why it matters here
+
+The CLI surface mirrors Docker closely (`container run`, `container
+network create`, etc.), so the launcher rewrite is far smaller than
+Firecracker's. On Linux hosts you would still take the gVisor or Kata
+path. The result is:
+
+- macOS: Apple Container (per-container VM via Virtualization.framework),
+- Linux: gVisor or Kata,
+- one Python launcher that switches on host OS.
+
+### Open questions before committing
+
+- Does Apple Container support a `--internal`-equivalent network with
+  no default gateway, so the pipelock topology is reproducible?
+- Image format: Apple Container uses OCI images, so the existing
+  `Dockerfile` should be reusable, but this needs verification.
+- `exec`-equivalent semantics: the launcher relies on `docker exec` to
+  attach a TTY after the container is up. Confirm `container exec`
+  behaves equivalently for interactive use.
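The "one Python launcher that switches on host OS" idea from "Why it matters here" could be sketched roughly as below. This assumes the proposed v2 split (Apple Container on macOS, Docker with an optional runtime flag on Linux) rather than describing the current code. `Backend` and `pick_backend` are hypothetical names, and no Apple Container flags are assumed beyond the `container` CLI name itself.

```python
import platform
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Backend:
    """Which CLI to invoke and which extra args to add to its `run` command."""
    cli: str
    extra_run_args: Tuple[str, ...] = ()


def pick_backend(runtime: str = "runc", host_os: Optional[str] = None) -> Backend:
    """Choose a backend from the host OS and the bottle's requested runtime."""
    host_os = host_os or platform.system()
    if host_os == "Darwin":
        # macOS: Apple's `container` CLI already runs one VM per container,
        # so the Docker runtime setting does not apply here.
        return Backend(cli="container")
    if host_os == "Linux":
        if runtime == "runc":
            return Backend(cli="docker")  # Docker's default runtime, no flag
        if runtime in ("runsc", "kata-runtime"):
            return Backend(cli="docker", extra_run_args=(f"--runtime={runtime}",))
        raise ValueError(f"unknown runtime: {runtime!r}")
    raise RuntimeError(f"unsupported host OS: {host_os!r}")
```

Passing `host_os` explicitly keeps the function testable on any machine; for example, `pick_backend("runsc", host_os="Linux")` yields a Docker backend carrying `--runtime=runsc`.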
+
+A short spike (~1 day) answering the three open questions above would
+unblock a PRD-level decision.
+
+## Recommendation
+
+If this were my project today, given the README still names macOS as in
+scope and the manifest example carries `/Users/didericis` paths:
+
+1. **Today.** Add `bottles..runtime` with `runc` / `runsc` options.
+   Land it as a small (~1–2 day) PR. README gets a small "Linux hosts can
+   opt into gVisor for stronger isolation" note. Mac users get nothing new
+   but lose nothing.
+2. **If VM-grade isolation on macOS becomes the goal.** Skip Firecracker
+   and look at Apple Container. Smaller launcher rewrite than
+   Firecracker; Linux stays on the gVisor / Kata path. Probably the
+   right v2.
+3. **Firecracker only if** claude-bottle's deployment target settles on
+   self-hosted Linux, not laptops, at which point the "non-goal:
+   self-hosted VMs" line in `CLAUDE.md` flips and the project's
+   identity changes.
+
+The pipelock egress design should port across all of these (Apple Container
+pending the spike), so none of this work threatens the existing
+security-load-bearing piece of v1.
+
+## Caveats
+
+- gVisor's unsupported-syscall list shifts release-to-release; verify
+  against the version pinned in any future image.
+- Kata's default VMM is configurable; performance and CVE surface vary
+  by VMM choice.
+- Firecracker tooling has churned (Ignite is effectively unmaintained;
+  `firecracker-containerd` is the active path). Re-survey before
+  committing.
+- Apple Container is young; behavior around `--internal`-style networks
+  and `exec` semantics needs to be verified directly, not assumed.
+- Research conducted 2026-05-10.