# Stronger isolation alternatives: gVisor, Kata, Firecracker, Apple Container
Research into what it would take to replace or augment Docker (with `runc`)
as the agent runtime in bot-bottle, and what each option would actually
buy in security terms vs. cost in launcher rewrite.
## Summary
There is a ladder, not a menu. Three realistic rungs, ordered by effort:
1. **gVisor (`runsc`)**: flip a runtime flag per bottle. ~1-2 days. Adds a
   userspace syscall boundary; blocks most kernel-CVE escape classes.
2. **Kata Containers**: flip a runtime flag per bottle. Same Docker UX,
   real microVM underneath. Linux-host only.
3. **Firecracker direct**: replace Docker as the runtime entirely. Weeks
   of work. Strongest boundary, no macOS support.

A fourth option, **Apple Container**, is the macOS-native answer to
"I want Kata's isolation model without giving up MacBooks as the dev
target." It is probably the right v2 if bot-bottle keeps macOS in scope.
The pipelock egress design is portable across all four: every option can
provide a network primitive that means "no default route except through
the proxy" (Docker `--internal`, Kata's virtualized bridge, TAP-only
Firecracker, Apple Container's per-VM networking). Whichever rung is
chosen, the security-load-bearing part of the v1 design survives.
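For the Docker case, the "no default route except through the proxy" primitive can be sketched as the commands a launcher would issue. This is a minimal illustration, not the project's actual `network.py` or `pipelock.py`: the function name, network and container naming scheme, and image names are all hypothetical. Only `docker network create --internal`, `docker run --network`, and `docker network connect` are taken from the Docker CLI itself.

```python
from typing import List


def pipelock_topology_commands(
    bottle: str,
    agent_image: str = "bot-bottle-agent",        # hypothetical image name
    pipelock_image: str = "bot-bottle-pipelock",  # hypothetical image name
) -> List[List[str]]:
    """Return docker argv lists for an egress-through-proxy topology.

    The agent joins only an --internal network (no default gateway). The
    pipelock sidecar is attached to that network and to Docker's default
    bridge, so it is the only container with a route out.
    """
    internal_net = f"{bottle}-internal"
    proxy = f"{bottle}-pipelock"
    agent = f"{bottle}-agent"
    return [
        ["docker", "network", "create", "--internal", internal_net],
        ["docker", "run", "-d", "--name", proxy, "--network", internal_net, pipelock_image],
        ["docker", "network", "connect", "bridge", proxy],
        ["docker", "run", "-d", "--name", agent, "--network", internal_net, agent_image],
    ]


if __name__ == "__main__":
    for argv in pipelock_topology_commands("demo"):
        print(" ".join(argv))
```

The other three options would need their own equivalent of the first and third commands; the shape (agent on an isolated segment, proxy straddling both) is the part that carries over.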
## Threat model recap
The current v1 boundary is a single `node:22-slim` container running as
uid 1000 under `runc`, sharing a kernel with the host. This protects
against:
- accidental host-filesystem access by Claude Code,
- network egress not approved by the pipelock allowlist,
- a misbehaving but uncoordinated agent.

It does not protect against:
- a kernel-level container escape (Dirty Pipe / runc CVE class),
- a coordinated attacker with code execution inside the container who
targets the host kernel,
- side channels accessible from the shared kernel.

Stronger isolation closes the gaps in the second list. Whether that's
worth the effort depends on whether you trust the agent's code-execution
surface more or less than you trust the host kernel.
## Rung 1: gVisor (`runsc`)
gVisor is a userspace kernel that registers as a Docker runtime. The
agent's syscalls are intercepted and re-implemented in Go rather than
forwarded to the host kernel.
### What changes in this codebase
- `bot_bottle/cli/start.py` (where `docker run` is assembled): add
`--runtime=runsc` to the container args when the bottle requests it.
Make it configurable: `bottles.<name>.runtime: "runsc" | "runc"`,
default `runc`.
- `bot_bottle/docker.py`: add a `require_runsc()` check that runs
`docker info --format '{{.Runtimes}}'` once and dies with an install
pointer if `runsc` isn't registered.
- `network.py`, `pipelock.py`, `skills.py`, `ssh.py`: **no changes**.
Docker networks, `docker exec`, `docker cp`, volume mounts, the
pipelock sidecar — all of it still works because gVisor is invisible
at the Docker API layer.
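The two launcher changes above could look roughly like the sketch below. It is an illustration under stated assumptions, not the existing code in `start.py` or `docker.py`: `resolve_runtime`, `runtime_args`, and `registered_runtimes` are hypothetical helper names, the bottle config is assumed to arrive as a plain dict, and the check uses `{{json .Runtimes}}` rather than `{{.Runtimes}}` so the output can be parsed as JSON instead of a Go map dump.

```python
import json
import subprocess
from typing import Dict, List

ALLOWED_RUNTIMES = ("runc", "runsc")  # values of the proposed manifest key


def resolve_runtime(bottle_cfg: Dict[str, str]) -> str:
    """Read bottles.<name>.runtime from a bottle's config, defaulting to runc."""
    runtime = bottle_cfg.get("runtime", "runc")
    if runtime not in ALLOWED_RUNTIMES:
        raise ValueError(f"unsupported runtime {runtime!r}; expected one of {ALLOWED_RUNTIMES}")
    return runtime


def runtime_args(runtime: str) -> List[str]:
    """Extra `docker run` arguments for the chosen runtime (none for runc)."""
    return [] if runtime == "runc" else [f"--runtime={runtime}"]


def registered_runtimes(info_json: str) -> List[str]:
    """Parse the output of `docker info --format '{{json .Runtimes}}'`."""
    return sorted(json.loads(info_json))


def require_runsc() -> None:
    """Exit with an install pointer if Docker has no runsc runtime registered."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    if "runsc" not in registered_runtimes(out):
        raise SystemExit("runsc is not registered with Docker; see the gVisor install guide")
```

Keeping the parsing separate from the `subprocess` call means the check can be unit-tested against captured `docker info` output without a Docker daemon.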
### What you get
- A second syscall boundary between the agent and the host kernel.
Most container-escape CVEs (Dirty Pipe / runc-escape class) stop at
`runsc`.
- Roughly 2-10% perf hit on syscall-heavy workloads. `npm install` will
  feel it; interactive `claude` typing will not.
### Caveats
- **macOS does not run `runsc` natively.** It needs a Linux kernel. On
Mac, gVisor would run inside Docker Desktop's Linux VM, so the
effective boundary becomes "agent ↔ runsc ↔ Docker Desktop's Linux VM
↔ hypervisor ↔ macOS". The hypervisor was already doing the heavy
lifting; on Mac, runsc is mostly defense-in-depth. On a Linux host
it's a real win.
- Some syscalls are unsupported (a small list — `io_uring` historically,
some `ptrace` shapes). For Claude Code + git + npm I expect zero
issues, but a smoke test (`claude --version && git status && npm
install`) inside the runsc image is worth it.
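The smoke test mentioned above could be wrapped as a one-shot `docker run`. A minimal sketch, assuming a hypothetical image name and helper name; the three steps are the ones listed in the caveat, and `git status` and `npm install` will only succeed if the image's working directory holds a git repo and a `package.json`.

```python
import shlex
from typing import List

# Steps taken from the smoke test described in the caveat above.
SMOKE_STEPS = ["claude --version", "git status", "npm install"]


def smoke_test_argv(image: str, runtime: str = "runsc") -> List[str]:
    """Build a throwaway `docker run` that chains the smoke-test steps."""
    script = " && ".join(SMOKE_STEPS)
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, "sh", "-c", script]


if __name__ == "__main__":
    # "bot-bottle-agent" is a placeholder image name, not one from the repo.
    print(shlex.join(smoke_test_argv("bot-bottle-agent")))
```

Running the same command with `runtime="runc"` gives a baseline, so a failure under `runsc` can be attributed to gVisor rather than to the image.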
### Effort
~1-2 days, plus a paragraph in the README. Cleanest first step.
## Rung 2: Kata Containers
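Kata also registers as a Docker/containerd runtime
(`--runtime=kata-runtime`; the registered name varies by Kata release, and
newer releases ship a containerd shim such as `io.containerd.kata.v2`, so
check the pinned version), but each container actually runs inside its
own lightweight VM. The VMM under the hood is configurable: Firecracker,
Cloud Hypervisor, or QEMU.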
### What changes in this codebase
Essentially the same as the gVisor path: flip a runtime flag, add a
require-check. **Pipelock keeps working unchanged**, because Kata
virtualizes the network at the VM level but exposes it as a normal
Docker network.
### Tradeoffs vs. gVisor
- Stronger boundary (a hardware-virtualized VM with its own guest kernel,
  not a userspace kernel in front of the shared host kernel).
- Slower cold start (hundreds of ms vs. tens). For interactive Claude
this is fine; for ephemeral batch agents you would notice.
- Not natively supported on macOS at all — needs a Linux host or a Linux
VM you control. **This is the moment bot-bottle stops being "works
on a Mac dev laptop with Docker Desktop."**
### When this is the right rung
If the deployment target is "agents run on a small Linux server I
administer," Kata is the sweet spot. If the target stays "users run this
on their MacBook," skip to the Apple Container option.
## Rung 3: Firecracker directly
Firecracker is a VMM, not a container runtime. Adopting it means
replacing Docker, not adding to it.
### What you would lose / have to rebuild
| Today | With Firecracker |
| --- | --- |
| `Dockerfile` → `node:22-slim` image | A rootfs (ext4 image) + a kernel (vmlinux) you build and pin |
| `docker run --network …` | TAP devices on the host, connected to a Linux bridge or routed manually |
| `docker exec -it` for the interactive TTY | vsock + a small in-guest agent, or SSH into the microVM |
| `docker cp` for skills + pipelock YAML | Bake into the rootfs, mount a virtio-blk overlay, or 9p / virtiofs share |
| Pipelock as a sidecar on a `--internal` network | Pipelock as a separate microVM (or on the host) with a TAP-only path between the two; the agent VM gets no host route |
| `docker rm -f` on exit | A SIGTERM to firecracker + cleanup of TAPs, sockets, overlay disks |
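The networking row is the one with the most host-side plumbing. A rough sketch of the `ip` commands a launcher might issue for one agent follows; the function name and interface naming scheme are hypothetical, and the commands would need root (or `CAP_NET_ADMIN`) to actually run. No IP address is assigned to the bridge, which is one way to leave the agent VM without a route to the host.

```python
from typing import List

IFNAME_MAX = 15  # Linux interface names are limited to 15 characters


def agent_network_commands(agent_id: str) -> List[List[str]]:
    """Return `ip` argv lists for a per-agent bridge with two TAP devices."""
    bridge = f"bb-br-{agent_id}"
    taps = [f"bb-a-{agent_id}", f"bb-p-{agent_id}"]  # agent-side, pipelock-side
    for name in [bridge, *taps]:
        if len(name) > IFNAME_MAX:
            raise ValueError(f"interface name too long: {name}")
    cmds = [["ip", "link", "add", "name", bridge, "type", "bridge"]]
    for tap in taps:
        cmds.append(["ip", "tuntap", "add", "dev", tap, "mode", "tap"])
        cmds.append(["ip", "link", "set", "dev", tap, "master", bridge])
        cmds.append(["ip", "link", "set", "dev", tap, "up"])
    cmds.append(["ip", "link", "set", "dev", bridge, "up"])
    return cmds


if __name__ == "__main__":
    for argv in agent_network_commands("a1"):
        print(" ".join(argv))
```

Teardown is the same list in reverse with `ip link del`, which is the cleanup work the last table row refers to.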
### Files in this repo that would change
- `bot_bottle/docker.py` → replaced by a new `bot_bottle/firecracker.py`
  that sends HTTP requests (PUT, for these endpoints) to the Firecracker
  API socket per microVM (`/boot-source`, `/drives`,
  `/network-interfaces`, `/actions`).
- `bot_bottle/network.py` → a host-side networking module that creates
a Linux bridge per agent, two TAPs (agent-side, pipelock-side), and
either iptables rules or no host route at all so the agent VM
literally cannot reach anything except pipelock.
- `bot_bottle/pipelock.py` → instead of a sidecar container, run
pipelock as its own microVM (or on the host pinned to the bridge).
The hostname-allowlist semantics carry over; the implementation is
different.
- `bot_bottle/skills.py`, `ssh.py` → can no longer use `docker cp`.
Bake skills into the rootfs at build time, or mount a virtiofs share
read-only.
- `Dockerfile` → replaced by a rootfs builder. Realistically this means
using something like `firecracker-containerd` or building the rootfs
with `debootstrap` / `mkosi` and a kernel from upstream.
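To give a sense of what a `firecracker.py` replacement for `docker.py` would involve, here is a sketch of the boot sequence against the Firecracker API socket using only the standard library. It is not a tested launcher: the helper names, drive and interface IDs, and boot arguments are illustrative assumptions, and the request bodies follow the Firecracker API fields as I understand them (`kernel_image_path`, `path_on_host`, `host_dev_name`, `action_type`), which should be checked against the pinned Firecracker version.

```python
import http.client
import json
import socket
from typing import Any, Dict, List, Tuple

Request = Tuple[str, str, Dict[str, Any]]  # (method, path, JSON body)


class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that talks to a Unix domain socket."""

    def __init__(self, socket_path: str) -> None:
        super().__init__("localhost")
        self._socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self._socket_path)
        self.sock = sock


def boot_requests(kernel: str, rootfs: str, tap: str) -> List[Request]:
    """Ordered API calls to configure and start one microVM."""
    return [
        ("PUT", "/boot-source", {"kernel_image_path": kernel,
                                 "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("PUT", "/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                                   "is_root_device": True, "is_read_only": False}),
        ("PUT", "/network-interfaces/eth0", {"iface_id": "eth0", "host_dev_name": tap}),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


def boot_microvm(socket_path: str, requests: List[Request]) -> None:
    """Send each request in order; raise on the first non-2xx response."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for method, path, body in requests:
            conn.request(method, path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            resp = conn.getresponse()
            payload = resp.read()
            if resp.status >= 300:
                raise RuntimeError(f"{method} {path} failed: {resp.status} {payload!r}")
    finally:
        conn.close()
```

Compared with a single `docker run`, every line here is state the launcher now owns: the kernel and rootfs paths, the TAP name, and the socket lifecycle.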
### What you would gain
- A real KVM boundary. The strongest isolation realistically achievable
on commodity hardware.
- Sub-second cold starts (Firecracker boots in ~125 ms; rootfs prep
dominates).
### What you would give up
- macOS support. Firecracker is KVM-only. The only way back to Mac is to
nest a Linux VM hosting Firecracker, at which point the security
argument gets thin again.
- Ecosystem ergonomics. No `docker logs`, no `docker exec`, no `docker
network inspect`. You write all of that yourself or adopt
`firecracker-containerd` or Ignite (which is unmaintained — verify
before committing).
### Effort
Realistically 2-4 weeks of focused work on the runtime layer. Forces
dropping "v1 works on Mac" as a goal. PRD-worthy, not a side quest.
## Rung 3.5: Apple Container (macOS-native VM-per-container)
Apple Container is Apple's `container` CLI, native on Apple Silicon.
Each container runs in its own Virtualization.framework VM. It is the
macOS-native answer to "I want Kata's isolation model on my MacBook."
### Why it matters here
The CLI surface mirrors Docker closely (`container run`, `container
network create`, etc.), so the launcher rewrite is far smaller than
Firecracker's. On Linux hosts you would still take the gVisor or Kata
path. The result is:
- macOS: Apple Container (per-container VM via Virtualization.framework),
- Linux: gVisor or Kata,
- one Python launcher that switches on host OS.
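The "switches on host OS" part is small. A minimal sketch, assuming a hypothetical `pick_backend` helper and made-up backend labels; nothing here is taken from the existing launcher.

```python
import platform
from typing import Optional

LINUX_RUNTIMES = ("runsc", "kata")  # the two Linux rungs discussed above


def pick_backend(system: Optional[str] = None, linux_runtime: str = "runsc") -> str:
    """Choose an isolation backend label from the host OS.

    `system` defaults to platform.system(), which reports "Darwin" on macOS
    and "Linux" on Linux. The returned labels are illustrative only.
    """
    system = system or platform.system()
    if system == "Darwin":
        return "apple-container"
    if system == "Linux":
        if linux_runtime not in LINUX_RUNTIMES:
            raise ValueError(f"unsupported Linux runtime: {linux_runtime!r}")
        return f"docker+{linux_runtime}"
    raise RuntimeError(f"unsupported host OS: {system}")


if __name__ == "__main__":
    print(pick_backend())
```

The harder part is what sits behind each label: the Docker path and the `container` path would each need their own run, network, and exec wrappers, which is where the open questions below come in.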
### Open questions before committing
- Does Apple Container support a `--internal`-equivalent network with
no default gateway, so the pipelock topology is reproducible?
- Image format: Apple Container uses OCI images, so the existing
`Dockerfile` should be reusable, but this needs verification.
- `exec`-equivalent semantics: the launcher relies on `docker exec` to
attach a TTY after the container is up. Confirm `container exec`
behaves equivalently for interactive use.

A short spike (~1 day) answering those three questions would unblock a
PRD-level decision.
## Recommendation
If this were my project today, given the README still names macOS as in
scope and the manifest example carries `/Users/didericis` paths:
1. **Today.** Add `bottles.<name>.runtime` with `runc` / `runsc` options.
Land it as a one-day PR. README gets a small "Linux hosts can opt
into gVisor for stronger isolation" note. Mac users get nothing new
but lose nothing.
2. **If VM-grade isolation on macOS becomes the goal.** Skip Firecracker
and look at Apple Container. Smaller launcher rewrite than
Firecracker; Linux stays on the gVisor / Kata path. Probably the
right v2.
3. **Firecracker only if** bot-bottle's deployment target settles on
self-hosted Linux, not laptops — at which point the "non-goal:
self-hosted VMs" line in `AGENTS.md` flips and the project's
identity changes.

The pipelock egress design ports across all of these, so none of this
work threatens the existing security-load-bearing piece of v1.
## Caveats
- gVisor's unsupported-syscall list shifts release-to-release; verify
against the version pinned in any future image.
- Kata's default VMM is configurable; performance and CVE surface vary
by VMM choice.
- Firecracker tooling has churned (Ignite is effectively unmaintained;
`firecracker-containerd` is the active path). Re-survey before
committing.
- Apple Container is young; behavior around `--internal`-style networks
and `exec` semantics needs to be verified directly, not assumed.
- Research conducted 2026-05-10.