bot-bottle/docs/research/stronger-isolation-alternatives.md
didericis-codex 18e3b62b72
docs: rename CLAUDE.md to AGENTS.md and rebrand provider-agnostic
Delete CLAUDE.md in favor of AGENTS.md as the orientation doc, rebrand
the project from Codex-bottle to provider-agnostic bot-bottle, and
repoint every CLAUDE.md reference across PRDs, research notes, the
implementer agent example, and the yaml_subset comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-28 20:36:47 -04:00

Stronger isolation alternatives: gVisor, Kata, Firecracker, Apple Container

Research into what it would take to replace or augment Docker (with runc) as the agent runtime in bot-bottle, and what each option would buy in security terms versus what it would cost in launcher rewrite.

Summary

There is a ladder, not a menu. Three realistic rungs, ordered by effort:

  1. gVisor (runsc): flip a runtime flag per bottle. ~1-2 days. Adds a userspace syscall boundary; blocks most kernel-CVE escape classes.
  2. Kata Containers: flip a runtime flag per bottle. Same Docker UX, real microVM underneath. Linux-host only.
  3. Firecracker direct: replace Docker as the runtime entirely. Weeks of work. Strongest boundary, no macOS support.

A fourth option, Apple Container, is the right macOS-native answer to "I want Kata's isolation model without giving up MacBooks as the dev target." Probably the right v2 if bot-bottle keeps macOS in scope.

The pipelock egress design is portable across all four: every option can provide a network primitive that means "no default route except through the proxy" (Docker --internal, Kata's virtualized bridge, TAP-only Firecracker, Apple Container's per-VM networking). Whichever rung is chosen, the security-load-bearing part of the v1 design survives.

Threat model recap

The current v1 boundary is a single node:22-slim container running as uid 1000 under runc, sharing a kernel with the host. This protects against:

  • accidental host-filesystem access by Claude Code,
  • network egress not approved by the pipelock allowlist,
  • a misbehaving but uncoordinated agent.

It does not protect against:

  • a kernel-level container escape (Dirty Pipe / runc CVE class),
  • a coordinated attacker with code execution inside the container who targets the host kernel,
  • side channels accessible from the shared kernel.

Stronger isolation closes the gaps in the second list. Whether that's worth the effort depends on whether you trust the agent's code-execution surface more or less than you trust the host kernel.

Rung 1: gVisor (runsc)

gVisor is a userspace kernel that registers as a Docker runtime. The agent's syscalls are intercepted and re-implemented in Go rather than forwarded to the host kernel.

What changes in this codebase

  • bot_bottle/cli/start.py (where docker run is assembled): add --runtime=runsc to the container args when the bottle requests it. Make it configurable: bottles.<name>.runtime: "runsc" | "runc", default runc.
  • bot_bottle/docker.py: add a require_runsc() check that runs docker info --format '{{.Runtimes}}' once and dies with an install pointer if runsc isn't registered.
  • network.py, pipelock.py, skills.py, ssh.py: no changes. Docker networks, docker exec, docker cp, volume mounts, the pipelock sidecar — all of it still works because gVisor is invisible at the Docker API layer.
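The two launcher changes above could look roughly like the sketch below. It is a minimal illustration, not the project's actual code: resolve_runtime, runtime_args, and registered_runtimes are hypothetical helper names, the bottle config is assumed to be a plain dict, and the check uses the `{{json .Runtimes}}` template (a variation on the `{{.Runtimes}}` form mentioned above) so the output can be parsed as JSON rather than as Go's map formatting.

```python
import json
import subprocess
import sys

# Assumption: only these two values are accepted for bottles.<name>.runtime.
ALLOWED_RUNTIMES = ("runc", "runsc")


def resolve_runtime(bottle_cfg: dict) -> str:
    """Return the bottle's runtime, defaulting to runc when the key is unset."""
    runtime = bottle_cfg.get("runtime", "runc")
    if runtime not in ALLOWED_RUNTIMES:
        raise ValueError(f"unsupported runtime {runtime!r}; expected one of {ALLOWED_RUNTIMES}")
    return runtime


def runtime_args(runtime: str) -> list[str]:
    """Extra `docker run` arguments for the chosen runtime."""
    # runc is Docker's default runtime, so it needs no flag.
    return [] if runtime == "runc" else [f"--runtime={runtime}"]


def registered_runtimes(docker_info_json: str) -> set[str]:
    """Parse the output of `docker info --format '{{json .Runtimes}}'`."""
    # The JSON object is keyed by runtime name; only the keys matter here.
    return set(json.loads(docker_info_json))


def require_runsc() -> None:
    """Exit with an install pointer if runsc is not registered with Docker."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    if "runsc" not in registered_runtimes(out):
        sys.exit("runsc is not registered with Docker; install gVisor (https://gvisor.dev) first")
```

In start.py the launcher would then call require_runsc() only when resolve_runtime() returns "runsc", and splice runtime_args() into the existing docker run argument list.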

What you get

  • A second syscall boundary between the agent and the host kernel. Most container-escape CVEs (Dirty Pipe / runc-escape class) stop at runsc.
  • Roughly a 2-10% performance hit on syscall-heavy workloads. npm install will feel it; interactive claude typing will not.

Caveats

  • macOS does not run runsc natively. It needs a Linux kernel. On Mac, gVisor would run inside Docker Desktop's Linux VM, so the effective boundary becomes "agent ↔ runsc ↔ Docker Desktop's Linux VM ↔ hypervisor ↔ macOS". The hypervisor was already doing the heavy lifting; on Mac, runsc is mostly defense-in-depth. On a Linux host it's a real win.
  • Some syscalls are unsupported (a small list — io_uring historically, some ptrace shapes). For Claude Code + git + npm I expect zero issues, but a smoke test (claude --version && git status && npm install) inside the runsc image is worth it.
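The smoke test mentioned above can be scripted so it runs the same way on every host. This is a sketch under assumptions: the image tag passed in is whatever the project builds (the "bot-bottle:latest" tag in the usage note is hypothetical), and the image's working directory is assumed to contain a git repository and a package.json, since git status and npm install would otherwise fail for reasons unrelated to gVisor.

```python
import subprocess

# The three commands from the smoke test, chained so the first failure stops the run.
SMOKE_SCRIPT = "claude --version && git status && npm install"


def smoke_test_argv(image: str, runtime: str = "runsc") -> list[str]:
    """Build the `docker run` argv for a one-shot smoke test under the given runtime."""
    return [
        "docker", "run", "--rm",
        f"--runtime={runtime}",
        image,
        "sh", "-c", SMOKE_SCRIPT,
    ]


def run_smoke_test(image: str, runtime: str = "runsc") -> bool:
    """Run the smoke test and report whether every command exited cleanly."""
    return subprocess.run(smoke_test_argv(image, runtime)).returncode == 0
```

Calling run_smoke_test("bot-bottle:latest") once under runsc and once with runtime="runc" gives a direct comparison if anything fails.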

Effort

~1-2 days, plus a paragraph in the README. Cleanest first step.

Rung 2: Kata Containers

Kata also registers as a Docker/containerd runtime (--runtime=kata-runtime), but each container actually runs inside its own lightweight VM. The VMM under the hood is configurable: Firecracker, Cloud Hypervisor, or QEMU.

What changes in this codebase

Essentially the same as the gVisor path: flip a runtime flag, add a require-check. Pipelock keeps working unchanged, because Kata virtualizes the network at the VM level but exposes it as a normal Docker network.

Tradeoffs vs. gVisor

  • Stronger boundary (real VM, not a syscall filter).
  • Slower cold start (hundreds of ms vs. tens). For interactive Claude this is fine; for ephemeral batch agents you would notice.
  • Not natively supported on macOS at all — needs a Linux host or a Linux VM you control. This is the moment bot-bottle stops being "works on a Mac dev laptop with Docker Desktop."

When this is the right rung

If the deployment target is "agents run on a small Linux server I administer," Kata is the sweet spot. If the target stays "users run this on their MacBook," skip to the Apple Container option.

Rung 3: Firecracker directly

Firecracker is a VMM, not a container runtime. Adopting it means replacing Docker, not adding to it.

What you would lose / have to rebuild

| Today | With Firecracker |
| --- | --- |
| Dockerfile (node:22-slim image) | A rootfs (ext4 image) + a kernel (vmlinux) you build and pin |
| docker run --network … | TAP devices on the host, connected to a Linux bridge or routed manually |
| docker exec -it for the interactive TTY | vsock + a small in-guest agent, or SSH into the microVM |
| docker cp for skills + pipelock YAML | Bake into the rootfs, mount a virtio-blk overlay, or 9p / virtiofs share |
| Pipelock as a sidecar on a --internal network | Pipelock as a separate microVM (or on the host) with a TAP-only path between the two; the agent VM gets no host route |
| docker rm -f on exit | A SIGTERM to firecracker + cleanup of TAPs, sockets, overlay disks |
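To make the networking row concrete, here is a rough sketch of the host-side setup that would stand in for a Docker --internal network: one bridge per agent with two TAP devices attached. The function name and the interface naming scheme are hypothetical. The sketch only builds the iproute2 command lines; running them requires root, and it takes the "no host route" approach by never assigning the bridge an IP address.

```python
# Linux interface names are limited to 15 characters (IFNAMSIZ minus the NUL).
MAX_IFNAME = 15


def network_plan(agent_id: str) -> list[list[str]]:
    """Return the `ip` commands that create a bridge plus agent and pipelock TAPs."""
    bridge = f"bb-{agent_id}-br"
    taps = [f"bb-{agent_id}-ag", f"bb-{agent_id}-pl"]  # agent side, pipelock side
    for name in [bridge, *taps]:
        if len(name) > MAX_IFNAME:
            raise ValueError(f"interface name {name!r} exceeds {MAX_IFNAME} characters")

    plan = [["ip", "link", "add", "name", bridge, "type", "bridge"]]
    for tap in taps:
        plan.append(["ip", "tuntap", "add", "dev", tap, "mode", "tap"])
        plan.append(["ip", "link", "set", "dev", tap, "master", bridge])
        plan.append(["ip", "link", "set", "dev", tap, "up"])
    # The bridge is brought up without an address, so the host itself
    # is not a gateway on this segment; only the two TAPs can talk.
    plan.append(["ip", "link", "set", "dev", bridge, "up"])
    return plan
```

A launcher would pass each command to subprocess.run(cmd, check=True) and reverse the plan with ip link del on exit, which covers the TAP part of the cleanup row.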

Files in this repo that would change

  • bot_bottle/docker.py → replaced by a new bot_bottle/firecracker.py that sends PUT requests to the Firecracker API socket per microVM (/boot-source, /drives, /network-interfaces, /actions).
  • bot_bottle/network.py → a host-side networking module that creates a Linux bridge per agent, two TAPs (agent-side, pipelock-side), and either iptables rules or no host route at all so the agent VM literally cannot reach anything except pipelock.
  • bot_bottle/pipelock.py → instead of a sidecar container, run pipelock as its own microVM (or on the host pinned to the bridge). The hostname-allowlist semantics carry over; the implementation is different.
  • bot_bottle/skills.py, ssh.py → can no longer use docker cp. Bake skills into the rootfs at build time, or mount a virtiofs share read-only.
  • Dockerfile → replaced by a rootfs builder. Realistically this means using something like firecracker-containerd or building the rootfs with debootstrap / mkosi and a kernel from upstream.
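For a sense of scale, a hypothetical bot_bottle/firecracker.py could start out as small as the sketch below. It drives the Firecracker API over its Unix socket with PUT requests to the four endpoints named above. The class and function names, the drive and interface IDs, and the boot arguments are illustrative assumptions; the kernel path, rootfs path, and TAP name come from the caller.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")  # host is only used for the Host header
        self._socket_path = socket_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._socket_path)


def boot_requests(kernel: str, rootfs: str, tap: str) -> list[tuple[str, dict]]:
    """Return the (path, JSON body) pairs needed to configure and start one microVM."""
    return [
        ("/boot-source", {"kernel_image_path": kernel,
                          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                            "is_root_device": True, "is_read_only": False}),
        ("/network-interfaces/eth0", {"iface_id": "eth0", "host_dev_name": tap}),
        ("/actions", {"action_type": "InstanceStart"}),
    ]


def boot_microvm(socket_path: str, kernel: str, rootfs: str, tap: str) -> None:
    """Send each request in order; raise on the first non-success response."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for path, body in boot_requests(kernel, rootfs, tap):
            conn.request("PUT", path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            resp = conn.getresponse()
            detail = resp.read().decode()  # drain the body so the connection can be reused
            if resp.status >= 300:
                raise RuntimeError(f"PUT {path} failed with {resp.status}: {detail}")
    finally:
        conn.close()
```

The API calls are the easy part. The TTY attach, file injection, and teardown rows of the table are where most of the rewrite effort would go.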

What you would gain

  • A real KVM boundary. The strongest isolation realistically achievable on commodity hardware.
  • Sub-second cold starts (Firecracker boots in ~125 ms; rootfs prep dominates).

What you would give up

  • macOS support. Firecracker is KVM-only. The only way back to Mac is to nest a Linux VM hosting Firecracker, at which point the security argument gets thin again.
  • Ecosystem ergonomics. No docker logs, no docker exec, no docker network inspect. You write all of that yourself or adopt firecracker-containerd or Ignite (which is unmaintained — verify before committing).

Effort

Realistically 2-4 weeks of focused work on the runtime layer. Forces dropping "v1 works on Mac" as a goal. PRD-worthy, not a side quest.

Rung 3.5: Apple Container (macOS-native VM-per-container)

Apple Container is Apple's container CLI, native on Apple Silicon. Each container runs in its own Virtualization.framework VM. It is the macOS-native answer to "I want Kata's isolation model on my MacBook."

Why it matters here

The CLI surface mirrors Docker closely (container run, container network create, etc.), so the launcher rewrite is far smaller than Firecracker's. On Linux hosts you would still take the gVisor or Kata path. The result is:

  • macOS: Apple Container (per-container VM via Virtualization.framework),
  • Linux: gVisor or Kata,
  • one Python launcher that switches on host OS.
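The "switches on host OS" part could be as simple as the sketch below. The Backend type and select_backend function are hypothetical names. The sketch assumes Apple Container is invoked through the container CLI and that the Linux path keeps using docker with a --runtime flag. Defaulting Linux hosts to runsc is an arbitrary choice for illustration.

```python
import platform
from dataclasses import dataclass
from typing import Optional

# Runtime names as used earlier in this note for the gVisor and Kata paths.
LINUX_RUNTIMES = ("runsc", "kata-runtime")


@dataclass(frozen=True)
class Backend:
    cli: str                 # executable the launcher shells out to
    runtime: Optional[str]   # value for --runtime, or None when not applicable


def select_backend(system: str, linux_runtime: str = "runsc") -> Backend:
    """Pick the container backend from the value of platform.system()."""
    if system == "Darwin":
        # Apple Container already gives each container its own VM.
        return Backend(cli="container", runtime=None)
    if system == "Linux":
        if linux_runtime not in LINUX_RUNTIMES:
            raise ValueError(f"unsupported Linux runtime {linux_runtime!r}")
        return Backend(cli="docker", runtime=linux_runtime)
    raise RuntimeError(f"unsupported host OS: {system}")


def current_backend() -> Backend:
    """Backend for the machine the launcher is running on."""
    return select_backend(platform.system())
```

Taking the OS name as a parameter keeps the selection logic testable on any machine. How much of the rest of the launcher can stay shared depends on the open questions below.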

Open questions before committing

  • Does Apple Container support a --internal-equivalent network with no default gateway, so the pipelock topology is reproducible?
  • Image format: Apple Container uses OCI images, so the existing Dockerfile should be reusable, but this needs verification.
  • exec-equivalent semantics: the launcher relies on docker exec to attach a TTY after the container is up. Confirm container exec behaves equivalently for interactive use.

A short spike (~1 day) answering those three questions would unblock a PRD-level decision.

Recommendation

If this were my project today, given the README still names macOS as in scope and the manifest example carries /Users/didericis paths:

  1. Today. Add bottles.<name>.runtime with runc / runsc options. Land it as a one-day PR. README gets a small "Linux hosts can opt into gVisor for stronger isolation" note. Mac users get nothing new but lose nothing.
  2. If VM-grade isolation on macOS becomes the goal. Skip Firecracker and look at Apple Container. Smaller launcher rewrite than Firecracker; Linux stays on the gVisor / Kata path. Probably the right v2.
  3. Firecracker only if bot-bottle's deployment target settles on self-hosted Linux, not laptops — at which point the "non-goal: self-hosted VMs" line in AGENTS.md flips and the project's identity changes.

The pipelock egress design ports across all of these, so none of this work threatens the existing security-load-bearing piece of v1.

Caveats

  • gVisor's unsupported-syscall list shifts release-to-release; verify against the version pinned in any future image.
  • Kata's VMM is configurable (Firecracker, Cloud Hypervisor, or QEMU); performance and CVE surface vary by VMM choice.
  • Firecracker tooling has churned (Ignite is effectively unmaintained; firecracker-containerd is the active path). Re-survey before committing.
  • Apple Container is young; behavior around --internal-style networks and exec semantics needs to be verified directly, not assumed.
  • Research conducted 2026-05-10.