docs(prd-0018): one compose project per bottle instance

Draft a PRD that replaces the chain of per-sidecar docker SDK calls in `claude-bottle start` with a single `docker compose` project per instance. Each `state/<slug>/` dir gets a self-describing set of artifacts: metadata.json, docker-compose.yml, compose.log, and the existing transcript/ + live-config/.
2026-05-25 22:15:32 -04:00
parent a77545dc91
commit 3251ee1394
1 changed files with 390 additions and 0 deletions
@@ -0,0 +1,390 @@
+# PRD 0018: One Compose project per bottle instance
+
+- **Status:** Draft
+- **Author:** didericis
+- **Created:** 2026-05-25
+
+## Summary
+
+Replace the current pattern of orchestrating each sidecar with its own
+`docker` SDK calls with **one `docker compose` project per bottle
+instance**. The compose project is generated at `start` time, written
+to disk under the instance's state dir, and brought up with
+`docker compose up`. Tearing the instance down is `docker compose
+down`. Logs come from `docker compose logs` and land in a single file
+per instance, so reading what happened in a session is one `less`
+away.
+
+State for each instance (`~/.claude-bottle/state/<slug>/`) becomes a
+self-describing folder:
+
+```
+metadata.json           # agent_name, cwd, started_at, compose project name, ...
+docker-compose.yml      # the exact compose spec used to start this instance
+compose.log             # full dump of `docker compose logs --no-color`
+transcript/             # snapshotted agent conversation (existing)
+live-config/            # routes.yaml, allowlist — bind-mounted into sidecars (existing)
+```
+
+Anything that needs to look at "what did instance X actually run?" can
+read those five artifacts. The compose file plus the metadata
+together fully describe the container topology.
+
+## Problem
+
+Today `start` builds each sidecar (`pipelock`, `egress`, `git-gate`,
+`supervise`) and the agent container with a chain of individual SDK
+calls in `claude_bottle/backend/docker/launch.py`:
+
+- A per-sidecar `Docker{Sidecar}.start()` method does
+  `docker create` → `docker cp` (stage files) → `docker network
+  connect` → `docker start`.
+- Two networks are created up front (`network_create` calls).
+- The agent container starts last via its own `docker run`.
+
+This is fine, but it has three rough edges:
+
+1. **No single artifact describes the topology.** To understand what
+   ran for instance `<slug>`, you have to read the Python that built
+   the SDK calls. Nothing is on disk you can `cat`.
+
+2. **Logs are scattered.** Each container's logs sit in Docker's
+   per-container journal. To debug a session post-mortem you have to
+   remember to run `docker logs claude-bottle-pipelock-<slug>` etc.
+   before the containers age out, and there's no merged view.
+
+3. **Teardown is bespoke.** Each sidecar's `stop()` is its own
+   method, ordered carefully in `start.py`'s `ExitStack`. A leftover
+   container or network from a crash takes the `cleanup` CLI to find.
+
+Compose is purpose-built for this shape: declarative spec, one
+project name per environment, merged logs, atomic up/down.
+
+## Goals / Success Criteria
+
+1. `claude-bottle start <agent>` writes
+   `~/.claude-bottle/state/<slug>/docker-compose.yml` and brings the
+   project up with `docker compose -p <project> up`.
+2. The compose file is the source of truth for the container
+   topology — every sidecar that runs is declared as a `services:`
+   entry, every network is a `networks:` entry, every bind mount is
+   a `volumes:` entry.
+3. `~/.claude-bottle/state/<slug>/compose.log` contains the full
+   merged stdout/stderr of every service for the session, in
+   `docker compose logs --no-color` format.
+4. `metadata.json` records the compose project name alongside the
+   existing fields (`agent_name`, `cwd`, `started_at`), so other
+   tools can derive `docker compose -p <project> ...` invocations
+   without re-deriving the slug.
+5. Session teardown is `docker compose -p <project> down`. The
+   existing per-sidecar `stop()` lifecycle methods come out.
+6. The `cleanup` CLI uses `docker compose ls` (filtered to
+   `claude-bottle-*` projects) instead of name-prefix scans across
+   `docker ps -a` and `docker network ls`.
+7. The existing remediation flows (`pipelock-block`,
+   `egress-block`, `capability-block`) keep working without
+   protocol changes — they write to host paths under
+   `state/<slug>/live-config/`, and sidecars `SIGHUP`-reload from the
+   bind mount, so no compose-side restart is needed.
+
+## Non-goals
+
+- **Multi-host compose.** No swarm, no remote contexts. Each instance
+  is one local Docker daemon.
+- **Replacing the manifest format.** Manifests stay; compose is an
+  implementation detail of the Docker backend.
+- **Replacing the backend abstraction (PRD 0003).** `Backend` stays
+  abstract; only the Docker implementation changes.
+- **A long-lived "claude-bottle daemon."** Each `start` invocation
+  still owns a single compose project for the lifetime of the
+  session. No persistent service.
+- **Image pre-building.** Compose's `build:` directive triggers
+  builds on first `up`, same as today; no separate build step.
+- **Backwards compatibility with running instances at upgrade.** If
+  an instance was started by the pre-compose code, the user kills
+  it and starts a new one. There's no migration path for live
+  containers.
+
+## Scope
+
+### In scope
+
+- New module `claude_bottle/backend/docker/compose.py` that renders a
+  compose dict from a `BottlePlan` and writes it to
+  `state/<slug>/docker-compose.yml`.
+- `DockerBackend.start` rewritten to:
+  1. Build the plan (existing `prepare`).
+  2. Stage bind-mount inputs (CAs, routes.yaml, env file, hooks)
+     into host paths under `state/<slug>/`.
+  3. Render + write the compose file.
+  4. Exec `docker compose -p <project> up -d`.
+  5. `docker attach claude-bottle-<slug>` for the agent's TTY.
+  6. On exit: `docker compose -p <project> logs --no-color
+     --timestamps` → `state/<slug>/compose.log`, then `docker
+     compose -p <project> down --volumes`.
+- Sidecar stage files move from `docker cp`-into-container to
+  bind-mounts from `state/<slug>/`. This deletes a lot of code
+  in `pipelock.py`, `git_gate.py`, `egress.py`, `supervise.py`.
+- `metadata.json` gains a `compose_project` field.
+- `cleanup` CLI rewritten to use `docker compose ls` for discovery.
+- The per-sidecar `Docker{Sidecar}.start/stop` lifecycle methods
+  collapse into `Docker{Sidecar}.compose_service()` returning a
+  service-dict fragment. Their apply / introspection helpers
+  (`egress_apply.py`, `supervise.py`'s handlers) are unchanged.
+
+### Out of scope
+
+- Changing the manifest layer (`claude_bottle/manifest.py`,
+  `egress.py`'s plan dataclasses, `pipelock.py`'s plan dataclasses).
+- Changing the agent's runtime contract (proxy env vars, CA bundle
+  paths, current-config mount path).
+- Changing audit-log shape or location
+  (`~/.claude-bottle/audit/<component>-<slug>.log` stays).
+- Changing the MCP server's tool list or wire format.
+- Dropping the `--rm` semantics for the agent: the agent container
+  is still ephemeral; compose's `down --volumes` removes it at teardown.
+
+## Proposed design
+
+### Project name
+
+`compose_project = f"claude-bottle-{slug}"`. The slug stays the
+existing `slugify(agent_name)-<5-char-random-base36>` from
+`bottle_state.py`. Compose adds its own prefix to networks
+(`<project>_<network>`) and to default container names — which is
+why each service gets an explicit `container_name:` (below).
+
+### Service / container naming
+
+Service names inside the compose file are short (`agent`,
+`pipelock`, `egress`, `git-gate`, `supervise`). Each service sets
+an explicit `container_name:` matching today's pattern:
+
+```yaml
+services:
+  pipelock:
+    container_name: claude-bottle-pipelock-<slug>
+  egress:
+    container_name: claude-bottle-egress-<slug>
+  # ...
+```
+
+This keeps the dashboard's container-discovery output stable for
+operators who've memorized the names. The compose project name
+(`claude-bottle-<slug>`) is the only new identifier.
+
+### Networks
+
+The two existing networks (`claude-bottle-net-<slug>` internal +
+`claude-bottle-egress-<slug>` upstream-bridge) become compose
+networks:
+
+```yaml
+networks:
+  internal:
+    name: claude-bottle-net-<slug>
+    internal: true
+  egress:
+    name: claude-bottle-egress-<slug>
+```
+
+Each service's `networks:` list mirrors today's wiring.
+
+### Bind mounts replace `docker cp`
+
+The current pattern of `docker create` → `docker cp file
+container:/path` → `docker start` (used by every sidecar to land
+routes.yaml, CAs, hooks) becomes host bind-mounts. The host paths
+live under `state/<slug>/`:
+
+```
+state/<slug>/
+  live-config/
+    routes.yaml
+    allowlist
+  pipelock-ca/
+    ca.pem
+    ca-key.pem
+  egress-ca/
+    ca.pem
+    ca-key.pem
+  git-gate/
+    entrypoint.sh
+    hooks/
+    ...
+  env/
+    agent.env
+```
+
+Each sidecar service mounts the relevant sub-tree read-only at the
+in-container path it expects. Permissions on the host paths are
+locked to 0600/0700 at write time (existing `mode=0o600` discipline
+in `prepare.py` extends naturally).
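+
+A minimal sketch of how one sidecar's mounts might look in the
+rendered file, assuming the compose file lives in `state/<slug>/` so
+that relative sources resolve against that directory. The
+in-container targets (`/etc/pipelock/...`) are hypothetical
+placeholders, not the paths the sidecar actually expects:
+
+```yaml
+services:
+  pipelock:
+    container_name: claude-bottle-pipelock-<slug>
+    # image / build / networks elided
+    volumes:
+      # Long-form bind mounts; read_only keeps the sidecar from
+      # writing back into the host state dir.
+      - type: bind
+        source: ./live-config        # state/<slug>/live-config
+        target: /etc/pipelock/live   # hypothetical in-container path
+        read_only: true
+      - type: bind
+        source: ./pipelock-ca        # state/<slug>/pipelock-ca
+        target: /etc/pipelock/ca     # hypothetical in-container path
+        read_only: true
+```
+
+Mounting the `live-config/` directory rather than individual files
+matters if the host side ever replaces a file by rename: a
+single-file bind mount stays pinned to the old inode, while a
+directory mount sees the new file.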
+
+### Conditional services
+
+The compose renderer takes the same `BottlePlan` the SDK calls
+read today and only emits services for sidecars that apply:
+
+- `pipelock` — always.
+- `egress` — only if `bottle.egress.routes` is non-empty.
+- `git-gate` — only if `bottle.git` is non-empty.
+- `supervise` — only if `bottle.supervise` is true.
+- `agent` — always.
+
+The agent's `depends_on:` list is built conditionally too, so it
+waits only on sidecars that were actually emitted.
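+
+For example, a bottle with egress routes but no git or supervise
+config might render along these lines. This is a sketch only;
+whether plain start-order `depends_on` is enough, or the sidecars
+need healthchecks plus `condition: service_healthy`, is an
+assumption left open here:
+
+```yaml
+services:
+  pipelock:
+    container_name: claude-bottle-pipelock-<slug>
+    # image / build / networks / volumes elided
+  egress:
+    container_name: claude-bottle-egress-<slug>
+    # image / build / networks / volumes elided
+  agent:
+    container_name: claude-bottle-<slug>
+    # Only sidecars that were emitted appear here; git-gate and
+    # supervise are absent from both `services:` and `depends_on:`.
+    depends_on:
+      - pipelock
+      - egress
+```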
+
+### Logging
+
+`docker compose up -d` starts everything detached. The agent is
+attached for the user's TTY via
+`docker attach claude-bottle-<slug>`. Sidecars stream into Docker's
+per-container journals during the session, exactly as today, and
+`docker compose logs -f` gives a merged tail if the user wants it
+(the dashboard can shell out to this).
+
+At session end (success or crash), `start.py`'s ExitStack runs:
+
+1. `snapshot_transcript(slug)` (unchanged).
+2. `docker compose -p <project> logs --no-color --timestamps` →
+   `state/<slug>/compose.log`.
+3. `docker compose -p <project> down --volumes`.
+4. `cleanup_state(slug)` (unchanged — still removes the state dir
+   unless `.preserve` was written).
+
+The log dump is best-effort; a failure there shouldn't block
+teardown.
+
+### metadata.json shape
+
+Add one field; everything else is unchanged.
+
+```json
+{
+  "agent_name": "implementer",
+  "cwd": "/Users/.../some-project",
+  "started_at": "2026-05-25T20:13:04Z",
+  "compose_project": "claude-bottle-implementer-a7k3f"
+}
+```
+
+### Per-sidecar class shape
+
+Today's `DockerPipelock`, `DockerGitGate`, `DockerEgress`,
+`DockerSupervise` each carry `start()` + `stop()` lifecycle plus
+helper logic (image building, route validation, apply handlers).
+
+After this PRD:
+
+- The `start()`/`stop()` methods come out.
+- A new method per class, `compose_service(plan) -> dict`, returns
+  the service-stanza fragment (image / build / container_name /
+  networks / volumes / env / depends_on).
+- The image-build flow becomes `build:` in the compose file, so
+  the per-sidecar `docker build` calls go away too.
+- The apply/introspection helpers (`egress_apply.add_route`,
+  `supervise.py`'s capability handlers, etc.) are untouched — they
+  read/write host paths under `state/<slug>/live-config/` and the
+  bind-mounted sidecars `SIGHUP`-reload.
+
+### Cleanup CLI
+
+`./cli.py cleanup` switches from "list every container with prefix
+`claude-bottle-` and every network with prefix `claude-bottle-net-`
+or `claude-bottle-egress-`" to:
+
+1. `docker compose ls --all --format json` → filter to projects
+   whose name starts with `claude-bottle-`.
+2. For each: `docker compose -p <project> down --volumes`.
+3. Reap any state dirs under `~/.claude-bottle/state/` whose
+   `compose_project` no longer appears in `compose ls`.
+
+Strays from pre-compose code-paths can be mopped up by keeping the
+existing prefix scan as a fallback for one release.
+
+## Open questions
+
+1. **`docker compose` vs `docker-compose` v1.** Compose v2 ships
+   with Docker Desktop as `docker compose` (subcommand) and is what
+   most users will already have. Assume v2; if only v1 is
+   detected, die with a pointer to upgrade.
+
+2. **Foreground vs detached + attach for the agent.** Two viable
+   shapes:
+   - **(a)** `docker compose up -d` everything, then
+     `docker attach claude-bottle-<slug>` for the agent's TTY.
+   - **(b)** `docker compose up -d` sidecars only, then a
+     separate `docker run` for the agent in the foreground using
+     the same project's networks (`--network
+     claude-bottle-net-<slug> --network
+     claude-bottle-egress-<slug>`). Repeating `--network` on
+     `docker run` is only accepted by newer Docker Engine releases.
+
+   (a) keeps the agent inside the compose project (so `compose ls`,
+   `compose logs`, `compose down` all see it). (b) avoids the
+   `docker attach` gotcha that `Ctrl-P Ctrl-Q` detaches without
+   tearing down, but loses the unified compose surface; worth
+   checking first, since `docker run -it` honors the same detach
+   keys by default. Leaning toward (a); flagging for review.
+
+3. **TTY allocation under compose.** The agent needs `tty: true` +
+   `stdin_open: true` in the service stanza; `docker attach` has no
+   `-it` flags and reuses whatever TTY/stdin settings the container
+   was created with. Verify this works cleanly on macOS Docker
+   Desktop and Colima before committing.
+
+4. **`docker compose logs` ordering.** The dumped log file
+   interleaves services by timestamp. Confirm `--timestamps` is
+   enough to keep it readable; otherwise consider per-service
+   subfiles (`compose.log.pipelock`, etc.).
+
+5. **Image build caching.** `build:` in compose builds on `up`
+   unless the image is already tagged. The per-sidecar images
+   (`claude-bottle-pipelock`, `claude-bottle-egress`,
+   `claude-bottle-git-gate`, `claude-bottle-supervise`) should
+   stay tagged on the daemon between runs so we don't rebuild on
+   every start. Verify compose's behavior matches.
+
+6. **`docker compose down --volumes` and bind-mount data.** `down
+   --volumes` removes named volumes but leaves bind-mount source
+   paths alone (they're host paths under our state dir, which we
+   manage explicitly). Confirm — and if there's a footgun, drop
+   `--volumes` and rely on the state-dir cleanup step.
+
+7. **Dashboard discovery.** `cli/dashboard.py` enumerates instances
+   by scanning containers. Should it switch to `docker compose ls`
+   too, or read `metadata.json` files under `state/`? Reading state
+   dirs is faster and survives docker daemon restarts; compose ls
+   is the truth about what's actually running. Probably both: list
+   from state dirs, mark "running" by cross-referencing compose
+   ls.
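+
+For open questions 3 and 5, the agent stanza under discussion would
+look roughly like the sketch below. The image tag
+`claude-bottle-agent` and the build context path are hypothetical;
+this PRD names only the sidecar images:
+
+```yaml
+services:
+  agent:
+    container_name: claude-bottle-<slug>
+    # Setting `image:` next to `build:` tags the build result, so a
+    # later `up` can reuse the tagged image instead of building.
+    image: claude-bottle-agent   # hypothetical tag
+    build:
+      context: ./agent           # hypothetical build context
+    tty: true          # allocate a pseudo-TTY, like `docker run -t`
+    stdin_open: true   # keep stdin open, like `docker run -i`
+```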
+
+## Implementation chunks
+
+Sized for one PR each, in order.
+
+1. **Compose renderer.** Pure function:
+   `bottle_plan_to_compose(plan) -> dict`. No I/O. Full unit-test
+   coverage for the conditional-service matrix (every combination
+   of git on/off, egress on/off, supervise on/off). No `start.py`
+   changes yet.
+2. **Stage-file move to host paths.** Refactor each sidecar's
+   stage-file production (today: write to host stage dir → `docker
+   cp` after create) to write directly into `state/<slug>/`
+   sub-trees with bind-mount-ready perms. SDK path still does
+   `docker cp`; this is a no-op rearrangement that sets up chunk 3.
+3. **Switch `start.py` to compose.** Wire up the renderer +
+   `docker compose up -d` + attach + teardown. Per-sidecar `start()`/
+   `stop()` lifecycle methods deleted in the same chunk.
+   Compose-log dump on teardown added.
+4. **Cleanup CLI on compose.** Switch `./cli.py cleanup` to
+   `docker compose ls`-based discovery; keep prefix-scan as
+   fallback for one release.
+5. **Dashboard.** Decide on the discovery question (open question
+   #7), implement.
+
+## References
+
+- PRD 0003 — bottle backend abstraction (what stays / what
+  changes underneath it)
+- PRD 0010 / 0017 — cred-proxy → egress; the sidecar lifecycle
+  this PRD collapses into compose
+- PRD 0014 / 0015 / 0016 — apply flows that bind-mount-+-SIGHUP
+  has to keep working without protocol change