diff --git a/docs/prds/0018-compose-per-instance.md b/docs/prds/0018-compose-per-instance.md
new file mode 100644
index 0000000..279a719
--- /dev/null
+++ b/docs/prds/0018-compose-per-instance.md
@@ -0,0 +1,390 @@
+# PRD 0018: One Compose project per bottle instance
+
+- **Status:** Draft
+- **Author:** didericis
+- **Created:** 2026-05-25
+
+## Summary
+
+Replace the current pattern of orchestrating each sidecar with its own
+`docker` SDK calls with **one `docker compose` project per bottle
+instance**. The compose project is generated at `start` time, written
+to disk under the instance's state dir, and brought up with
+`docker compose up`. Tearing the instance down is `docker compose
+down`. Logs come from `docker compose logs` and land in a single file
+per instance, so reading what happened in a session is one `less`
+away.
+
+State for each instance (`~/.claude-bottle/state//`) becomes a
+self-describing folder:
+
+```
+metadata.json       # agent_name, cwd, started_at, compose project name, ...
+docker-compose.yml  # the exact compose spec used to start this instance
+compose.log         # full dump of `docker compose logs --no-color`
+transcript/         # snapshotted agent conversation (existing)
+live-config/        # routes.yaml, allowlist — bind-mounted into sidecars (existing)
+```
+
+Anything that needs to look at "what did instance X actually run?" can
+read those artifacts. The compose file plus the metadata
+together fully describe the container topology.
+
+## Problem
+
+Today `start` builds each sidecar (`pipelock`, `egress`, `git-gate`,
+`supervise`) and the agent container with a chain of individual SDK
+calls in `claude_bottle/backend/docker/launch.py`:
+
+- A per-sidecar `Docker{Sidecar}.start()` method does
+  `docker create` → `docker cp` (stage files) → `docker network
+  connect` → `docker start`.
+- Two networks are created up front (`network_create` calls).
+- The agent container starts last via its own `docker run`.
+
+This is fine, but it has three rough edges:
+
+1. **No single artifact describes the topology.** To understand what
+   ran for instance ``, you have to read the Python that built
+   the SDK calls. Nothing is on disk you can `cat`.
+
+2. **Logs are scattered.** Each container's logs sit in Docker's
+   per-container journal. To debug a session post-mortem you have to
+   remember to run `docker logs claude-bottle-pipelock-` etc.
+   before the containers age out, and there's no merged view.
+
+3. **Teardown is bespoke.** Each sidecar's `stop()` is its own
+   method, ordered carefully in `start.py`'s `ExitStack`. A leftover
+   container or network from a crash takes the `cleanup` CLI to find.
+
+Compose is purpose-built for this shape: declarative spec, one
+project name per environment, merged logs, atomic up/down.
+
+## Goals / Success Criteria
+
+1. `claude-bottle start ` writes
+   `~/.claude-bottle/state//docker-compose.yml` and brings the
+   project up with `docker compose -p up`.
+2. The compose file is the source of truth for the container
+   topology — every sidecar that runs is declared as a `services:`
+   entry, every network is a `networks:` entry, every bind mount is
+   a `volumes:` entry.
+3. `~/.claude-bottle/state//compose.log` contains the full
+   merged stdout/stderr of every service for the session, in
+   `docker compose logs --no-color` format.
+4. `metadata.json` records the compose project name alongside the
+   existing fields (`agent_name`, `cwd`, `started_at`), so other
+   tools can derive `docker compose -p ...` invocations
+   without re-deriving the slug.
+5. Session teardown is `docker compose -p down`. The
+   existing per-sidecar `stop()` lifecycle methods come out.
+6. The `cleanup` CLI uses `docker compose ls` (filtered to
+   `claude-bottle-*` projects) instead of name-prefix scans across
+   `docker ps -a` and `docker network ls`.
+7. The existing remediation flows (`pipelock-block`,
+   `egress-block`, `capability-block`) keep working without
+   protocol changes — they write to host paths under
+   `state//live-config/`, sidecars `SIGHUP`-reload from the
+   bind mount, no compose-side restart needed.
+
+## Non-goals
+
+- **Multi-host compose.** No swarm, no remote contexts. Each instance
+  is one local Docker daemon.
+- **Replacing the manifest format.** Manifests stay; compose is an
+  implementation detail of the Docker backend.
+- **Replacing the backend abstraction (PRD 0003).** `Backend` stays
+  abstract; only the Docker implementation changes.
+- **A long-lived "claude-bottle daemon."** Each `start` invocation
+  still owns a single compose project for the lifetime of the
+  session. No persistent service.
+- **Image pre-building.** Compose's `build:` directive triggers
+  builds on first `up`, same as today; no separate build step.
+- **Backwards compatibility with running instances at upgrade.** If
+  an instance was started by the pre-compose code, the user kills
+  it and starts a new one. There's no migration path for live
+  containers.
+
+## Scope
+
+### In scope
+
+- New module `claude_bottle/backend/docker/compose.py` that renders a
+  compose dict from a `BottlePlan` and writes it to
+  `state//docker-compose.yml`.
+- `DockerBackend.start` rewritten to:
+  1. Build the plan (existing `prepare`).
+  2. Stage bind-mount inputs (CAs, routes.yaml, env file, hooks)
+     into host paths under `state//`.
+  3. Render + write the compose file.
+  4. Exec `docker compose -p up -d`.
+  5. `docker attach claude-bottle-` for the agent's TTY.
+  6. On exit: `docker compose -p logs --no-color`
+     → `state//compose.log`, then `docker compose -p
+     down --volumes`.
+- Sidecar stage files move from `docker cp`-into-container to
+  bind-mounts from `state//`. This deletes a lot of code
+  in `pipelock.py`, `git_gate.py`, `egress.py`, `supervise.py`.
+- `metadata.json` gains a `compose_project` field.
+- `cleanup` CLI rewritten to use `docker compose ls` for discovery.
+- The per-sidecar `Docker{Sidecar}.start/stop` lifecycle methods
+  collapse into `Docker{Sidecar}.compose_service()` returning a
+  service-dict fragment. Their apply / introspection helpers
+  (`egress_apply.py`, `supervise.py`'s handlers) are unchanged.
+
+### Out of scope
+
+- Changing the manifest layer (`claude_bottle/manifest.py`,
+  `egress.py`'s plan dataclasses, `pipelock.py`'s plan dataclasses).
+- Changing the agent's runtime contract (proxy env vars, CA bundle
+  paths, current-config mount path).
+- Changing audit-log shape or location
+  (`~/.claude-bottle/audit/-.log` stays).
+- Changing the MCP server's tool list or wire format.
+- Dropping the `--rm` semantics for the agent: the agent container
+  is still ephemeral; compose's `down --volumes` handles cleanup.
+
+## Proposed design
+
+### Project name
+
+`compose_project = f"claude-bottle-{slug}"`. The slug stays the
+existing `slugify(agent_name)-<5-char-random-base36>` from
+`bottle_state.py`. Compose adds its own prefix to networks
+(`_`) and to default container names — which is
+why each service gets an explicit `container_name:` (below).
+
+### Service / container naming
+
+Service names inside the compose file are short (`agent`,
+`pipelock`, `egress`, `git-gate`, `supervise`). Each service sets
+an explicit `container_name:` matching today's pattern:
+
+```yaml
+services:
+  pipelock:
+    container_name: claude-bottle-pipelock-
+  egress:
+    container_name: claude-bottle-egress-
+  # ...
+```
+
+This keeps the dashboard's container-discovery output stable for
+operators who've memorized the names. The compose project name
+(`claude-bottle-`) is the only new identifier.
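
The naming rules in the two sections above could be captured in a few pure helpers. The sketch below is illustrative only: `compose_project`, `container_name`, and `naming_fragment` are hypothetical names, not functions from the codebase. It assumes the suffix on each container name is the instance slug, and that the agent container is named `claude-bottle-` followed by the slug, as the `docker attach` examples in this PRD suggest.

```python
from typing import Dict, Iterable

# Short sidecar service names used inside the compose file (from this PRD).
SIDECARS = ("pipelock", "egress", "git-gate", "supervise")


def compose_project(slug: str) -> str:
    """Project name passed to `docker compose -p`."""
    return f"claude-bottle-{slug}"


def container_name(service: str, slug: str) -> str:
    """Explicit container_name for a service (assumed naming pattern)."""
    if service == "agent":
        # Assumption: the agent container carries no service infix.
        return f"claude-bottle-{slug}"
    if service not in SIDECARS:
        raise ValueError(f"unknown service: {service!r}")
    return f"claude-bottle-{service}-{slug}"


def naming_fragment(slug: str, services: Iterable[str]) -> Dict[str, dict]:
    """Return a `services:` mapping with only container_name filled in."""
    return {svc: {"container_name": container_name(svc, slug)} for svc in services}


if __name__ == "__main__":
    print(compose_project("implementer-a7k3f"))
    print(naming_fragment("implementer-a7k3f", ["agent", "pipelock"]))
```

Keeping these helpers free of I/O would let the renderer and the `cleanup` CLI derive the same names without re-reading `metadata.json`, though whether they belong in `compose.py` or `bottle_state.py` is left open here.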
+
+### Networks
+
+The two existing networks (`claude-bottle-net-` internal +
+`claude-bottle-egress-` upstream-bridge) become compose
+networks:
+
+```yaml
+networks:
+  internal:
+    name: claude-bottle-net-
+    internal: true
+  egress:
+    name: claude-bottle-egress-
+```
+
+Each service's `networks:` list mirrors today's wiring.
+
+### Bind mounts replace `docker cp`
+
+The current pattern of `docker create` → `docker cp file
+container:/path` → `docker start` (used by every sidecar to land
+routes.yaml, CAs, hooks) becomes host bind-mounts. The host paths
+live under `state//`:
+
+```
+state//
+  live-config/
+    routes.yaml
+    allowlist
+  pipelock-ca/
+    ca.pem
+    ca-key.pem
+  egress-ca/
+    ca.pem
+    ca-key.pem
+  git-gate/
+    entrypoint.sh
+    hooks/
+      ...
+  env/
+    agent.env
+```
+
+Each sidecar service mounts the relevant sub-tree read-only at the
+in-container path it expects. Permissions on the host paths are
+locked to 0600/0700 at write time (existing `mode=0o600` discipline
+in `prepare.py` extends naturally).
+
+### Conditional services
+
+The compose renderer takes the same `BottlePlan` the SDK calls
+read today and only emits services for sidecars that apply:
+
+- `pipelock` — always.
+- `egress` — only if `bottle.egress.routes` is non-empty.
+- `git-gate` — only if `bottle.git` is non-empty.
+- `supervise` — only if `bottle.supervise` is true.
+- `agent` — always.
+
+Conditional `depends_on:` edges keep the agent waiting on
+sidecars that exist.
+
+### Logging
+
+`docker compose up -d` starts everything detached. The agent is
+attached for the user's TTY via `docker attach claude-bottle-`.
+Sidecars stream into Docker's per-container journals
+during the session, exactly as today, and `docker compose logs -f`
+gives a merged tail if the user wants it (the dashboard can shell
+out to this).
+
+At session end (success or crash), `start.py`'s ExitStack runs:
+
+1. `snapshot_transcript(slug)` (unchanged).
+2. `docker compose -p logs --no-color --timestamps` →
+   `state//compose.log`.
+3. `docker compose -p down --volumes`.
+4. `cleanup_state(slug)` (unchanged — still removes the state dir
+   unless `.preserve` was written).
+
+The log dump is best-effort; a failure there shouldn't block
+teardown.
+
+### metadata.json shape
+
+Add one field; everything else is unchanged.
+
+```json
+{
+  "agent_name": "implementer",
+  "cwd": "/Users/.../some-project",
+  "started_at": "2026-05-25T20:13:04Z",
+  "compose_project": "claude-bottle-implementer-a7k3f"
+}
+```
+
+### Per-sidecar class shape
+
+Today's `DockerPipelock`, `DockerGitGate`, `DockerEgress`,
+`DockerSupervise` each carry `start()` + `stop()` lifecycle plus
+helper logic (image building, route validation, apply handlers).
+
+After this PRD:
+
+- The `start()`/`stop()` methods come out.
+- A new method per class, `compose_service(plan) -> dict`, returns
+  the service-stanza fragment (image / build / container_name /
+  networks / volumes / env / depends_on).
+- The image-build flow becomes `build:` in the compose file, so
+  the per-sidecar `docker build` calls go away too.
+- The apply/introspection helpers (`egress_apply.add_route`,
+  `supervise.py`'s capability handlers, etc.) are untouched — they
+  read/write host paths under `state//live-config/` and the
+  bind-mounted sidecars `SIGHUP`-reload.
+
+### Cleanup CLI
+
+`./cli.py cleanup` switches from "list every container with prefix
+`claude-bottle-` and every network with prefix `claude-bottle-net-`
+or `claude-bottle-egress-`" to:
+
+1. `docker compose ls --all --format json` → filter to projects
+   whose name starts with `claude-bottle-`.
+2. For each: `docker compose -p down --volumes`.
+3. Reap any state dirs under `~/.claude-bottle/state/` whose
+   `compose_project` no longer appears in `compose ls`.
+
+Strays from pre-compose code-paths can be mopped up by keeping the
+existing prefix scan as a fallback for one release.
+
+## Open questions
+
+1. **`docker compose` vs `docker-compose` v1.** Compose v2 ships
+   with Docker Desktop as `docker compose` (subcommand) and is what
+   `tea pr create` users will already have. Assume v2; if v1 is
+   detected, die with a pointer to upgrade.
+
+2. **Foreground vs detached + attach for the agent.** Two viable
+   shapes:
+   - **(a)** `docker compose up -d` everything, then
+     `docker attach claude-bottle-` for the agent's TTY.
+   - **(b)** `docker compose up -d` sidecars only, then a
+     separate `docker run` for the agent in the foreground using
+     the same project's networks (`--network claude-bottle-net-
+     --network claude-bottle-egress-`).
+
+   (a) keeps the agent inside the compose project (so `compose ls`,
+   `compose logs`, `compose down` all see it). (b) avoids the
+   `docker attach` gotcha that `Ctrl-P Ctrl-Q` detaches without
+   tearing down, but loses the unified compose surface.
+   Leaning toward (a); flagging for review.
+
+3. **TTY allocation under compose.** The agent needs `tty: true` +
+   `stdin_open: true` in the service stanza; `docker attach` has no
+   `-i`/`-t` flags and reuses those settings. Verify this works
+   cleanly on macOS Docker Desktop and Colima before committing.
+
+4. **`docker compose logs` ordering.** The dumped log file
+   interleaves services by timestamp. Confirm `--timestamps` is
+   enough to keep it readable; otherwise consider per-service
+   subfiles (`compose.log.pipelock`, etc.).
+
+5. **Image build caching.** `build:` in compose rebuilds on first
+   `up` unless the image is already tagged. The per-sidecar images
+   (`claude-bottle-pipelock`, `claude-bottle-egress`,
+   `claude-bottle-git-gate`, `claude-bottle-supervise`) should
+   stay tagged on the daemon between runs so we don't rebuild on
+   every start. Verify compose's behavior matches.
+
+6. **`docker compose down --volumes` and bind-mount data.** `down
+   --volumes` removes named volumes but leaves bind-mount source
+   paths alone (they're host paths under our state dir, which we
+   manage explicitly). Confirm — and if there's a footgun, drop
+   `--volumes` and rely on the state-dir cleanup step.
+
+7. **Dashboard discovery.** `cli/dashboard.py` enumerates instances
+   by scanning containers. Should it switch to `docker compose ls`
+   too, or read `metadata.json` files under `state/`? Reading state
+   dirs is faster and survives Docker daemon restarts; compose ls
+   is the truth about what's actually running. Probably both: list
+   from state dirs, mark "running" by cross-referencing compose
+   ls.
+
+## Implementation chunks
+
+Sized for one PR each, in order.
+
+1. **Compose renderer.** Pure function:
+   `bottle_plan_to_compose(plan) -> dict`. No I/O. Full unit-test
+   coverage for the conditional-service matrix (every combination
+   of git on/off, egress on/off, supervise on/off). No `start.py`
+   changes yet.
+2. **Stage-file move to host paths.** Refactor each sidecar's
+   stage-file production (today: write to host stage dir → `docker
+   cp` after create) to write directly into `state//`
+   sub-trees with bind-mount-ready perms. SDK path still does
+   `docker cp`; this is a no-op rearrangement that sets up chunk 3.
+3. **Switch `start.py` to compose.** Wire up the renderer +
+   `docker compose up -d` + attach + teardown. Per-sidecar `start()`/
+   `stop()` lifecycle methods deleted in the same chunk.
+   Compose-log dump on teardown added.
+4. **Cleanup CLI on compose.** Switch `./cli.py cleanup` to
+   `docker compose ls`-based discovery; keep prefix-scan as
+   fallback for one release.
+5. **Dashboard.** Decide on the discovery question (open question
+   #7), implement.
+
+## References
+
+- PRD 0003 — bottle backend abstraction (what stays / what
+  changes underneath it)
+- PRD 0010 / 0017 — cred-proxy → egress; the sidecar lifecycle
+  this PRD collapses into compose
+- PRD 0014 / 0015 / 0016 — apply flows that bind-mount-+-SIGHUP
+  has to keep working without protocol change
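
Implementation chunk 1 and the conditional-service rules in the proposed design could be illustrated with a small pure function. This is a sketch under stated assumptions, not the planned implementation: `PlanSketch` is a hypothetical stand-in for `BottlePlan`, whose real fields are not shown in this PRD. The function only decides which services exist and which `depends_on:` edges the agent gets. Image, network, and volume stanzas are omitted.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List


@dataclass
class PlanSketch:
    """Hypothetical stand-in for BottlePlan; field names are assumptions."""
    egress_routes: List[str] = field(default_factory=list)
    git: List[str] = field(default_factory=list)
    supervise: bool = False


def bottle_plan_to_compose(plan: PlanSketch) -> Dict[str, dict]:
    """Pure function, no I/O: pick services and the agent's depends_on edges."""
    sidecars = ["pipelock"]  # always present
    if plan.egress_routes:
        sidecars.append("egress")
    if plan.git:
        sidecars.append("git-gate")
    if plan.supervise:
        sidecars.append("supervise")

    services: Dict[str, dict] = {name: {} for name in sidecars}
    # The agent waits only on sidecars that were actually emitted.
    services["agent"] = {"depends_on": list(sidecars)}
    return {"services": services}


if __name__ == "__main__":
    # Walk the full on/off matrix, as the unit tests in chunk 1 would.
    for egress, git, supervise in product([False, True], repeat=3):
        plan = PlanSketch(
            egress_routes=["example-route"] if egress else [],
            git=["example-repo"] if git else [],
            supervise=supervise,
        )
        print(sorted(bottle_plan_to_compose(plan)["services"]))
```

Because the function takes a plan and returns a plain dict, the eight-way matrix can be covered by table-driven unit tests before `start.py` is touched, which matches the ordering of the chunks above.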