5e8ca21669
The project started life as bash scripts and got rewritten to Python (documented in docs/research/bash-vs-python-vs-go.md). Several docs still carried the old "bash-first" framing — misleading for anyone reading them now (8.7k lines of Python vs. ~130 lines of bash, all in scripts/demo*.sh). - CLAUDE.md "What this is" + "Conventions": orchestrator is Python, posture is stdlib-first. - docs/prds/0010-cred-proxy.md, docs/research/manifest-format-and- grouping.md: quoted CLAUDE.md's old wording — re-quote. - docs/research/built-in-supervisor-design.md, landscape-containerized- claude.md, agent-sandbox-landscape.md, pipelock-assessment.md, network-egress-guard.md: drop "bash-first" claims about the project, keep accurate descriptions of external tools' bash usage. Leaves untouched: bash code-fence syntax in examples, README's literal `bash scripts/demo.sh` invocation (the demo IS bash), Claude Code's "Bash tool" references, IVIJL/devbox bash description (that project actually is bash), and the bash-vs-python-vs-go research note that records the rewrite decision. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
97 lines
9.0 KiB
Markdown
97 lines
9.0 KiB
Markdown
# Built-in Supervisor Design
|
||
|
||
## Question
|
||
|
||
Can claude-bottle grow a built-in supervisor — TUI inventory plus PR-feedback routing — without breaking the per-bottle isolation model, and without departing from the Python-stdlib-first, low-dependency posture?
|
||
|
||
## Context
|
||
|
||
claude-bottle today is a fleet *executor*: `./cli.py start <agent>` brings up one bottle (agent container + pipelock + optional git-gate + optional cred-proxy on a per-bottle internal network), and `cli.py` tears it down when the session ends. There is no inventory view, no idle-detection, no automated reaction to PR or CI events. In parallel use, a human is the supervisor — opening one terminal per bottle, switching between them, and watching upstream PR state by hand.
|
||
|
||
A separate survey of the broader ecosystem ([agent control dashboards research, mid-2026](https://gitea.dideric.is/didericis/consilium-research/src/branch/main/developer-workflow/agent-control-dashboards-2026-05-24.md)) sorts dashboards into five tiers (session managers, parallel runners, Kanban boards, mission-control SPAs, observability backends). The earlier first-pass conclusion was that a full SPA tier conflicts with claude-bottle's isolation model. This doc reconsiders the smaller question: a TUI supervisor in the existing Python CLI.
|
||
|
||
## What I got wrong the first time
|
||
|
||
The earlier framing treated "add a supervisor" as synonymous with "adopt something Composio-AO-shaped" — a Next.js SPA with plugins, dashboards, and a long-running web server. On that framing, the answer is correctly "no, that's too heavy and breaks isolation."
|
||
|
||
But the framing collapses two different costs that aren't actually coupled:
|
||
|
||
1. The runtime cost of *each bottle* (already paid: container + 1–3 sidecars + 2 networks).
|
||
2. The runtime cost of a *supervisor* that watches and controls bottles.
|
||
|
||
A supervisor doesn't have to be heavy. A TUI built into the existing Python CLI, reading `docker ps` and host-side log files, is closer in spirit to `tmux-agent-status` than to Mission Control. The trust analysis below is what actually matters.
|
||
|
||
## Proposed design
|
||
|
||
Three layers, each independently useful, in order of ambition:
|
||
|
||
### 1. `./cli.py status` — read-only inventory
|
||
|
||
Reads `docker ps` filtered by a bottle label and tails each bottle's session log. Reports per bottle: name, agent, uptime, last-activity timestamp, token spend if available, associated PR/branch if recorded.
|
||
|
||
No new daemons. No new ports. No new credentials. ~100 lines.
|
||
|
||
### 2. `./cli.py watch` — TUI over the same data
|
||
|
||
Same data as `status`, rendered with auto-refresh and keyboard shortcuts that shell out to the existing `cli.py attach / stop / start` commands.
|
||
|
||
Library choice: prefer the stdlib `curses` module to stay stdlib-first; fall back to `rich` or `textual` only if the curses path proves painful. Both `rich` and `textual` are single-purpose, pure-Python deps with no transitive bloat, but they are still new deps and per the project conventions warrant a deliberate decision.
|
||
|
||
This is the Claude Squad / tmux-agent-status pattern, applied to bottles instead of tmux sessions. The whole category exists *because* a TUI is the lightweight shape that doesn't require what the SPA tier requires.
|
||
|
||
### 3. `./cli.py supervise` — PR feedback router
|
||
|
||
The optional, more ambitious layer. The bottle manifest gains an optional field:
|
||
|
||
```yaml
|
||
pr_watch:
|
||
upstream: gitea.dideric.is/didericis/myproject
|
||
branch: agent/task-42
|
||
```
|
||
|
||
`./cli.py supervise` polls the named upstream for new review comments and CI failures on `branch`. When one fires, it surfaces as a desktop notification or a flash in the TUI. The human decides what to do with the feedback — there is no autonomous loop that feeds the comment back into a bottle's next prompt (see "Where to be conservative" for why).
|
||
|
||
The polling token is a **host** token (the same `GH_PAT` / Gitea token the host already keeps in shell env), not a bottle credential. The supervisor never holds bottle secrets.
|
||
|
||
## Why this doesn't break the trust model
|
||
|
||
The load-bearing question is whether the supervisor introduces the privileged-channel-into-every-bottle problem that disqualifies the SPA tier. It does not, for four reasons:
|
||
|
||
| Concern | Mitigation |
|
||
|---|---|
|
||
| Reaching into running bottles | Supervisor reads `docker ps` and host-side log files. The host already sees both — Docker is the trust boundary, the supervisor is on the host side of it. |
|
||
| Holding bottle credentials | The polling token is a host token. The supervisor never receives `bottle.cred_proxy.routes` entries; it has no path to them. |
|
||
| Bridging between bottles | The supervisor does not relay state from bottle A to bottle B. It relays *upstream PR state* to a bottle's next prompt — and only if the manifest opts in. |
|
||
| New attack surface | All "control" actions go through `./cli.py start <agent>`, which already enforces the manifest. The supervisor is an automated caller of the existing CLI, not a parallel control plane. |
|
||
|
||
The boundary stays at the bottle wall. The supervisor looks outward at git/PR state and downward at Docker; it does not look *inward* through pipelock.
|
||
|
||
This also doesn't conflict with the "lean on git history for auditing" non-goal. The supervisor is using git/PR state as the *input* to its loop, not constructing a separate audit log. Git history remains the source of truth for what happened.
|
||
|
||
## Where to be conservative
|
||
|
||
A few design defaults worth holding:
|
||
|
||
- **No auto-respawn.** The supervisor surfaces PR feedback to a human, never to the bottle's next prompt. The autonomous flow (review-comment → tear down → relaunch with the comment prepended) was considered and rejected: in a public-ish repo, any commenter could inject content that the next launch would treat as system instructions, with the agent's full bottle privileges. Available mitigations — commenter allowlists, prompt-injection regex screens, private-repo defaults — are all soft. The load-bearing defense is to keep the human between the review comment and any agent prompt. Notify-only is the only mode.
|
||
- **Idle detection is harder than it looks.** Last-log-line-age works ~80% of the time. Codeman's Ralph Loop tracker (watching for `<promise>` tags) is more accurate but adds complexity and tooling-coupling. Start with the dumb version; add heuristics only when actual confusion arises.
|
||
- **No web UI.** A browser UI reintroduces the privileged-channel problem — the browser talks to a server that talks to all bottles. TUI sidesteps it because the supervisor runs in the user's own shell context, not as a long-running daemon serving multiple consumers.
|
||
- **State file in `~/.claude-bottle/`, not inside any bottle.** The mapping of bottle → PR → status lives next to the manifest. Nothing about the supervisor's bookkeeping enters a bottle.
|
||
- **No new credentials on bottles.** PR-watch is a host-side concern. A bottle's manifest *names* the upstream/branch to watch; it does not grant the bottle the ability to read PR state itself.
|
||
|
||
## Trust-model edge cases worth flagging
|
||
|
||
- **Cross-host supervisor.** If the supervisor ever grows to coordinate bottles on multiple hosts, the trust analysis changes — the polling token now has to travel, and the "host" boundary is no longer one machine. Out of scope for v1; flag in any future design doc that contemplates it.
|
||
- **Native Claude Code OpenTelemetry as an alternative observability path.** Setting `CLAUDE_CODE_ENABLE_TELEMETRY=1` inside a bottle would emit OTel data — but the bottle's pipelock allowlist must then include the OTel collector's host, and the trace data is generated by the (untrusted) agent rather than by the (trusted) host-side perimeter. The supervisor-via-Docker-plus-pipelock-logs path is the cleaner observability spine for this project's threat model; in-bottle OTel is an extra signal worth weighing case-by-case, not the default.
|
||
|
||
## Scope estimate
|
||
|
||
The full `status` / `watch` / `supervise` trio is plausibly ~500 lines of Python on top of the existing CLI, no new runtimes, no new daemons, no new ports, and (with `curses`) no new deps. That fits "Low dependencies by default. The project is Python, stdlib-first; ask before adding new tools, runtimes, or package managers" without requiring an exception.
|
||
|
||
Phased: `status` first (purely additive, no design decisions), then `watch` (the design decisions are mostly UX, not architecture), then `supervise` (the only layer that introduces a new behavioral default and warrants a PRD of its own).
|
||
|
||
## Conclusion
|
||
|
||
A supervisor that respects the bottle wall is a small natural extension of what claude-bottle already is, not a category shift toward Mission Control / Codeman / Composio AO. The mistake in earlier framing was treating "supervisor" as synonymous with "dashboard SPA." The trust-model question that disqualifies the SPA tier (privileged channel into every bottle) does not apply to a TUI that reads host-side signals and shells out to the existing CLI.
|
||
|
||
Recommendation: build `status` and `watch` opportunistically when the pain is felt; treat `supervise` as a separate PRD before implementation, scoped to notify-only (no autonomous loop from review comment to next agent prompt — see "Where to be conservative").
|