Files
bot-bottle/docs/research/built-in-supervisor-design.md
T
didericis 5e8ca21669
test / unit (pull_request) Successful in 16s
test / integration (pull_request) Successful in 1m32s
docs: replace stale bash-first framing with Python-stdlib-first
The project started life as bash scripts and got rewritten to Python
(documented in docs/research/bash-vs-python-vs-go.md). Several docs
still carried the old "bash-first" framing — misleading for anyone
reading them now (8.7k lines of Python vs. ~130 lines of bash, all
in scripts/demo*.sh).

- CLAUDE.md "What this is" + "Conventions": orchestrator is Python,
  posture is stdlib-first.
- docs/prds/0010-cred-proxy.md, docs/research/manifest-format-and-
  grouping.md: quoted CLAUDE.md's old wording — re-quote.
- docs/research/built-in-supervisor-design.md, landscape-containerized-
  claude.md, agent-sandbox-landscape.md, pipelock-assessment.md,
  network-egress-guard.md: drop "bash-first" claims about the project,
  keep accurate descriptions of external tools' bash usage.

Leaves untouched: bash code-fence syntax in examples, README's
literal `bash scripts/demo.sh` invocation (the demo IS bash),
Claude Code's "Bash tool" references, IVIJL/devbox bash description
(that project actually is bash), and the bash-vs-python-vs-go
research note that records the rewrite decision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 06:32:42 -04:00

97 lines
9.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Built-in Supervisor Design
## Question
Can claude-bottle grow a built-in supervisor — TUI inventory plus PR-feedback routing — without breaking the per-bottle isolation model, and without departing from the Python-stdlib-first, low-dependency posture?
## Context
claude-bottle today is a fleet *executor*: `./cli.py start <agent>` brings up one bottle (agent container + pipelock + optional git-gate + optional cred-proxy on a per-bottle internal network), and `cli.py` tears it down when the session ends. There is no inventory view, no idle-detection, no automated reaction to PR or CI events. In parallel use, a human is the supervisor — opening one terminal per bottle, switching between them, and watching upstream PR state by hand.
A separate survey of the broader ecosystem ([agent control dashboards research, mid-2026](https://gitea.dideric.is/didericis/consilium-research/src/branch/main/developer-workflow/agent-control-dashboards-2026-05-24.md)) sorts dashboards into five tiers (session managers, parallel runners, Kanban boards, mission-control SPAs, observability backends). The earlier first-pass conclusion was that a full SPA tier conflicts with claude-bottle's isolation model. This doc reconsiders the smaller question: a TUI supervisor in the existing Python CLI.
## What I got wrong the first time
The earlier framing treated "add a supervisor" as synonymous with "adopt something Composio-AO-shaped" — a Next.js SPA with plugins, dashboards, and a long-running web server. On that framing, the answer is correctly "no, that's too heavy and breaks isolation."
But the framing collapses two different costs that aren't actually coupled:
1. The runtime cost of *each bottle* (already paid: container + 13 sidecars + 2 networks).
2. The runtime cost of a *supervisor* that watches and controls bottles.
A supervisor doesn't have to be heavy. A TUI built into the existing Python CLI, reading `docker ps` and host-side log files, is closer in spirit to `tmux-agent-status` than to Mission Control. The trust analysis below is what actually matters.
## Proposed design
Three layers, each independently useful, in order of ambition:
### 1. `./cli.py status` — read-only inventory
Reads `docker ps` filtered by a bottle label and tails each bottle's session log. Reports per bottle: name, agent, uptime, last-activity timestamp, token spend if available, associated PR/branch if recorded.
No new daemons. No new ports. No new credentials. ~100 lines.
### 2. `./cli.py watch` — TUI over the same data
Same data as `status`, rendered with auto-refresh and keyboard shortcuts that shell out to the existing `cli.py attach / stop / start` commands.
Library choice: prefer the stdlib `curses` module to stay stdlib-first; fall back to `rich` or `textual` only if the curses path proves painful. Both `rich` and `textual` are single-purpose, pure-Python deps with no transitive bloat, but they are still new deps and per the project conventions warrant a deliberate decision.
This is the Claude Squad / tmux-agent-status pattern, applied to bottles instead of tmux sessions. The whole category exists *because* a TUI is the lightweight shape that doesn't require what the SPA tier requires.
### 3. `./cli.py supervise` — PR feedback router
The optional, more ambitious layer. The bottle manifest gains an optional field:
```yaml
pr_watch:
upstream: gitea.dideric.is/didericis/myproject
branch: agent/task-42
```
`./cli.py supervise` polls the named upstream for new review comments and CI failures on `branch`. When one fires, it surfaces as a desktop notification or a flash in the TUI. The human decides what to do with the feedback — there is no autonomous loop that feeds the comment back into a bottle's next prompt (see "Where to be conservative" for why).
The polling token is a **host** token (the same `GH_PAT` / Gitea token the host already keeps in shell env), not a bottle credential. The supervisor never holds bottle secrets.
## Why this doesn't break the trust model
The load-bearing question is whether the supervisor introduces the privileged-channel-into-every-bottle problem that disqualifies the SPA tier. It does not, for four reasons:
| Concern | Mitigation |
|---|---|
| Reaching into running bottles | Supervisor reads `docker ps` and host-side log files. The host already sees both — Docker is the trust boundary, the supervisor is on the host side of it. |
| Holding bottle credentials | The polling token is a host token. The supervisor never receives `bottle.cred_proxy.routes` entries; it has no path to them. |
| Bridging between bottles | The supervisor does not relay state from bottle A to bottle B. It relays *upstream PR state* to a bottle's next prompt — and only if the manifest opts in. |
| New attack surface | All "control" actions go through `./cli.py start <agent>`, which already enforces the manifest. The supervisor is an automated caller of the existing CLI, not a parallel control plane. |
The boundary stays at the bottle wall. The supervisor looks outward at git/PR state and downward at Docker; it does not look *inward* through pipelock.
This also doesn't conflict with the "lean on git history for auditing" non-goal. The supervisor is using git/PR state as the *input* to its loop, not constructing a separate audit log. Git history remains the source of truth for what happened.
## Where to be conservative
A few design defaults worth holding:
- **No auto-respawn.** The supervisor surfaces PR feedback to a human, never to the bottle's next prompt. The autonomous flow (review-comment → tear down → relaunch with the comment prepended) was considered and rejected: in a public-ish repo, any commenter could inject content that the next launch would treat as system instructions, with the agent's full bottle privileges. Available mitigations — commenter allowlists, prompt-injection regex screens, private-repo defaults — are all soft. The load-bearing defense is to keep the human between the review comment and any agent prompt. Notify-only is the only mode.
- **Idle detection is harder than it looks.** Last-log-line-age works ~80% of the time. Codeman's Ralph Loop tracker (watching for `<promise>` tags) is more accurate but adds complexity and tooling-coupling. Start with the dumb version; add heuristics only when actual confusion arises.
- **No web UI.** A browser UI reintroduces the privileged-channel problem — the browser talks to a server that talks to all bottles. TUI sidesteps it because the supervisor runs in the user's own shell context, not as a long-running daemon serving multiple consumers.
- **State file in `~/.claude-bottle/`, not inside any bottle.** The mapping of bottle → PR → status lives next to the manifest. Nothing about the supervisor's bookkeeping enters a bottle.
- **No new credentials on bottles.** PR-watch is a host-side concern. A bottle's manifest *names* the upstream/branch to watch; it does not grant the bottle the ability to read PR state itself.
## Trust-model edge cases worth flagging
- **Cross-host supervisor.** If the supervisor ever grows to coordinate bottles on multiple hosts, the trust analysis changes — the polling token now has to travel, and the "host" boundary is no longer one machine. Out of scope for v1; flag in any future design doc that contemplates it.
- **Native Claude Code OpenTelemetry as an alternative observability path.** Setting `CLAUDE_CODE_ENABLE_TELEMETRY=1` inside a bottle would emit OTel data — but the bottle's pipelock allowlist must then include the OTel collector's host, and the trace data is generated by the (untrusted) agent rather than by the (trusted) host-side perimeter. The supervisor-via-Docker-plus-pipelock-logs path is the cleaner observability spine for this project's threat model; in-bottle OTel is an extra signal worth weighing case-by-case, not the default.
## Scope estimate
The full `status` / `watch` / `supervise` trio is plausibly ~500 lines of Python on top of the existing CLI, no new runtimes, no new daemons, no new ports, and (with `curses`) no new deps. That fits "Low dependencies by default. The project is Python, stdlib-first; ask before adding new tools, runtimes, or package managers" without requiring an exception.
Phased: `status` first (purely additive, no design decisions), then `watch` (the design decisions are mostly UX, not architecture), then `supervise` (the only layer that introduces a new behavioral default and warrants a PRD of its own).
## Conclusion
A supervisor that respects the bottle wall is a small natural extension of what claude-bottle already is, not a category shift toward Mission Control / Codeman / Composio AO. The mistake in earlier framing was treating "supervisor" as synonymous with "dashboard SPA." The trust-model question that disqualifies the SPA tier (privileged channel into every bottle) does not apply to a TUI that reads host-side signals and shells out to the existing CLI.
Recommendation: build `status` and `watch` opportunistically when the pain is felt; treat `supervise` as a separate PRD before implementation, scoped to notify-only (no autonomous loop from review comment to next agent prompt — see "Where to be conservative").