docs(prd-0012): name the three stuck categories and add pipelock path

Introduces cred-proxy block, pipelock block, and capability gap as the
three named categories of stuck. Adds pipelock-edit support (restart-
based for v1) parallel to the existing cred-proxy routes-edit path,
plus a pipelock audit log. Broadens Goals to cover all three paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 01:47:24 -04:00
parent a6222aaa57
commit 66fc29c72e
+24 -10
@@ -6,7 +6,7 @@
## Summary
When an agent running inside a claude-bottle container gets blocked, it signals via the per-bottle cred-proxy sidecar's `/supervise/notify` endpoint. The supervisor sees the message in a host-side TUI and responds with either a text hint (resolves the block in-place, the agent continues), a cred-proxy config swap (supervisor edits `routes.json`, SIGHUP-reloads cred-proxy, replies with a "try again" hint), or — for the heavier case where the bottle's manifest itself needs to change — an approved manifest diff that triggers a rebuild of the bottle on the same branch. The supervisor never opens a live channel into a running bottle; all signal flow goes through the existing internal-network endpoint that cred-proxy already terminates.
When an agent running inside a claude-bottle container gets blocked, it signals via the per-bottle cred-proxy sidecar's `/supervise/notify` endpoint. The supervisor sees the message in a host-side TUI and responds with one of four shapes: a text hint (no infrastructure change, the agent continues); a cred-proxy routes edit (SIGHUP-reload of cred-proxy, agent retries); a pipelock allowlist edit (restart pipelock, agent retries); or an approved manifest diff that triggers a full rebuild of the bottle on the same branch. The last three map to the three categories of stuck (**cred-proxy block**, **pipelock block**, and **capability gap**) described below. The supervisor never opens a live channel into a running bottle; all signal flow goes through the existing internal-network endpoint that cred-proxy already terminates.
## Problem
@@ -14,7 +14,7 @@ Running parallel agents in isolated bottles makes it cheap to spin up work in pa
## Goals / Success Criteria
A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission, posts a PR comment describing the ask, the user reviews the request in a dashboard, approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running `docker attach` or opening any live channel into the original container.
A real stuck agent recovers end-to-end in each of the three categories: a **cred-proxy block** is fixed by a `routes edit` + SIGHUP and a "retry now" reply without restarting anything; a **pipelock block** is fixed by an allowlist edit + pipelock restart and a "retry now" reply; a **capability gap** triggers a manifest-diff approval and a bottle rebuild that picks up on the same branch. All three complete without anyone running `docker attach` or opening any live channel into the original container.
## Non-goals
@@ -33,8 +33,9 @@ A real stuck agent recovers end-to-end through the flow: the agent hits a missin
- A TUI dashboard that lists running bottles and pending stuck-notifications. Two verbs:
- `r <id> <text>` — reply to a pending stuck message (delivers text to the waiting agent).
- `routes edit <bottle>` — open the bottle's `routes.json` in `$EDITOR`, SIGHUP cred-proxy on save. Not gated on a pending message; the supervisor can edit routes anytime.
- A host-side audit log at `~/.claude-bottle/audit/cred-proxy-<slug>.log` that records every `routes.json` edit: timestamp, diff before/after, the operator's reply text if the edit was tied to a `/stuck` reply. Logs route shape, never secret values.
- A rebuild path for the heavier case where the bottle's *manifest* (not just routes) must change. Orchestrator tears down the bottle, applies the approved manifest diff, and starts a replacement bottle on the same branch.
- Equivalent support for pipelock: a `pipelock edit <bottle>` TUI verb that opens pipelock's allowlist in `$EDITOR` and restarts pipelock on save. (v1 uses restart, not SIGHUP — see Open questions.)
- Host-side audit logs at `~/.claude-bottle/audit/cred-proxy-<slug>.log` and `~/.claude-bottle/audit/pipelock-<slug>.log` that record every config edit: timestamp, diff before/after, the operator's reply text if the edit was tied to a `/stuck` reply. Records config shape, never secret values.
- A rebuild path for the **capability gap** case where the bottle's *manifest* (not just routes or pipelock allowlist) must change. Orchestrator tears down the bottle, applies the approved manifest diff, and starts a replacement bottle on the same branch.
- A state-preservation helper for the rebuild path: working tree push is mandatory; transcript / reasoning context is best-effort.
### Out of scope
@@ -47,27 +48,39 @@ A real stuck agent recovers end-to-end through the flow: the agent hits a missin
## Proposed Design
### Stuck categories
Three named categories, ordered by remediation cost:
- **cred-proxy block.** The agent's request was refused by cred-proxy — missing route, expired token, wrong scope. The bottle is otherwise healthy. *Remediation:* operator runs `routes edit <bottle>`, edits `routes.json`, saves. cred-proxy SIGHUP-reloads; in-flight connections are not dropped. Operator replies to the `/stuck` message with a "retry now" hint. The agent retries against the (now-reloaded) cred-proxy and proceeds.
- **pipelock block.** The agent's outbound request was refused by pipelock — host not in the allowlist, protocol not permitted, etc. The bottle is otherwise healthy, but the egress perimeter is wrong. *Remediation:* operator runs `pipelock edit <bottle>`, edits the allowlist, saves. pipelock restarts; the agent's in-flight outbound calls may drop and need retry. Operator replies to the `/stuck` message with a "retry now" hint. (v1 uses restart; SIGHUP reload for pipelock is an Open question.)
- **capability gap.** The bottle is missing something the agent needs that lives in the manifest itself — a tool, a skill, a permission grant, an env var. Routes and pipelock are correct; the agent container just doesn't have the capability. *Remediation:* operator approves a manifest diff in the TUI. The rebuild orchestrator tears down the bottle, applies the diff, and starts a replacement bottle on the same branch via the state-preservation helper. The replacement agent picks up where the original was, now with the missing capability.
The wire protocol does not change between categories: the agent POSTs free text to `/supervise/notify` and receives `{text: "..."}`. The category is the operator's mental model for triage, not a field on the request. The agent does not need to know which category its message will fall into.
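From the agent's side, the exchange described above might look roughly like the following. This is a minimal sketch, not the shipped `/stuck` skill: only the `/supervise/notify` path, the free-text request, and the `{text: "..."}` reply shape come from this PRD. The base URL, the `text/plain` content type, and every function name are assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical sidecar address; the PRD does not specify host or port.
CRED_PROXY_URL = "http://cred-proxy:8080"


def build_notify_request(message: str, base_url: str = CRED_PROXY_URL) -> urllib.request.Request:
    """Build the POST that carries the agent's free-text message."""
    return urllib.request.Request(
        url=f"{base_url}/supervise/notify",
        data=message.encode("utf-8"),
        # Assumed content type: the PRD only says the body is free text.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def parse_reply(body: bytes) -> str:
    """Extract the supervisor's reply text from a {"text": "..."} body."""
    reply = json.loads(body.decode("utf-8"))
    return reply["text"]


def notify_and_wait(message: str, base_url: str = CRED_PROXY_URL) -> str:
    """Send the message and block until the supervisor replies."""
    request = build_notify_request(message, base_url)
    # No timeout is passed: the endpoint holds the connection open
    # until a human answers, which may take a while.
    with urllib.request.urlopen(request) as response:
        return parse_reply(response.read())
```

The request carries no category field, which matches the point above: triage happens on the operator's side, and the agent only ever sees the reply text.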
### New services / components
- **`/stuck` slash command.** Shipped as a skill mounted into bottles. POSTs the agent's free-text message to cred-proxy's `/supervise/notify` and blocks awaiting a text reply. Reply text is handed back to the agent verbatim — the agent doesn't need to know whether the supervisor edited routes, opened an editor, or did anything else before composing the reply.
- **cred-proxy `/supervise/notify` endpoint.** Receives the agent's message, persists it to a host-mounted queue, and holds the agent's connection open until the supervisor responds. The wire protocol is text-only in both directions; the supervisor's side-effects (routes edit, manifest diff, no-op) are invisible to the agent.
- **cred-proxy SIGHUP reload.** New behavior on the existing process: SIGHUP re-reads `routes.json` without dropping connections or breaking in-flight calls. ~30 lines added to the server.
- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command. Lists running bottles, surfaces pending stuck-notifications, exposes the `r <id> <text>` and `routes edit <bottle>` verbs, and (for the rebuild path) shows proposed manifest diffs with approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
- **Routes-edit audit log.** `~/.claude-bottle/audit/cred-proxy-<slug>.log`. Every `routes.json` edit appends: timestamp, diff of routes before/after, operator's reply text if tied to a `/stuck` reply. Records what the bottle's credential surface looked like at time T without storing the secret values themselves.
- **Rebuild orchestrator (heavy path).** Used when the manifest itself must change, not just routes. On approval, tears down the existing bottle, applies the approved manifest diff, snapshots state via the state-preservation helper, and starts a fresh bottle on the same branch.
- **cred-proxy SIGHUP reload.** New behavior on the existing process: SIGHUP re-reads `routes.json` without dropping connections or breaking in-flight calls. ~30 lines added to the server. Used by the **cred-proxy block** category.
- **pipelock edit + restart.** v1 ships restart-based reload for pipelock: on `pipelock edit <bottle>` save, the supervisor writes the new allowlist and restarts the pipelock container. The agent's in-flight outbound calls drop and rely on retry. Used by the **pipelock block** category.
- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command. Lists running bottles, surfaces pending stuck-notifications, exposes the `r <id> <text>`, `routes edit <bottle>`, and `pipelock edit <bottle>` verbs, and (for the **capability gap** category) shows proposed manifest diffs with approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
- **Config-edit audit logs.** `~/.claude-bottle/audit/cred-proxy-<slug>.log` and `~/.claude-bottle/audit/pipelock-<slug>.log`. Every edit appends: timestamp, diff before/after, operator's reply text if tied to a `/stuck` reply. Records what the bottle's credential surface and egress perimeter looked like at time T without storing secret values.
- **Rebuild orchestrator (capability-gap path).** Used when the manifest itself must change, not just routes or the pipelock allowlist. On approval, snapshots state via the state-preservation helper, tears down the existing bottle, applies the approved manifest diff, and starts a fresh bottle on the same branch.
- **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
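The SIGHUP reload listed above can be sketched as follows. This is an illustration under stated assumptions, not cred-proxy's actual code: the `RouteTable` class, the `install_sighup_handler` helper, and the choice to keep the old routes when the new file fails to parse are all hypothetical. The schema of `routes.json` is not specified here, so the sketch treats it as an opaque JSON object.

```python
import json
import signal
from pathlib import Path


class RouteTable:
    """Holds the parsed routes and swaps them in one step on reload."""

    def __init__(self, path: Path) -> None:
        self._path = Path(path)
        self._routes = self._read()

    def _read(self) -> dict:
        with self._path.open(encoding="utf-8") as handle:
            return json.load(handle)

    def snapshot(self) -> dict:
        """Return the routes a new request should use.

        A request that already took a snapshot keeps using it, so a
        reload does not disturb in-flight calls.
        """
        return self._routes

    def reload(self) -> bool:
        """Re-read the file; keep the old routes if the new file is unusable."""
        try:
            fresh = self._read()
        except (OSError, ValueError):
            # ValueError covers json.JSONDecodeError and bad encodings.
            return False
        # Rebinding one attribute is a single reference swap in CPython,
        # so readers see either the old table or the new one, never a mix.
        self._routes = fresh
        return True


def install_sighup_handler(table: RouteTable) -> None:
    """Reload on SIGHUP. POSIX only; must be called from the main thread."""
    signal.signal(signal.SIGHUP, lambda signum, frame: table.reload())
```

Keeping the previous table on a failed parse means a typo saved from `$EDITOR` leaves the bottle on its last good routes rather than on none; whether cred-proxy should behave this way is a design choice this PRD does not settle.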
### Existing code touched
- **cred-proxy** (PRD 0010) — gains the `/supervise/notify` endpoint, the host-mounted notification queue, and SIGHUP reload of `routes.json`.
- **`cli.py`** — gains the dashboard subcommand (with `r` and `routes edit` verbs) and the rebuild path.
- **pipelock** — gains a clean restart path that picks up the new allowlist on container restart. No code changes likely needed if pipelock already reads its config on startup; the orchestration is supervisor-side.
- **`cli.py`** — gains the dashboard subcommand (with `r`, `routes edit`, and `pipelock edit` verbs) and the rebuild path.
- **Bottle lifecycle scripts** — extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
- **Bottle manifest schema** — may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
### Data model changes
- A per-bottle pending-notification queue: cred-proxy holds the agent's open connection; the queue holds the metadata (id, bottle slug, message body, arrival timestamp) the TUI needs to render the ask.
- A per-bottle `routes.json` audit log file at `~/.claude-bottle/audit/cred-proxy-<slug>.log`, append-only.
- Per-bottle config audit log files at `~/.claude-bottle/audit/cred-proxy-<slug>.log` and `~/.claude-bottle/audit/pipelock-<slug>.log`, append-only.
- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the rebuild orchestrator knows where to post the comment and which branch to resume on.
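One way to write the append-only audit entries while logging config shape rather than secret values is sketched below. The JSON-lines layout, the field names, and the helpers `shape_of`, `shape_diff`, and `append_audit_entry` are assumptions for illustration; the PRD specifies only the log paths and the three recorded items (timestamp, diff before/after, optional reply text).

```python
import difflib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def shape_of(value):
    """Replace every leaf value with its type name so no secret is logged."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in value.items()}
    if isinstance(value, list):
        return [shape_of(item) for item in value]
    return type(value).__name__


def shape_diff(before: dict, after: dict) -> List[str]:
    """Unified diff of the two config shapes, one string per diff line."""
    old = json.dumps(shape_of(before), indent=2, sort_keys=True).splitlines()
    new = json.dumps(shape_of(after), indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old, new, "before", "after", lineterm=""))


def append_audit_entry(
    log_path: Path, before: dict, after: dict, reply_text: Optional[str] = None
) -> dict:
    """Append one JSON line: timestamp, shape diff, optional /stuck reply."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "diff": shape_diff(before, after),
        "reply_text": reply_text,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

Masking every leaf is stricter than the PRD requires: an edit that changes only a value (say, one hostname for another) produces an empty diff. An allowlist of known non-secret keys would keep more detail, at the cost of having to maintain that list.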
### External dependencies
@@ -78,6 +91,7 @@ A real stuck agent recovers end-to-end through the flow: the agent hits a missin
## Open questions
- SIGHUP race window. An agent that retries within milliseconds of the SIGHUP may hit the old routes once before the reload completes, fail, and then retry against the new routes. The assumption is that normal HTTP retry semantics absorb this; it is worth confirming under real usage rather than designing around it preemptively.
- SIGHUP reload for pipelock. v1 ships restart-based reload, which drops in-flight outbound calls. Should pipelock gain SIGHUP support so **pipelock block** is as cheap as **cred-proxy block**? Depends on how often the operator edits the allowlist mid-task and how disruptive a pipelock bounce actually is.
- Multiple pending notifications from the same bottle. If the agent calls `/stuck` again before the prior message is answered, what does the queue do — replace, append, or refuse? Append feels safest; replace is wrong (loses context); refuse forces the agent to handle a new error mode.
- Verb naming under load. `r <id> <text>` optimizes for muscle memory mid-incident; `reply <id> <text>` reads better cold. Worth picking once and committing.
- Best-effort transcript preservation on the rebuild path. Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?