From a6222aaa57e0c4a4abcd5465cdf07883e5124f2f Mon Sep 17 00:00:00 2001
From: didericis
Date: Mon, 25 May 2026 01:36:29 -0400
Subject: [PATCH] docs(prd-0012): adopt text-only notify protocol + SIGHUP routes reload

Rewrites Scope, Proposed Design, Data model, and Open questions to match
the model where /supervise/notify is text-in/text-out, routes edits +
SIGHUP reload are supervisor-side tooling, and manifest rebuilds are the
heavy path. Adds the per-bottle routes-edit audit log.

Co-Authored-By: Claude Opus 4.7
---
 docs/prds/0012-stuck-agent-recovery-flow.md | 48 +++++++++++++--------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/docs/prds/0012-stuck-agent-recovery-flow.md b/docs/prds/0012-stuck-agent-recovery-flow.md
index 41a142a..db493a4 100644
--- a/docs/prds/0012-stuck-agent-recovery-flow.md
+++ b/docs/prds/0012-stuck-agent-recovery-flow.md
@@ -6,7 +6,7 @@
 
 ## Summary
 
-When an agent running inside a claude-bottle container gets blocked by a missing permission, tool, or skill, it asks for help via a PR comment; the user approves a manifest change in a TUI dashboard; the orchestrator rebuilds the container from the new manifest and resumes work on the same branch — without ever opening a live channel into the running bottle.
+When an agent running inside a claude-bottle container gets blocked, it signals via the per-bottle cred-proxy sidecar's `/supervise/notify` endpoint. The supervisor sees the message in a host-side TUI and responds with either a text hint (resolves the block in-place, the agent continues), a cred-proxy config swap (supervisor edits `routes.json`, SIGHUP-reloads cred-proxy, replies with a "try again" hint), or — for the heavier case where the bottle's manifest itself needs to change — an approved manifest diff that triggers a rebuild of the bottle on the same branch. The supervisor never opens a live channel into a running bottle; all signal flow goes through the existing internal-network endpoint that cred-proxy already terminates.
 
 ## Problem
 
@@ -27,11 +27,15 @@ A real stuck agent recovers end-to-end through the flow: the agent hits a missin
 
 ### In scope
 
-- A `/request-bottle-change` slash command the agent invokes when it knows it's blocked.
-- A TUI dashboard that lists running bottles and pending change requests, and takes approve/reject input from the user.
-- A rebuild orchestrator that tears down the old bottle, applies the approved manifest change, and starts a replacement bottle on the same branch.
-- A state-preservation helper that carries forward what it can across the rebuild (working tree is mandatory; transcript / reasoning context is best-effort).
-- A stuck-signal mechanism that does not require a forge token inside the bottle: the agent's slash command sends the request to the existing cred-proxy endpoint, which (with a host-mounted volume) writes the sentinel artifact on the host side. The orchestrator polls that artifact and posts the PR comment from outside the bottle.
+- A `/stuck` slash command the agent invokes when blocked. POSTs free-text to cred-proxy's `/supervise/notify` and blocks awaiting a text reply.
+- A `/supervise/notify` endpoint on cred-proxy that persists the agent's message host-side and holds the agent's connection open until the supervisor responds. Wire protocol is text-only: request is the agent's message; response is `{text: "..."}`.
+- SIGHUP-based hot reload of `routes.json` on cred-proxy, so the supervisor can change the agent's credential surface without restarting the proxy or dropping in-flight calls.
+- A TUI dashboard that lists running bottles and pending stuck-notifications. Two verbs:
+  - `r ` — reply to a pending stuck message (delivers text to the waiting agent).
+  - `routes edit ` — open the bottle's `routes.json` in `$EDITOR`, SIGHUP cred-proxy on save. Not gated on a pending message; the supervisor can edit routes anytime.
+- A host-side audit log at `~/.claude-bottle/audit/cred-proxy-.log` that records every `routes.json` edit: timestamp, diff before/after, the operator's reply text if the edit was tied to a `/stuck` reply. Logs route shape, never secret values.
+- A rebuild path for the heavier case where the bottle's *manifest* (not just routes) must change. Orchestrator tears down the bottle, applies the approved manifest diff, and starts a replacement bottle on the same branch.
+- A state-preservation helper for the rebuild path: working tree push is mandatory; transcript / reasoning context is best-effort.
 
 ### Out of scope
 
@@ -45,22 +49,26 @@ A real stuck agent recovers end-to-end through the flow: the agent hits a missin
 
 ### New services / components
 
-- **`/request-bottle-change` slash command.** Shipped as a skill mounted into bottles. When the agent invokes it, the command POSTs a structured request (what's needed, why, what was tried) to the cred-proxy endpoint and halts the agent. The agent never touches the host filesystem.
-- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command that lists running bottles, surfaces pending change requests, shows the proposed manifest diff, and accepts approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
-- **Rebuild orchestrator.** The plumbing that, on approval, tears down the existing bottle, applies the approved manifest change, snapshots state via the state-preservation helper, and starts a fresh bottle on the same branch.
+- **`/stuck` slash command.** Shipped as a skill mounted into bottles. POSTs the agent's free-text message to cred-proxy's `/supervise/notify` and blocks awaiting a text reply. Reply text is handed back to the agent verbatim — the agent doesn't need to know whether the supervisor edited routes, opened an editor, or did anything else before composing the reply.
+- **cred-proxy `/supervise/notify` endpoint.** Receives the agent's message, persists it to a host-mounted queue, and holds the agent's connection open until the supervisor responds. The wire protocol is text-only in both directions; the supervisor's side-effects (routes edit, manifest diff, no-op) are invisible to the agent.
+- **cred-proxy SIGHUP reload.** New behavior on the existing process: SIGHUP re-reads `routes.json` without dropping connections or breaking in-flight calls. ~30 lines added to the server.
+- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command. Lists running bottles, surfaces pending stuck-notifications, exposes the `r ` and `routes edit ` verbs, and (for the rebuild path) shows proposed manifest diffs with approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
+- **Routes-edit audit log.** `~/.claude-bottle/audit/cred-proxy-.log`. Every `routes.json` edit appends: timestamp, diff of routes before/after, operator's reply text if tied to a `/stuck` reply. Records what the bottle's credential surface looked like at time T without storing the secret values themselves.
+- **Rebuild orchestrator (heavy path).** Used when the manifest itself must change, not just routes. On approval, tears down the existing bottle, applies the approved manifest diff, snapshots state via the state-preservation helper, and starts a fresh bottle on the same branch.
 - **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
 
 ### Existing code touched
 
-- **cred-proxy** (PRD 0010) — extended with an endpoint that accepts stuck-requests from inside a bottle and writes the sentinel artifact to a host-mounted volume.
-- **`cli.py`** — gains the dashboard subcommand and the rebuild path.
+- **cred-proxy** (PRD 0010) — gains the `/supervise/notify` endpoint, the host-mounted notification queue, and SIGHUP reload of `routes.json`.
+- **`cli.py`** — gains the dashboard subcommand (with `r` and `routes edit` verbs) and the rebuild path.
 - **Bottle lifecycle scripts** — extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
 - **Bottle manifest schema** — may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
 
 ### Data model changes
 
-- A new stuck-request artifact (probably JSON) written by the cred-proxy on behalf of the agent, with whatever fields the dashboard needs to render the ask.
-- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the orchestrator knows where to post the comment and which branch to resume on.
+- A per-bottle pending-notification queue: cred-proxy holds the agent's open connection; the queue holds the metadata (id, bottle slug, message body, arrival timestamp) the TUI needs to render the ask.
+- A per-bottle `routes.json` audit log file at `~/.claude-bottle/audit/cred-proxy-.log`, append-only.
+- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the rebuild orchestrator knows where to post the comment and which branch to resume on.
 
 ### External dependencies
 
@@ -69,12 +77,14 @@ A real stuck agent recovers end-to-end through the flow: the agent hits a missin
 
 ## Open questions
 
-- What exactly does best-effort transcript preservation look like? Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
-- Should v1 also ship the tool-denial hook (auto-detect stuck), or strictly the agent-initiated slash command? Currently deferred, but the line is worth confirming during implementation.
-- How does the dashboard handle rejection? Does the agent get a comment back saying "denied, here's why," or does the bottle just stay torn down?
-- How does the orchestrator know which PR / branch a given bottle maps to — recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
-- Concurrency: if multiple bottles request changes simultaneously, what does the dashboard surface and in what order?
-- How does the flow handle one-off exceptions to gitlock / pipelock denials — e.g. a commit that includes docs with intentionally-bogus tokens that the secret scanner correctly flags? The shape (agent blocked → ask via PR comment → user approves → continue) is the same as a manifest-change request, but the *resolution* is different: a per-operation override or a scoped allowlist entry, not a new manifest. Does this fold into the same `/request-bottle-change` slash command with a different request type, or is it a separate slash command (e.g. `/request-gate-exception`)? And how is an "exception" expressed safely — by commit SHA, by content hash, by a narrow allowlist rule? Either way, the approval must be auditable so a future reader can see what was waived and why. See `docs/research/git-gate-commit-approval.md` for a survey of gitleaks's native allowlist primitives and a recommendation.
+- SIGHUP race window. An agent that retries within msec of the SIGHUP may hit old routes once before the reload completes, fail, and retry against the new routes. Assumption is that normal HTTP retry semantics absorb this; worth confirming under real usage rather than designing around it preemptively.
+- Multiple pending notifications from the same bottle. If the agent calls `/stuck` again before the prior message is answered, what does the queue do — replace, append, or refuse? Append feels safest; replace is wrong (loses context); refuse forces the agent to handle a new error mode.
+- Verb naming under load. `r ` optimizes for muscle memory mid-incident; `reply ` reads better cold. Worth picking once and committing.
+- Best-effort transcript preservation on the rebuild path. Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
+- Tool-denial auto-detection. Should v1 also ship a tool-denial hook that auto-invokes `/stuck` without the agent's involvement, or strictly the agent-initiated form? Currently deferred; line worth confirming during implementation.
+- Rejection semantics on the rebuild path. Does the agent receive a `/stuck` reply explaining the rejection, or does the bottle just stay torn down?
+- Bottle → PR/branch mapping. Recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
+- How does the flow handle one-off exceptions to gitlock / pipelock denials — e.g. a commit that includes docs with intentionally-bogus tokens that the secret scanner correctly flags? The shape (agent blocked → `/stuck` → operator decides → reply) is the same, but the *resolution* differs: a per-operation override or a scoped allowlist entry, not a routes edit or a manifest change. Does the operator express the exception by commit SHA, by content hash, or by a narrow allowlist rule? Either way, the approval must be auditable so a future reader can see what was waived and why. See `docs/research/git-gate-commit-approval.md` for a survey of gitleaks's native allowlist primitives and a recommendation.
 
 ## References