bot-bottle/docs/prds/0012-stuck-agent-recovery-flow.md
didericis 49082dfadf
docs(prd-0012): adopt text-only notify protocol + SIGHUP routes reload
Rewrites Scope, Proposed Design, Data model, and Open questions to
match the model where /supervise/notify is text-in/text-out, routes
edits + SIGHUP reload are supervisor-side tooling, and manifest
rebuilds are the heavy path. Adds the per-bottle routes-edit audit log.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 01:36:29 -04:00

PRD 0012: Stuck-agent recovery flow

  • Status: Draft
  • Author: didericis
  • Created: 2026-05-24

Summary

When an agent running inside a claude-bottle container gets blocked, it signals via the /supervise/notify endpoint on the per-bottle cred-proxy sidecar. The supervisor sees the message in a host-side TUI and responds in one of three ways: a text hint (resolves the block in place; the agent continues), a cred-proxy config swap (the supervisor edits routes.json, SIGHUP-reloads cred-proxy, and replies with a "try again" hint), or, for the heavier case where the bottle's manifest itself needs to change, an approved manifest diff that triggers a rebuild of the bottle on the same branch. The supervisor never opens a live channel into a running bottle; all signal flow goes through the existing internal-network endpoint that cred-proxy already terminates.

Problem

Running agents in isolated bottles makes it cheap to spin up work in parallel but expensive to recover when one of them gets stuck. Today, if a bottle is missing a permission or a tool the agent needs to make progress, the only options are to kill the container and start over (losing work) or to open a live channel into the bottle and fix it in place (breaking the sandbox property that makes bottles trustworthy in the first place). The user feels this directly whenever a parallel run blocks on something the manifest didn't anticipate.

Goals / Success Criteria

A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission, posts a PR comment describing the ask, the user reviews the request in a dashboard, approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running docker attach or opening any live channel into the original container.

Non-goals

  • Live attach or in-place mutation of running containers. The whole design exists to avoid this.
  • Agent-to-agent communication. Re-stated from the project's existing non-goals; the recovery flow is human→agent only.
  • Auditing or forensic replay of agent runs. Git/forge history is the audit log; this PRD does not add a separate run log.
  • Reducing time-to-unstuck below some target. Faster than kill-and-restart is implicit, but no specific SLO is in scope.

Scope

In scope

  • A /stuck slash command the agent invokes when blocked. POSTs free-text to cred-proxy's /supervise/notify and blocks awaiting a text reply.
  • A /supervise/notify endpoint on cred-proxy that persists the agent's message host-side and holds the agent's connection open until the supervisor responds. Wire protocol is text-only: request is the agent's message; response is {text: "..."}.
  • SIGHUP-based hot reload of routes.json on cred-proxy, so the supervisor can change the agent's credential surface without restarting the proxy or dropping in-flight calls.
  • A TUI dashboard that lists running bottles and pending stuck-notifications. Two verbs:
    • r <id> <text> — reply to a pending stuck message (delivers text to the waiting agent).
    • routes edit <bottle> — open the bottle's routes.json in $EDITOR, SIGHUP cred-proxy on save. Not gated on a pending message; the supervisor can edit routes anytime.
  • A host-side audit log at ~/.claude-bottle/audit/cred-proxy-<slug>.log that records every routes.json edit: timestamp, diff before/after, the operator's reply text if the edit was tied to a /stuck reply. Logs route shape, never secret values.
  • A rebuild path for the heavier case where the bottle's manifest (not just routes) must change. Orchestrator tears down the bottle, applies the approved manifest diff, and starts a replacement bottle on the same branch.
  • A state-preservation helper for the rebuild path: working tree push is mandatory; transcript / reasoning context is best-effort.
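
The agent side of the /stuck exchange described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped skill: the sidecar hostname and port, the plain-text request body, and the reading of the {text: "..."} response as JSON are all assumptions, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical address of the per-bottle cred-proxy sidecar on the internal network.
NOTIFY_URL = "http://cred-proxy:8080/supervise/notify"


def build_request(message: str, url: str = NOTIFY_URL) -> urllib.request.Request:
    """Build the POST whose body is the agent's free-text message."""
    return urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def parse_reply(body: bytes) -> str:
    """Extract the supervisor's text, assuming a JSON body shaped {"text": "..."}."""
    reply = json.loads(body.decode("utf-8"))
    if not isinstance(reply, dict) or not isinstance(reply.get("text"), str):
        raise ValueError("expected a JSON object with a string 'text' field")
    return reply["text"]


def stuck(message: str, url: str = NOTIFY_URL, timeout: Optional[float] = None) -> str:
    """POST the message and block until the supervisor answers.

    timeout=None waits indefinitely, which matches an endpoint that holds
    the connection open until a human replies.
    """
    with urllib.request.urlopen(build_request(message, url), timeout=timeout) as resp:
        return parse_reply(resp.read())
```

Because the reply is only text, the caller can hand the return value of `stuck(...)` straight back to the agent without inspecting what the supervisor did on the host.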

Out of scope

  • A tool-denial hook that auto-detects "stuck" without the agent's involvement. Deferred to a follow-up; v1 is opt-in via the slash command.
  • A web dashboard. TUI only in v1.
  • Live channel into running containers (see Non-goals).
  • Agent-to-agent communication (see Non-goals).
  • Auditing / forensic replay (see Non-goals).

Proposed Design

New services / components

  • /stuck slash command. Shipped as a skill mounted into bottles. POSTs the agent's free-text message to cred-proxy's /supervise/notify and blocks awaiting a text reply. Reply text is handed back to the agent verbatim — the agent doesn't need to know whether the supervisor edited routes, opened an editor, or did anything else before composing the reply.
  • cred-proxy /supervise/notify endpoint. Receives the agent's message, persists it to a host-mounted queue, and holds the agent's connection open until the supervisor responds. The wire protocol is text-only in both directions; the supervisor's side-effects (routes edit, manifest diff, no-op) are invisible to the agent.
  • cred-proxy SIGHUP reload. New behavior on the existing process: SIGHUP re-reads routes.json without dropping connections or breaking in-flight calls. ~30 lines added to the server.
  • TUI dashboard. A claude-bottle dashboard (or similarly named) command. Lists running bottles, surfaces pending stuck-notifications, exposes the r <id> <text> and routes edit <bottle> verbs, and (for the rebuild path) shows proposed manifest diffs with approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
  • Routes-edit audit log. ~/.claude-bottle/audit/cred-proxy-<slug>.log. Every routes.json edit appends: timestamp, diff of routes before/after, operator's reply text if tied to a /stuck reply. Records what the bottle's credential surface looked like at time T without storing the secret values themselves.
  • Rebuild orchestrator (heavy path). Used when the manifest itself must change, not just routes. On approval, snapshots state via the state-preservation helper, tears down the existing bottle, applies the approved manifest diff, and starts a fresh bottle on the same branch.
  • State-preservation helper. Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
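
One way the routes-edit audit log entry could be assembled is sketched below. It is an illustration only: the one-JSON-object-per-line format, the field names, and the `SECRET_KEYS` set are hypothetical, since the PRD fixes what is recorded (timestamp, before/after diff, optional reply text, no secret values) but not how.

```python
import difflib
import json
import os
from datetime import datetime, timezone

# Hypothetical key names treated as secret; the real routes.json schema is not defined here.
SECRET_KEYS = {"token", "secret", "password", "api_key"}


def audit_log_path(slug: str) -> str:
    """Path from the PRD: ~/.claude-bottle/audit/cred-proxy-<slug>.log."""
    return os.path.expanduser(f"~/.claude-bottle/audit/cred-proxy-{slug}.log")


def redact(node):
    """Keep the structure of the routes, replace values under secret keys."""
    if isinstance(node, dict):
        return {k: "<redacted>" if k in SECRET_KEYS else redact(v) for k, v in node.items()}
    if isinstance(node, list):
        return [redact(v) for v in node]
    return node


def audit_entry(before, after, reply_text=None, now=None) -> str:
    """Return one log line: timestamp, redacted diff, and optional reply text."""
    now = now or datetime.now(timezone.utc)
    old = json.dumps(redact(before), indent=2, sort_keys=True).splitlines()
    new = json.dumps(redact(after), indent=2, sort_keys=True).splitlines()
    diff = list(difflib.unified_diff(
        old, new, fromfile="routes.json (before)", tofile="routes.json (after)", lineterm=""
    ))
    return json.dumps({"ts": now.isoformat(), "diff": diff, "reply": reply_text})


def append_entry(log_path: str, entry: str) -> None:
    """Append-only write: open in 'a' mode and never rewrite earlier lines."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(entry + "\n")
```

Redacting before diffing means an edit that only rotates a secret produces an empty diff, which is consistent with logging route shape rather than values; whether such an edit should still leave a timestamped entry is a design choice this sketch does not settle.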

Existing code touched

  • cred-proxy (PRD 0010) — gains the /supervise/notify endpoint, the host-mounted notification queue, and SIGHUP reload of routes.json.
  • cli.py — gains the dashboard subcommand (with r and routes edit verbs) and the rebuild path.
  • Bottle lifecycle scripts — extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
  • Bottle manifest schema — may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
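
The SIGHUP reload that cred-proxy gains could look something like the sketch below. It assumes the server keeps its parsed routes in a single object and that each request reads one snapshot up front; `RouteTable` and `install_sighup_reload` are hypothetical names, not existing cred-proxy code.

```python
import json
import signal


class RouteTable:
    """Holds the parsed routes.json and swaps it wholesale on reload."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._routes = self._load()

    def _load(self):
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)

    def current(self):
        # A request takes one snapshot and uses it for its whole lifetime,
        # so a reload never changes routes under an in-flight call.
        return self._routes

    def reload(self) -> bool:
        """Re-read routes.json; on any read or parse error keep the old routes."""
        try:
            fresh = self._load()
        except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
            return False
        self._routes = fresh  # single reference rebind; old snapshots stay valid
        return True


def install_sighup_reload(table: RouteTable) -> None:
    """Register the reload as the SIGHUP handler (call from the main thread)."""
    signal.signal(signal.SIGHUP, lambda signum, frame: table.reload())
```

Keeping the previous table when the new file fails to parse is a deliberate choice in this sketch: a typo saved from $EDITOR should not take the bottle's credential surface down.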

Data model changes

  • A per-bottle pending-notification queue: cred-proxy holds the agent's open connection; the queue holds the metadata (id, bottle slug, message body, arrival timestamp) the TUI needs to render the ask.
  • A per-bottle routes.json audit log file at ~/.claude-bottle/audit/cred-proxy-<slug>.log, append-only.
  • A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the rebuild orchestrator knows where to post the comment and which branch to resume on.
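
The pending-notification queue might be modeled as in the sketch below. It is in-memory only and leaves out the host-mounted persistence; the class and function names are hypothetical. It also appends when the same bottle calls /stuck twice, which is just one of the options listed under Open questions, not a decision.

```python
import itertools
import threading
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingNotification:
    # Metadata the TUI needs to render the ask.
    id: int
    bottle_slug: str
    message: str
    arrived_at: float
    # Runtime-only state used to wake the held connection; not persisted.
    answered: threading.Event = field(default_factory=threading.Event, repr=False)
    reply_text: Optional[str] = field(default=None, repr=False)


class NotificationQueue:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ids = itertools.count(1)
        self._pending: Dict[int, PendingNotification] = {}

    def submit(self, bottle_slug: str, message: str) -> PendingNotification:
        """Record a new ask; repeated asks from one bottle are appended."""
        with self._lock:
            note = PendingNotification(next(self._ids), bottle_slug, message, time.time())
            self._pending[note.id] = note
            return note

    def pending(self) -> List[PendingNotification]:
        """Unanswered asks in arrival order, for the dashboard to list."""
        with self._lock:
            return sorted(self._pending.values(), key=lambda n: n.id)

    def reply(self, note_id: int, text: str) -> bool:
        """Deliver the supervisor's text and release the waiting handler."""
        with self._lock:
            note = self._pending.pop(note_id, None)
        if note is None:
            return False
        note.reply_text = text
        note.answered.set()
        return True


def wait_for_reply(note: PendingNotification, timeout: Optional[float] = None) -> Optional[str]:
    """Block the request handler until a reply arrives; None on timeout."""
    if note.answered.wait(timeout):
        return note.reply_text
    return None
```

In this shape the request handler would call `submit(...)` and then `wait_for_reply(...)`, while the dashboard's reply verb maps onto `reply(id, text)`.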

External dependencies

  • The Gitea API / tea CLI is already in the toolbox (the project is on Gitea); no new auth surface beyond what the orchestrator already needs to read/post on PRs.
  • A TUI library is a maybe — only if stdlib can't carry the dashboard experience. Default to no new dependency.

Open questions

  • SIGHUP race window. An agent that retries within milliseconds of the SIGHUP may hit the old routes once before the reload completes, fail, and then retry against the new routes. The assumption is that normal HTTP retry semantics absorb this; worth confirming under real usage rather than designing around it preemptively.
  • Multiple pending notifications from the same bottle. If the agent calls /stuck again before the prior message is answered, what does the queue do — replace, append, or refuse? Append feels safest; replace is wrong (loses context); refuse forces the agent to handle a new error mode.
  • Verb naming under load. r <id> <text> optimizes for muscle memory mid-incident; reply <id> <text> reads better cold. Worth picking once and committing.
  • Best-effort transcript preservation on the rebuild path. Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
  • Tool-denial auto-detection. Should v1 also ship a tool-denial hook that auto-invokes /stuck without the agent's involvement, or strictly the agent-initiated form? Currently deferred; the boundary is worth confirming during implementation.
  • Rejection semantics on the rebuild path. Does the agent receive a /stuck reply explaining the rejection, or does the bottle just stay torn down?
  • Bottle → PR/branch mapping. Recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
  • One-off exceptions to gitlock / pipelock denials. How does the flow handle them, for example a commit that includes docs with intentionally bogus tokens that the secret scanner correctly flags? The shape (agent blocked → /stuck → operator decides → reply) is the same, but the resolution differs: a per-operation override or a scoped allowlist entry, not a routes edit or a manifest change. Does the operator express the exception by commit SHA, by content hash, or by a narrow allowlist rule? Either way, the approval must be auditable so a future reader can see what was waived and why. See docs/research/git-gate-commit-approval.md for a survey of gitleaks's native allowlist primitives and a recommendation.

References

  • PRD 0010 — cred-proxy (the endpoint extended to carry stuck-requests).
  • CLAUDE.md — project non-goal on agent-to-agent communication; this PRD stays on the human→agent side of that line.