diff --git a/docs/prds/0012-stuck-agent-recovery-flow.md b/docs/prds/0012-stuck-agent-recovery-flow.md
new file mode 100644
index 0000000..dd3867b
--- /dev/null
+++ b/docs/prds/0012-stuck-agent-recovery-flow.md
@@ -0,0 +1,81 @@
+# PRD 0012: Stuck-agent recovery flow
+
+- **Status:** Draft
+- **Author:** didericis
+- **Created:** 2026-05-24
+
+## Summary
+
+When an agent running inside a claude-bottle container gets blocked by a missing permission, tool, or skill, it asks for help via a PR comment; the user approves a manifest change in a TUI dashboard; the orchestrator rebuilds the container from the new manifest and resumes work on the same branch, without ever opening a live channel into the running bottle.
+
+## Problem
+
+Running agents in isolated bottles makes it cheap to spin up work in parallel but expensive to recover when an agent gets stuck. Today, if a bottle is missing a permission or a tool the agent needs to make progress, the only options are to kill the container and start over (losing work) or open a live channel into the bottle to fix it in place (breaking the sandbox property that makes bottles trustworthy in the first place). The user feels this directly whenever a parallel run blocks on something the manifest didn't anticipate.
+
+## Goals / Success Criteria
+
+A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission, posts a PR comment describing the ask, the user reviews the request in a dashboard, approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running `docker attach` or opening any live channel into the original container.
+
+## Non-goals
+
+- Live attach or in-place mutation of running containers. The whole design exists to avoid this.
+- Agent-to-agent communication. Restated from the project's existing non-goals; the recovery flow is human→agent only.
+- Auditing or forensic replay of agent runs. Git/forge history is the audit log; this PRD does not add a separate run log.
+- Reducing time-to-unstuck below some target. Faster than kill-and-restart is implicit, but no specific SLO is in scope.
+
+## Scope
+
+### In scope
+
+- A `/request-bottle-change` slash command the agent invokes when it knows it's blocked.
+- A TUI dashboard that lists running bottles and pending change requests, and takes approve/reject input from the user.
+- A rebuild orchestrator that tears down the old bottle, applies the approved manifest change, and starts a replacement bottle on the same branch.
+- A state-preservation helper that carries forward what it can across the rebuild (working tree is mandatory; transcript / reasoning context is best-effort).
+- A stuck-signal mechanism that does not require a forge token inside the bottle: the agent's slash command sends the request to the existing cred-proxy endpoint, which (with a host-mounted volume) writes the sentinel artifact on the host side. The orchestrator polls that artifact and posts the PR comment from outside the bottle.
+
+### Out of scope
+
+- A tool-denial hook that auto-detects "stuck" without the agent's involvement. Deferred to a follow-up; v1 is opt-in via the slash command.
+- A web dashboard. TUI only in v1.
+- Live channel into running containers (see Non-goals).
+- Agent-to-agent communication (see Non-goals).
+- Auditing / forensic replay (see Non-goals).
+
+## Proposed Design
+
+### New services / components
+
+- **`/request-bottle-change` slash command.** Shipped as a skill mounted into bottles. When the agent invokes it, the command POSTs a structured request (what's needed, why, what was tried) to the cred-proxy endpoint and halts the agent. The agent never touches the host filesystem.
+- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command that lists running bottles, surfaces pending change requests, shows the proposed manifest diff, and accepts approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
+- **Rebuild orchestrator.** The plumbing that, on approval, snapshots state via the state-preservation helper, tears down the existing bottle, applies the approved manifest change, and starts a fresh bottle on the same branch.
+- **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
+
+### Existing code touched
+
+- **cred-proxy** (PRD 0010): extended with an endpoint that accepts stuck-requests from inside a bottle and writes the sentinel artifact to a host-mounted volume.
+- **`cli.py`**: gains the dashboard subcommand and the rebuild path.
+- **Bottle lifecycle scripts**: extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
+- **Bottle manifest schema**: may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
+
+### Data model changes
+
+- A new stuck-request artifact (probably JSON) written by the cred-proxy on behalf of the agent, with whatever fields the dashboard needs to render the ask.
+- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the orchestrator knows where to post the comment and which branch to resume on.
+
+### External dependencies
+
+- The Gitea API / `tea` CLI is already in the toolbox (the project is on Gitea); no new auth surface beyond what the orchestrator already needs to read/post on PRs.
+- A TUI library is a *maybe*: only if stdlib can't carry the dashboard experience. Default to no new dependency.
+
+## Open questions
+
+- What exactly does best-effort transcript preservation look like? Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
+- Should v1 also ship the tool-denial hook (auto-detect stuck), or strictly the agent-initiated slash command? Currently deferred, but the line is worth confirming during implementation.
+- How does the dashboard handle rejection? Does the agent get a comment back saying "denied, here's why," or does the bottle just stay torn down?
+- How does the orchestrator know which PR / branch a given bottle maps to: recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
+- Concurrency: if multiple bottles request changes simultaneously, what does the dashboard surface and in what order?
+
+## References
+
+- PRD 0010: cred-proxy (the endpoint extended to carry stuck-requests).
+- `CLAUDE.md`: project non-goal on agent-to-agent communication; this PRD stays on the human→agent side of that line.
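
Review note: the stuck-signal path in the PRD (cred-proxy writes a sentinel artifact to a host-mounted volume, the orchestrator polls it) might look roughly like the sketch below. It is only an illustration under stated assumptions: the `StuckRequest` name, its fields, the one-file-per-bottle layout, and both helper functions are hypothetical and not taken from the PRD, which says only that the artifact is "probably JSON" and carries what is needed, why, and what was tried.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class StuckRequest:
    # Hypothetical fields. The PRD names only "what's needed, why, what was
    # tried" plus enough to map a bottle back to its PR / branch.
    bottle_id: str
    branch: str
    needed: str
    why: str
    tried: str


def write_request(sentinel_dir: Path, req: StuckRequest) -> Path:
    """cred-proxy side (assumed): write the artifact into the host-mounted dir."""
    sentinel_dir.mkdir(parents=True, exist_ok=True)
    final = sentinel_dir / f"{req.bottle_id}.json"
    tmp = sentinel_dir / f"{req.bottle_id}.json.tmp"
    tmp.write_text(json.dumps(asdict(req), indent=2))
    # Rename into place so a polling pass never reads a half-written file.
    tmp.replace(final)
    return final


def poll_requests(sentinel_dir: Path) -> list[StuckRequest]:
    """Orchestrator side (assumed): one polling pass over pending artifacts."""
    pending = []
    # "*.json" does not match the temporary "*.json.tmp" files.
    for path in sorted(sentinel_dir.glob("*.json")):
        pending.append(StuckRequest(**json.loads(path.read_text())))
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        volume = Path(d) / "requests"
        write_request(
            volume,
            StuckRequest(
                bottle_id="bottle-1",
                branch="feature/example",
                needed="network access to the package index",
                why="dependency install fails inside the bottle",
                tried="retried the install with the offline cache",
            ),
        )
        for req in poll_requests(volume):
            print(f"{req.bottle_id} on {req.branch}: {req.needed}")
```

Writing to a temporary name and renaming is one way to keep the poller from seeing partial JSON; whether the real cred-proxy needs that depends on how the volume is mounted and how often the orchestrator polls.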