docs: add PRD 0012 — stuck-agent recovery flow
test / unit (pull_request) Successful in 12s
test / integration (pull_request) Successful in 22s

This commit is contained in:
2026-05-24 23:10:30 -04:00
parent b0581e60d7
commit b4c9e149b0
@@ -0,0 +1,81 @@
# PRD 0012: Stuck-agent recovery flow
- **Status:** Draft
- **Author:** didericis
- **Created:** 2026-05-24
## Summary
When an agent running inside a claude-bottle container gets blocked by a missing permission, tool, or skill, it asks for help via a PR comment; the user approves a manifest change in a TUI dashboard; the orchestrator rebuilds the container from the new manifest and resumes work on the same branch — without ever opening a live channel into the running bottle.
## Problem
Running parallel agents in isolated bottles makes it cheap to spin up work in parallel, but expensive to recover when an agent gets stuck. Today, if a bottle is missing a permission or a tool the agent needs to make progress, the only options are to kill the container and start over (losing work) or open a live channel into the bottle to fix it in place (breaking the sandbox property that makes bottles trustworthy in the first place). The user feels this directly whenever a parallel run blocks on something the manifest didn't anticipate.
## Goals / Success Criteria
A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission, posts a PR comment describing the ask, the user reviews the request in a dashboard, approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running `docker attach` or opening any live channel into the original container.
## Non-goals
- Live attach or in-place mutation of running containers. The whole design exists to avoid this.
- Agent-to-agent communication. Re-stated from the project's existing non-goals; the recovery flow is human→agent only.
- Auditing or forensic replay of agent runs. Git/forge history is the audit log; this PRD does not add a separate run log.
- Reducing time-to-unstuck below some target. Faster than kill-and-restart is implicit, but no specific SLO is in scope.
## Scope
### In scope
- A `/request-bottle-change` slash command the agent invokes when it knows it's blocked.
- A TUI dashboard that lists running bottles and pending change requests, and takes approve/reject input from the user.
- A rebuild orchestrator that tears down the old bottle, applies the approved manifest change, and starts a replacement bottle on the same branch.
- A state-preservation helper that carries forward what it can across the rebuild (working tree is mandatory; transcript / reasoning context is best-effort).
- A stuck-signal mechanism that does not require a forge token inside the bottle: the agent's slash command sends the request to the existing cred-proxy endpoint, which (with a host-mounted volume) writes the sentinel artifact on the host side. The orchestrator polls that artifact and posts the PR comment from outside the bottle.
### Out of scope
- A tool-denial hook that auto-detects "stuck" without the agent's involvement. Deferred to a follow-up; v1 is opt-in via the slash command.
- A web dashboard. TUI only in v1.
- Live channel into running containers (see Non-goals).
- Agent-to-agent communication (see Non-goals).
- Auditing / forensic replay (see Non-goals).
## Proposed Design
### New services / components
- **`/request-bottle-change` slash command.** Shipped as a skill mounted into bottles. When the agent invokes it, the command POSTs a structured request (what's needed, why, what was tried) to the cred-proxy endpoint and halts the agent. The agent never touches the host filesystem.
- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command that lists running bottles, surfaces pending change requests, shows the proposed manifest diff, and accepts approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
- **Rebuild orchestrator.** The plumbing that, on approval, tears down the existing bottle, applies the approved manifest change, snapshots state via the state-preservation helper, and starts a fresh bottle on the same branch.
- **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
### Existing code touched
- **cred-proxy** (PRD 0010) — extended with an endpoint that accepts stuck-requests from inside a bottle and writes the sentinel artifact to a host-mounted volume.
- **`cli.py`** — gains the dashboard subcommand and the rebuild path.
- **Bottle lifecycle scripts** — extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
- **Bottle manifest schema** — may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
### Data model changes
- A new stuck-request artifact (probably JSON) written by the cred-proxy on behalf of the agent, with whatever fields the dashboard needs to render the ask.
- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the orchestrator knows where to post the comment and which branch to resume on.
### External dependencies
- The Gitea API / `tea` CLI is already in the toolbox (the project is on Gitea); no new auth surface beyond what the orchestrator already needs to read/post on PRs.
- A TUI library is a *maybe* — only if stdlib can't carry the dashboard experience. Default to no new dependency.
## Open questions
- What exactly does best-effort transcript preservation look like? Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
- Should v1 also ship the tool-denial hook (auto-detect stuck), or strictly the agent-initiated slash command? Currently deferred, but the line is worth confirming during implementation.
- How does the dashboard handle rejection? Does the agent get a comment back saying "denied, here's why," or does the bottle just stay torn down?
- How does the orchestrator know which PR / branch a given bottle maps to — recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
- Concurrency: if multiple bottles request changes simultaneously, what does the dashboard surface and in what order?
## References
- PRD 0010 — cred-proxy (the endpoint extended to carry stuck-requests).
- `CLAUDE.md` — project non-goal on agent-to-agent communication; this PRD stays on the human→agent side of that line.