docs: add PRD 0012 — stuck-agent recovery flow

2026-05-24 23:10:30 -04:00
parent b0581e60d7
commit b4c9e149b0
1 changed files with 81 additions and 0 deletions
@@ -0,0 +1,81 @@
+# PRD 0012: Stuck-agent recovery flow
+
+- **Status:** Draft
+- **Author:** didericis
+- **Created:** 2026-05-24
+
+## Summary
+
+When an agent running inside a claude-bottle container is blocked by a missing permission, tool, or skill, it asks for help through a PR comment. The user approves a manifest change in a TUI dashboard, and the orchestrator rebuilds the container from the new manifest and resumes work on the same branch. At no point does anyone open a live channel into the running bottle.
+
+## Problem
+
+Running agents in isolated bottles makes it cheap to spin up work in parallel but expensive to recover when one of them gets stuck. Today, if a bottle is missing a permission or a tool the agent needs to make progress, the only options are to kill the container and start over (losing work) or to open a live channel into the bottle and fix it in place (breaking the sandbox property that makes bottles trustworthy in the first place). The user feels this directly whenever a parallel run blocks on something the manifest didn't anticipate.
+
+## Goals / Success Criteria
+
+A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission and files a change request, the orchestrator posts a PR comment describing the ask, the user reviews the request in a dashboard and approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running `docker attach` or opening any live channel into the original container.
+
+## Non-goals
+
+- Live attach or in-place mutation of running containers. The whole design exists to avoid this.
+- Agent-to-agent communication. Re-stated from the project's existing non-goals; the recovery flow is human→agent only.
+- Auditing or forensic replay of agent runs. Git/forge history is the audit log; this PRD does not add a separate run log.
+- Reducing time-to-unstuck below some target. Being faster than kill-and-restart is implied, but no specific SLO is in scope.
+
+## Scope
+
+### In scope
+
+- A `/request-bottle-change` slash command the agent invokes when it knows it's blocked.
+- A TUI dashboard that lists running bottles and pending change requests, and takes approve/reject input from the user.
+- A rebuild orchestrator that tears down the old bottle, applies the approved manifest change, and starts a replacement bottle on the same branch.
+- A state-preservation helper that carries forward what it can across the rebuild (working tree is mandatory; transcript / reasoning context is best-effort).
+- A stuck-signal mechanism that does not require a forge token inside the bottle: the agent's slash command sends the request to the existing cred-proxy endpoint, which writes a sentinel artifact to a host-mounted volume on the host side. The orchestrator polls for that artifact and posts the PR comment from outside the bottle.
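
The host side of the stuck-signal path could look roughly like the sketch below. It is not the project's implementation: the function names, the one-file-per-bottle layout, and the `bottle_id` key are all assumptions made for illustration, and `bottle_id` is assumed to be safe to use as a filename.

```python
import json
import os
import tempfile
from pathlib import Path


def write_sentinel(sentinel_dir: Path, bottle_id: str, request: dict) -> Path:
    """Cred-proxy side (hypothetical): persist one stuck-request as JSON.

    The payload goes to a temp file first and is then renamed, so a
    polling reader never sees a half-written artifact.
    """
    sentinel_dir.mkdir(parents=True, exist_ok=True)
    final_path = sentinel_dir / f"{bottle_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=sentinel_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(request, tmp)
    os.replace(tmp_name, final_path)  # atomic within a single filesystem
    return final_path


def pending_requests(sentinel_dir: Path) -> dict[str, dict]:
    """Orchestrator side (hypothetical): one polling pass over the directory."""
    found = {}
    for path in sorted(sentinel_dir.glob("*.json")):
        found[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return found
```

Because the temp file uses a `.tmp` suffix, the `*.json` glob in the polling pass skips writes that are still in progress. The orchestrator would call `pending_requests` on a timer and post a PR comment for each new entry.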
+
+### Out of scope
+
+- A tool-denial hook that auto-detects "stuck" without the agent's involvement. Deferred to a follow-up; v1 is opt-in via the slash command.
+- A web dashboard. TUI only in v1.
+- Live channel into running containers (see Non-goals).
+- Agent-to-agent communication (see Non-goals).
+- Auditing / forensic replay (see Non-goals).
+
+## Proposed Design
+
+### New services / components
+
+- **`/request-bottle-change` slash command.** Shipped as a skill mounted into bottles. When the agent invokes it, the command POSTs a structured request (what's needed, why, what was tried) to the cred-proxy endpoint and halts the agent. The agent never touches the host filesystem.
+- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command that lists running bottles, surfaces pending change requests, shows the proposed manifest diff, and accepts approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
+- **Rebuild orchestrator.** The plumbing that, on approval, snapshots state via the state-preservation helper, tears down the existing bottle, applies the approved manifest change, and starts a fresh bottle on the same branch.
+- **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
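
As a rough illustration of how the mandatory and best-effort parts of state preservation might shape a rebuild, here is a minimal sketch. Every callable name is hypothetical, and the ordering (push and snapshot before teardown) is an assumption drawn from the requirement that the working tree be pushed before teardown.

```python
from typing import Callable


def rebuild(
    push_working_tree: Callable[[], bool],
    snapshot_transcript: Callable[[], None],
    teardown: Callable[[], None],
    apply_manifest_change: Callable[[], None],
    start_bottle: Callable[[], None],
) -> list[str]:
    """Run one rebuild and return the names of the steps that completed."""
    done: list[str] = []

    # Mandatory: never tear down a bottle whose work has not been pushed.
    if not push_working_tree():
        raise RuntimeError("working tree not pushed; refusing to tear down")
    done.append("push_working_tree")

    # Best-effort: a failed snapshot means a cold start, not a failed rebuild.
    try:
        snapshot_transcript()
        done.append("snapshot_transcript")
    except Exception:
        pass

    for name, step in (
        ("teardown", teardown),
        ("apply_manifest_change", apply_manifest_change),
        ("start_bottle", start_bottle),
    ):
        step()
        done.append(name)
    return done
```

Passing the steps in as callables keeps the ordering logic testable without Docker. A failed push aborts before anything destructive happens, while a failed transcript snapshot is simply missing from the returned list.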
+
+### Existing code touched
+
+- **cred-proxy** (PRD 0010) — extended with an endpoint that accepts stuck-requests from inside a bottle and writes the sentinel artifact to a host-mounted volume.
+- **`cli.py`** — gains the dashboard subcommand and the rebuild path.
+- **Bottle lifecycle scripts** — extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
+- **Bottle manifest schema** — may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
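
To show "what changed" rather than "what is," the dashboard could render a plain unified diff between the running manifest and the proposed one. The sketch below treats the manifest as opaque text because the PRD does not fix its format; `manifest_diff` and the labels are hypothetical names.

```python
import difflib


def manifest_diff(old: str, new: str, name: str = "manifest") -> str:
    """Return a unified diff of two manifest texts, or "" if they match."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"{name} (running)",
            tofile=f"{name} (proposed)",
        )
    )


if __name__ == "__main__":
    running = "tools:\n  - git\n"
    proposed = "tools:\n  - git\n  - curl\n"
    print(manifest_diff(running, proposed), end="")
```

This stays within the stdlib, which matches the stated preference for no new dependencies in the dashboard.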
+
+### Data model changes
+
+- A new stuck-request artifact (probably JSON) written by the cred-proxy on behalf of the agent, with whatever fields the dashboard needs to render the ask.
+- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the orchestrator knows where to post the comment and which branch to resume on.
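
One possible shape for these two records is sketched below. The request fields mirror the "what's needed, why, what was tried" payload of the slash command. Everything else is an assumption for illustration: the class names, `bottle_id`, `pr_number`, and the choice of JSON.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StuckRequest:
    """Hypothetical stuck-request artifact written by the cred-proxy."""

    bottle_id: str
    needs: str  # what is needed
    why: str  # why the agent is blocked without it
    tried: list[str] = field(default_factory=list)  # what was tried

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "StuckRequest":
        return cls(**json.loads(raw))


@dataclass
class RunRecord:
    """Hypothetical per-agent-run record mapping a bottle to its PR and branch."""

    bottle_id: str
    branch: str
    pr_number: int


def lookup_run(records: list[RunRecord], bottle_id: str) -> RunRecord:
    """Find where to post the comment and which branch to resume on."""
    for record in records:
        if record.bottle_id == bottle_id:
            return record
    raise KeyError(bottle_id)
```

Keying both records on the same `bottle_id` lets the orchestrator go from a sentinel artifact to the PR and branch in a single lookup. Whether that mapping is recorded at spawn time or derived some other way is still an open question in this PRD.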
+
+### External dependencies
+
+- The Gitea API / `tea` CLI is already in the toolbox (the project is on Gitea); no new auth surface beyond what the orchestrator already needs to read/post on PRs.
+- A TUI library is a *maybe* — only if stdlib can't carry the dashboard experience. Default to no new dependency.
+
+## Open questions
+
+- What exactly does best-effort transcript preservation look like? Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
+- Should v1 also ship the tool-denial hook (auto-detect stuck), or strictly the agent-initiated slash command? Currently deferred, but the line is worth confirming during implementation.
+- How does the dashboard handle rejection? Does the agent get a comment back saying "denied, here's why," or does the bottle just stay halted?
+- How does the orchestrator know which PR / branch a given bottle maps to — recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
+- Concurrency: if multiple bottles request changes simultaneously, what does the dashboard surface and in what order?
+
+## References
+
+- PRD 0010 — cred-proxy (the endpoint extended to carry stuck-requests).
+- `CLAUDE.md` — project non-goal on agent-to-agent communication; this PRD stays on the human→agent side of that line.