docs: add PRD 0012 — stuck-agent recovery flow
@@ -0,0 +1,81 @@
# PRD 0012: Stuck-agent recovery flow

- **Status:** Draft
- **Author:** didericis
- **Created:** 2026-05-24

## Summary

When an agent running inside a claude-bottle container is blocked by a missing permission, tool, or skill, it asks for help, and the request surfaces as a PR comment. The user approves a manifest change in a TUI dashboard. The orchestrator then rebuilds the container from the new manifest and resumes work on the same branch, without ever opening a live channel into the running bottle.

## Problem

Running agents in isolated bottles makes it cheap to spin up work in parallel but expensive to recover when an agent gets stuck. Today, if a bottle is missing a permission or a tool the agent needs to make progress, the only options are to kill the container and start over (losing work) or to open a live channel into the bottle and fix it in place (breaking the sandbox property that makes bottles trustworthy in the first place). The user feels this directly whenever a parallel run blocks on something the manifest didn't anticipate.

## Goals / Success Criteria

A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission and files a request describing the ask, the request is posted as a PR comment, the user reviews it in the dashboard and approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running `docker attach` or opening any other live channel into the original container.

## Non-goals

- Live attach or in-place mutation of running containers. The whole design exists to avoid this.
- Agent-to-agent communication. Re-stated from the project's existing non-goals; the recovery flow is human→agent only.
- Auditing or forensic replay of agent runs. Git/forge history is the audit log; this PRD does not add a separate run log.
- Reducing time-to-unstuck below some target. Faster than kill-and-restart is implicit, but no specific SLO is in scope.

## Scope

### In scope

- A `/request-bottle-change` slash command the agent invokes when it knows it's blocked.
- A TUI dashboard that lists running bottles and pending change requests, and takes approve/reject input from the user.
- A rebuild orchestrator that tears down the old bottle, applies the approved manifest change, and starts a replacement bottle on the same branch.
- A state-preservation helper that carries forward what it can across the rebuild (working tree is mandatory; transcript / reasoning context is best-effort).
- A stuck-signal mechanism that does not require a forge token inside the bottle: the agent's slash command sends the request to a cred-proxy endpoint, which writes a sentinel artifact to a host-mounted volume so that the file lands on the host side. The orchestrator polls for that artifact and posts the PR comment from outside the bottle.

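The host-side half of the stuck-signal path could look roughly like the sketch below. It assumes the cred-proxy writes one JSON file per request into a directory on the host-mounted volume. The directory layout, the `.json` suffix, and the name `poll_stuck_requests` are illustrative assumptions and are not specified by this PRD.

```python
import json
from pathlib import Path


def poll_stuck_requests(sentinel_dir: Path, seen: set[str]) -> list[dict]:
    """Return requests that appeared since the last poll.

    Assumption: the cred-proxy writes one `<id>.json` file per request into
    `sentinel_dir` on the host-mounted volume. `seen` holds the file names
    already handled and is updated in place.
    """
    new_requests = []
    for path in sorted(sentinel_dir.glob("*.json")):
        if path.name in seen:
            continue
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # Possibly a partially written file; retry on the next poll.
            continue
        seen.add(path.name)
        new_requests.append(payload)
    return new_requests
```

An orchestrator loop would call this on a timer and post one PR comment per returned request. Because only the host side reads the directory, no forge token needs to exist inside the bottle.
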
### Out of scope

- A tool-denial hook that auto-detects "stuck" without the agent's involvement. Deferred to a follow-up; v1 is opt-in via the slash command.
- A web dashboard. TUI only in v1.
- Live channel into running containers (see Non-goals).
- Agent-to-agent communication (see Non-goals).
- Auditing / forensic replay (see Non-goals).

## Proposed Design

### New services / components

- **`/request-bottle-change` slash command.** Shipped as a skill mounted into bottles. When the agent invokes it, the command POSTs a structured request (what's needed, why, what was tried) to the cred-proxy endpoint and halts the agent. The agent never touches the host filesystem.
- **TUI dashboard.** A `claude-bottle dashboard` (or similarly named) command that lists running bottles, surfaces pending change requests, shows the proposed manifest diff, and accepts approve/reject input. Targets the standard library only; a TUI library is added only if the experience truly demands it.
- **Rebuild orchestrator.** The plumbing that, on approval, snapshots state via the state-preservation helper, tears down the existing bottle, applies the approved manifest change, and starts a fresh bottle on the same branch.
- **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.

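The ordering constraint between the orchestrator and the state-preservation helper can be sketched as follows. This is a minimal illustration, not the planned implementation: all five callables are hypothetical hooks that a caller would supply, and the function name `rebuild_bottle` is an assumption.

```python
from typing import Callable


def rebuild_bottle(
    push_working_tree: Callable[[], bool],
    snapshot_transcript: Callable[[], None],
    teardown: Callable[[], None],
    apply_manifest_change: Callable[[], None],
    start_replacement: Callable[[], None],
) -> bool:
    """Run the rebuild steps in order; return False if the push fails.

    All five callables are hypothetical hooks supplied by the caller.
    """
    # Mandatory: never tear down a bottle whose work is not pushed.
    if not push_working_tree():
        return False
    # Best-effort: a failed transcript snapshot must not block recovery.
    try:
        snapshot_transcript()
    except Exception:
        pass
    teardown()
    apply_manifest_change()
    start_replacement()
    return True
```

The push is the only step that can veto the rebuild, which mirrors the mandatory versus best-effort split described for the helper.
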
### Existing code touched

- **cred-proxy** (PRD 0010): extended with an endpoint that accepts stuck-requests from inside a bottle and writes the sentinel artifact to a host-mounted volume.
- **`cli.py`**: gains the dashboard subcommand and the rebuild path.
- **Bottle lifecycle scripts**: extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
- **Bottle manifest schema**: may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."

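To show "what changed" rather than "what is," the dashboard could diff the running manifest against the proposed one. The sketch below uses the standard library's `difflib`; it treats manifests as plain text, and the YAML-like content in the usage example is purely illustrative, since this PRD does not define the manifest format.

```python
import difflib


def manifest_diff(old: str, new: str) -> str:
    """Return a unified diff between two manifest versions (as text)."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="manifest (running)",
        tofile="manifest (proposed)",
    )
    return "".join(lines)


# Usage with made-up manifest content:
running = "tools:\n  - git\n"
proposed = "tools:\n  - git\n  - jq\n"
print(manifest_diff(running, proposed))
```
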
### Data model changes

- A new stuck-request artifact (probably JSON) written by the cred-proxy on behalf of the agent, with whatever fields the dashboard needs to render the ask.
- A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the orchestrator knows where to post the comment and which branch to resume on.

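One possible shape for these two records is sketched below. The request fields follow the "what's needed, why, what was tried" structure of the slash command; every field name, and the `bottle_id` key that links the two records, is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StuckRequest:
    bottle_id: str    # hypothetical identifier of the requesting bottle
    needed: str       # what the agent needs
    reason: str       # why it needs it
    tried: list[str]  # what the agent already tried

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "StuckRequest":
        return cls(**json.loads(text))


@dataclass
class RunRecord:
    bottle_id: str   # same hypothetical identifier as above
    branch: str      # branch to resume on
    pr_number: int   # PR where the comment is posted
```

With records like these, the orchestrator would look up the `RunRecord` whose `bottle_id` matches an incoming `StuckRequest` to find the PR and branch.
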
### External dependencies

- The Gitea API / `tea` CLI is already in the toolbox (the project is on Gitea); no new auth surface beyond what the orchestrator already needs to read/post on PRs.
- A TUI library is a *maybe*, needed only if stdlib can't carry the dashboard experience. Default to no new dependency.

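As a rough indication of how far the standard library alone might go, the sketch below renders pending requests as a plain-text table and parses an approve/reject answer. The dictionary keys `bottle` and `needed`, the accepted answer strings, and both function names are hypothetical.

```python
from typing import Optional


def render_pending(requests: list[dict]) -> str:
    """Format pending change requests as a fixed-width text table."""
    if not requests:
        return "No pending change requests."
    header = f"{'#':<3} {'BOTTLE':<12} NEEDED"
    rows = [
        f"{i:<3} {r['bottle']:<12} {r['needed']}"
        for i, r in enumerate(requests, start=1)
    ]
    return "\n".join([header, *rows])


def parse_decision(answer: str) -> Optional[str]:
    """Map user input to 'approve' or 'reject'; None if unrecognized."""
    normalized = answer.strip().lower()
    if normalized in ("a", "approve"):
        return "approve"
    if normalized in ("r", "reject"):
        return "reject"
    return None
```

A real dashboard would also need refresh and selection handling, which is where the question of adding a TUI library would be decided.
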
## Open questions

- What exactly does best-effort transcript preservation look like? Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
- Should v1 also ship the tool-denial hook (auto-detect stuck), or strictly the agent-initiated slash command? Currently deferred, but the line is worth confirming during implementation.
- How does the dashboard handle rejection? Does the agent get a comment back saying "denied, here's why," or does the bottle just stay torn down?
- How does the orchestrator know which PR / branch a given bottle maps to: recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
- Concurrency: if multiple bottles request changes simultaneously, what does the dashboard surface and in what order?

## References

- PRD 0010: cred-proxy (the endpoint extended to carry stuck-requests).
- `CLAUDE.md`: project non-goal on agent-to-agent communication; this PRD stays on the human→agent side of that line.