PRD 0012: Stuck-agent recovery flow

  • Status: Draft
  • Author: didericis
  • Created: 2026-05-24

Summary

When an agent running inside a claude-bottle container gets blocked by a missing permission, tool, or skill, it asks for help via a PR comment; the user approves a manifest change in a TUI dashboard; the orchestrator rebuilds the container from the new manifest and resumes work on the same branch — without ever opening a live channel into the running bottle.

Problem

Running agents in isolated bottles makes it cheap to spin up work in parallel but expensive to recover when one of them gets stuck. Today, if a bottle is missing a permission or a tool the agent needs to make progress, the only options are to kill the container and start over (losing work) or open a live channel into the bottle to fix it in place (breaking the sandbox property that makes bottles trustworthy in the first place). The user feels this directly whenever a parallel run blocks on something the manifest didn't anticipate.

Goals / Success Criteria

A real stuck agent recovers end-to-end through the flow: the agent hits a missing permission and files a change request, the orchestrator posts a PR comment describing the ask, the user reviews the request in a dashboard and approves a manifest diff, and a fresh bottle picks up on the same branch and continues. The whole loop completes without anyone running docker attach or opening any live channel into the original container.

Non-goals

  • Live attach or in-place mutation of running containers. The whole design exists to avoid this.
  • Agent-to-agent communication. Re-stated from the project's existing non-goals; the recovery flow is human→agent only.
  • Auditing or forensic replay of agent runs. Git/forge history is the audit log; this PRD does not add a separate run log.
  • Reducing time-to-unstuck below some target. Faster than kill-and-restart is implicit, but no specific SLO is in scope.

Scope

In scope

  • A /request-bottle-change slash command the agent invokes when it knows it's blocked.
  • A TUI dashboard that lists running bottles and pending change requests, and takes approve/reject input from the user.
  • A rebuild orchestrator that tears down the old bottle, applies the approved manifest change, and starts a replacement bottle on the same branch.
  • A state-preservation helper that carries forward what it can across the rebuild (working tree is mandatory; transcript / reasoning context is best-effort).
  • A stuck-signal mechanism that does not require a forge token inside the bottle: the agent's slash command sends the request to the existing cred-proxy endpoint, which (with a host-mounted volume) writes the sentinel artifact on the host side. The orchestrator polls that artifact and posts the PR comment from outside the bottle.
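
The host-side polling half of this mechanism could look roughly like the sketch below. It assumes the cred-proxy drops one JSON file per request into the host-mounted directory; the directory layout, the `.json` suffix, and the `poll_requests` name are hypothetical and not fixed by this PRD.

```python
import json
from pathlib import Path


def poll_requests(sentinel_dir: Path, seen: set[str]) -> list[dict]:
    """Return stuck-requests that appeared since the last poll.

    Assumption: the cred-proxy writes one JSON file per request into
    sentinel_dir (a host-mounted volume). File naming is hypothetical.
    """
    new_requests = []
    for path in sorted(sentinel_dir.glob("*.json")):
        if path.name in seen:
            continue
        try:
            request = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # The file may still be mid-write; retry on the next poll.
            continue
        seen.add(path.name)
        new_requests.append(request)
    return new_requests
```

The orchestrator would call this on a timer and post one PR comment per returned request, so the forge token stays outside the bottle.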

Out of scope

  • A tool-denial hook that auto-detects "stuck" without the agent's involvement. Deferred to a follow-up; v1 is opt-in via the slash command.
  • A web dashboard. TUI only in v1.
  • Live channel into running containers (see Non-goals).
  • Agent-to-agent communication (see Non-goals).
  • Auditing / forensic replay (see Non-goals).

Proposed Design

New services / components

  • /request-bottle-change slash command. Shipped as a skill mounted into bottles. When the agent invokes it, the command POSTs a structured request (what's needed, why, what was tried) to the cred-proxy endpoint and halts the agent. The agent never touches the host filesystem.
  • TUI dashboard. A claude-bottle dashboard (or similarly named) command that lists running bottles, surfaces pending change requests, shows the proposed manifest diff, and accepts approve/reject input. Targets stdlib only; a TUI library is added only if the experience truly demands it.
  • Rebuild orchestrator. The plumbing that, on approval, snapshots state via the state-preservation helper, tears down the existing bottle, applies the approved manifest change, and starts a fresh bottle on the same branch.
  • State-preservation helper. Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context into the replacement container so the new agent starts warm rather than cold.
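
One possible ordering of the rebuild is sketched below: the mandatory working-tree push runs first and aborts the rebuild on failure, the transcript snapshot is best-effort, and teardown only happens after both. The `RebuildSteps` hooks and the `rebuild` function are hypothetical placeholders; the real lifecycle scripts are not specified in this PRD.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RebuildSteps:
    """Hypothetical hooks standing in for the real lifecycle scripts."""
    push_working_tree: Callable[[], None]    # mandatory, must succeed
    snapshot_transcript: Callable[[], None]  # best-effort
    teardown: Callable[[], None]
    apply_manifest_change: Callable[[], None]
    start_replacement: Callable[[], None]


def rebuild(steps: RebuildSteps) -> list[str]:
    """Run an approved rebuild and return the names of completed steps."""
    done = []
    # Mandatory: if the push fails, the exception propagates and teardown
    # never runs, so the old bottle's work is not lost.
    steps.push_working_tree()
    done.append("push_working_tree")
    try:
        steps.snapshot_transcript()
        done.append("snapshot_transcript")
    except Exception:
        # Best-effort: a failed snapshot means a cold start, not an abort.
        pass
    for name in ("teardown", "apply_manifest_change", "start_replacement"):
        getattr(steps, name)()
        done.append(name)
    return done
```

Returning the completed step names gives the dashboard something to display, for example to show that the replacement agent started cold because the snapshot was skipped.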

Existing code touched

  • cred-proxy (PRD 0010) — extended with an endpoint that accepts stuck-requests from inside a bottle and writes the sentinel artifact to a host-mounted volume.
  • cli.py — gains the dashboard subcommand and the rebuild path.
  • Bottle lifecycle scripts — extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
  • Bottle manifest schema — may need to record the originating manifest version / change history per agent run, so the dashboard can show "what changed" rather than "what is."
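
To show "what changed" rather than "what is," the dashboard could render a unified diff between the originating manifest and the proposed one. The sketch below uses the standard library's `difflib` and treats manifests as plain text, which is an assumption; the `manifest_diff` name and the file labels are hypothetical.

```python
import difflib


def manifest_diff(old_text: str, new_text: str) -> str:
    """Render a unified diff between two manifest versions.

    Assumption: manifests are plain text; the PRD does not fix a format.
    """
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="manifest (current)",
        tofile="manifest (proposed)",
    )
    return "".join(diff)
```

An empty result means the proposed manifest is identical to the current one, which the dashboard could treat as a no-op request.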

Data model changes

  • A new stuck-request artifact (probably JSON) written by the cred-proxy on behalf of the agent, with whatever fields the dashboard needs to render the ask.
  • A per-agent-run record sufficient to map a running bottle back to its PR / branch, so the orchestrator knows where to post the comment and which branch to resume on.
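
A rough shape for these two records, as a sketch only. The `needed`, `reason`, and `tried` fields mirror the "what's needed, why, what was tried" payload described above; every other field name, the `bottle_id` join key, and the helper functions are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class StuckRequest:
    bottle_id: str  # hypothetical key linking the request to a RunRecord
    needed: str     # what the agent is missing
    reason: str     # why it is needed
    tried: str      # what the agent already attempted


@dataclass(frozen=True)
class RunRecord:
    bottle_id: str
    branch: str
    pr_number: int


def to_artifact(request: StuckRequest) -> str:
    """Serialize a request to the JSON text the cred-proxy would write."""
    return json.dumps(asdict(request), indent=2, sort_keys=True)


def find_run(records: list[RunRecord], request: StuckRequest) -> Optional[RunRecord]:
    """Map a request back to its PR and branch, or None if unknown."""
    for record in records:
        if record.bottle_id == request.bottle_id:
            return record
    return None
```

With a lookup like this, the orchestrator knows which PR to comment on and which branch the replacement bottle resumes.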

External dependencies

  • The Gitea API / tea CLI is already in the toolbox (the project is on Gitea); no new auth surface beyond what the orchestrator already needs to read/post on PRs.
  • A TUI library is a maybe — only if stdlib can't carry the dashboard experience. Default to no new dependency.

Open questions

  • What exactly does best-effort transcript preservation look like? Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
  • Should v1 also ship the tool-denial hook (auto-detect stuck), or strictly the agent-initiated slash command? Currently deferred, but the line is worth confirming during implementation.
  • How does the dashboard handle rejection? Does the agent get a comment back saying "denied, here's why," or does the bottle just stay torn down?
  • How does the orchestrator know which PR / branch a given bottle maps to — recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
  • Concurrency: if multiple bottles request changes simultaneously, what does the dashboard surface and in what order?
  • How does the flow handle one-off exceptions to gitlock / pipelock denials — e.g. a commit that includes docs with intentionally-bogus tokens that the secret scanner correctly flags? The shape (agent blocked → ask via PR comment → user approves → continue) is the same as a manifest-change request, but the resolution is different: a per-operation override or a scoped allowlist entry, not a new manifest. Does this fold into the same /request-bottle-change slash command with a different request type, or is it a separate slash command (e.g. /request-gate-exception)? And how is an "exception" expressed safely — by commit SHA, by content hash, by a narrow allowlist rule? Either way, the approval must be auditable so a future reader can see what was waived and why.
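
To make the content-hash option in the last question concrete, here is one hypothetical shape for a gate exception. It is an illustration of that single option, not a proposed answer; the `GateException` record, its fields, and the helper names are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GateException:
    content_sha256: str  # exact content being waived
    approved_by: str     # who approved, for auditability
    reason: str          # why the denial was waived


def content_hash(content: bytes) -> str:
    """Hash the flagged content so the waiver covers only these bytes."""
    return hashlib.sha256(content).hexdigest()


def is_waived(content: bytes, exceptions: list[GateException]) -> bool:
    """True only if an approved exception matches this exact content."""
    digest = content_hash(content)
    return any(e.content_sha256 == digest for e in exceptions)
```

Keying on the content hash keeps the waiver narrow: any edit to the flagged content changes the digest and the gate applies again.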

References

  • PRD 0010 — cred-proxy (the endpoint extended to carry stuck-requests).
  • CLAUDE.md — project non-goal on agent-to-agent communication; this PRD stays on the human→agent side of that line.