Flip Status: Draft -> Active for the 23 PRDs whose work has shipped to main (including 0027, now that PR #95 has merged). Leaves the terminal-status PRDs unchanged: 0007 and 0010 (Superseded) and 0014 (Retargeted) were replaced, not shipped as-is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# PRD 0016: capability block remediation

- **Status:** Active
- **Author:** didericis
- **Created:** 2026-05-25
- **Parent:** PRD 0012
- **Depends on:** PRD 0013
## Summary

Wires the **capability block** path (PRD 0012 *Stuck categories*) end-to-end. When the operator approves a `capability-block` proposal, the rebuild orchestrator tears down the existing bottle, builds from the new Dockerfile, and starts a replacement bottle on the same branch, with the state-preservation helper handling the hand-off. The replacement agent picks up where the original left off, now with the missing capability. This is the heaviest of the three remediation PRDs because the orchestrator and state-preservation helper are non-trivial.
## Problem

See PRD 0012. This PRD addresses one specific gap. With 0013 in place, the operator can approve a `capability-block` proposal, but nothing happens: the bottle is not rebuilt and the agent stays stuck. This PRD closes the loop. Unlike 0014 and 0015, the remediation requires container teardown, rebuild, and state hand-off, so the design surface is larger.
## Goals / Success Criteria

A real capability block recovers end-to-end:

1. The agent's invocation of a tool, command, or skill fails (not found, permission denied).
2. The agent calls `capability-block` with a proposed Dockerfile and a justification.
3. The operator approves in the TUI.
4. The orchestrator tears down the bottle and starts a replacement built from the new Dockerfile.
5. The replacement agent inherits the working tree and a best-effort transcript, and continues on the same branch.
## Non-goals

- Live mutation of the running container (re-stated from PRD 0012 non-goals).
- Forking into multiple parallel rebuilt bottles. One-for-one replacement only.
- cred-proxy or pipelock handling (covered by 0014 and 0015).
## Scope

### In scope

- A rebuild orchestrator that, on operator approval, tears down the existing bottle, builds from the approved Dockerfile, and starts a replacement on the same branch.
- A state-preservation helper that handles the hand-off across the rebuild: working tree push is mandatory; transcript / reasoning context is best-effort.
- `capability-block` approval handler in the MCP sidecar (replacing the 0013 no-op): on approval, hand off to the orchestrator.
- Bottle lifecycle script changes for orchestrated teardown + rebuild (distinct from a fresh-spawn).
- Bottle manifest schema changes: record originating manifest version / change history per agent run so the dashboard can show "what changed" rather than "what is."
- A per-agent-run record that maps a running bottle back to its PR / branch, so the orchestrator knows which branch to resume on.
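To make the "what changed" rather than "what is" distinction concrete, here is a minimal sketch that derives a change list from two Dockerfile versions using the standard-library `difflib`. The function name `manifest_change` and the idea of storing Dockerfile text per manifest version are assumptions for illustration; this PRD does not fix the manifest schema.

```python
import difflib


def manifest_change(old_dockerfile: str, new_dockerfile: str) -> list[str]:
    """Return only the added (+) and removed (-) lines between two versions.

    Hypothetical helper: the real manifest schema is not defined in this PRD.
    """
    diff = list(
        difflib.unified_diff(
            old_dockerfile.splitlines(),
            new_dockerfile.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
            n=0,  # no context lines; the dashboard only needs the delta
        )
    )
    # The first two lines are the ---/+++ file headers; hunk markers start
    # with "@@". Everything else that starts with + or - is a changed line.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

A dashboard could render the returned lines directly, while the full Dockerfile remains available for the "what is" view.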
### Out of scope

- Rolling back a rebuild that the replacement agent regrets. The audit trail (git history + bottle rebuild record) shows what changed; a follow-up `capability-block` proposal can revert.
## Proposed Design

### New services / components

- **Rebuild orchestrator.** On approval, snapshots state via the state-preservation helper, tears down the existing bottle, builds from the new Dockerfile, and starts a fresh bottle on the same branch.
- **State-preservation helper.** Mandatory: ensures the working tree is pushed before teardown. Best-effort: carries forward the agent's transcript / reasoning context (including the approved `capability-block` proposal) into the replacement container so the new agent starts warm rather than cold.
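The ordering constraints on the two components above (mandatory push before teardown, best-effort transcript) can be sketched as follows. This is an illustration under stated assumptions, not the implementation: `RebuildSteps`, `RebuildAborted`, and every hook name are hypothetical, and the hooks are injected so the sequencing can be exercised without Docker or git.

```python
from dataclasses import dataclass
from typing import Callable


class RebuildAborted(Exception):
    """Raised when the mandatory working-tree push fails before teardown."""


@dataclass
class RebuildSteps:
    # Every callable here is a hypothetical hook, not an existing API.
    push_working_tree: Callable[[], bool]     # mandatory; True on success
    snapshot_transcript: Callable[[], None]   # best-effort; may raise
    teardown_bottle: Callable[[], None]
    build_image: Callable[[str], str]         # Dockerfile path -> image id
    start_bottle: Callable[[str, str], None]  # (image id, branch)


def rebuild(steps: RebuildSteps, dockerfile: str, branch: str) -> list[str]:
    """Run a one-for-one rebuild and return the log of completed steps."""
    log: list[str] = []

    # Mandatory: never tear down a bottle whose work has not been pushed.
    if not steps.push_working_tree():
        raise RebuildAborted("working tree push failed; bottle left running")
    log.append("push")

    # Best-effort: a failed transcript snapshot must not block the rebuild.
    try:
        steps.snapshot_transcript()
        log.append("transcript")
    except Exception:
        log.append("transcript-skipped")

    steps.teardown_bottle()
    log.append("teardown")
    image = steps.build_image(dockerfile)
    log.append("build")
    steps.start_bottle(image, branch)
    log.append("start")
    return log
```

The sketch follows the teardown-then-build order used in this PRD. If the push fails, it raises before any teardown happens, so the original bottle stays up.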
### Existing code touched

- **MCP sidecar** (PRD 0013): the `capability-block` approval handler stops being a no-op; on approval, it hands off to the rebuild orchestrator.
- **Bottle lifecycle scripts**: extended for orchestrated teardown + rebuild with state hand-off, distinct from a fresh-spawn.
- **Bottle manifest schema**: records the originating manifest version / change history per agent run.
- **`cli.py`**: gains the rebuild path.
### Data model changes

- A per-agent-run record sufficient to map a running bottle back to its PR / branch.
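One possible shape for the per-agent-run record is sketched below. It assumes the mapping is recorded at bottle-spawn time, which is only one of the options still listed as an open question, and every name in it (`AgentRun`, `RunRegistry`, the field names) is hypothetical rather than part of an agreed schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRun:
    # Field names are illustrative assumptions, not the PRD's schema.
    bottle_id: str
    branch: str
    pr_number: Optional[int] = None
    manifest_versions: list[str] = field(default_factory=list)


class RunRegistry:
    """In-memory stand-in for wherever the per-agent-run record is stored."""

    def __init__(self) -> None:
        self._by_bottle: dict[str, AgentRun] = {}

    def record_spawn(self, run: AgentRun) -> None:
        self._by_bottle[run.bottle_id] = run

    def branch_for(self, bottle_id: str) -> str:
        # What the orchestrator needs: which branch to resume on.
        return self._by_bottle[bottle_id].branch

    def replace_bottle(self, old_id: str, new_id: str, manifest_version: str) -> AgentRun:
        """One-for-one replacement: the run keeps its branch and PR."""
        run = self._by_bottle.pop(old_id)
        run.bottle_id = new_id
        run.manifest_versions.append(manifest_version)
        self._by_bottle[new_id] = run
        return run
```

Keeping the run identity separate from the bottle identity lets the replacement bottle inherit the branch and the manifest history without re-deriving them from the working tree.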
## Open questions

- **`capability-block` return semantics.** The current agent is torn down on approval, so the tool's return value never reaches it. Options: (a) fire-and-forget, the tool returns immediately with "queued" and the agent halts; (b) block the tool, let the rebuild orchestrator's teardown kill the connection, the replacement agent gets the approval record via state-preservation; (c) the tool blocks, returns "approved" right before teardown, the agent has milliseconds to log it. (b) seems cleanest but is worth confirming during implementation.
- **Best-effort transcript preservation.** Mount the agent's state directory, snapshot on teardown, remount in the replacement? How much fidelity is "good enough" for the new agent to pick up?
- **Bottle → PR/branch mapping.** Recorded at bottle-spawn time, derived from the working tree, or specified in the manifest?
- **Rejection semantics.** Does the agent receive a tool reply explaining the rejection, or does the bottle just stay torn down?
## References

- PRD 0012: stuck-agent recovery flow overview.
- PRD 0013: supervise plane foundation (prerequisite).