docs(prd): add supervise wait bounds

2026-06-02 07:58:39 +00:00
parent fe6059e4a6
commit 7c260eeff9
1 changed files with 101 additions and 0 deletions
@@ -0,0 +1,101 @@
+# PRD 0035: Supervise Wait Bounds
+
+- **Status:** Draft
+- **Author:** didericis-codex
+- **Created:** 2026-06-02
+- **Issue:** #128
+
+## Summary
+
+Bound the supervise sidecar's request-thread waits so an agent tool call cannot
+hold an HTTP worker forever while waiting for operator action. Preserve the MCP
+tool surface, but make timeout behavior explicit, observable, and tested.
+
+## Problem
+
+`bot_bottle/supervise_server.py` handles MCP over a threaded stdlib HTTP
+server. Tool calls validate a proposal, write it to the supervise queue, and
+then wait for the operator response file. Today that wait can last forever.
+Each outstanding tool call consumes one server thread until the operator acts.
+
+The route-listing helper also performs a live HTTP request to egress inside the
+request thread. It has a short timeout today, but the behavior is not described
+as part of a broader request-thread budget.
+
+This is operationally risky in multi-agent or repeated-call scenarios: a stuck
+or ignored proposal can accumulate blocked threads, and callers do not get a
+clear "still pending" answer they can reason about.
+
+## Goals / Success Criteria
+
+- Add a bounded wait for operator responses to supervise tool calls.
+- Return a clear JSON-RPC tool result when the proposal remains pending after
+  the timeout; do not treat pending operator action as an internal server error.
+- Keep the queued proposal on disk after timeout so the operator can still act.
+- Make the wait duration configurable through an environment variable, with a
+  conservative default.
+- Preserve current success and rejection result shapes for completed operator
+  responses.
+- Keep `list-egress-routes` bounded and document its timeout behavior.
+- Add focused tests for approved responses, timed-out pending responses, and
+  route-list timeout/error handling.
+
+## Non-goals
+
+- No asynchronous framework or new runtime dependency.
+- No replacement of the stdlib threaded HTTP server.
+- No change to the host-side supervise queue format.
+- No cancellation protocol between the agent and operator UI.
+- No dashboard or TUI changes.
+
+## Scope
+
+In scope:
+
+- `bot_bottle/supervise_server.py` request handling.
+- Any small helper in `bot_bottle/supervise.py` needed to support a bounded
+  wait cleanly.
+- Unit tests around tool-call response waiting and route-list behavior.
+
+Out of scope:
+
+- Reworking proposal persistence.
+- Changing egress apply or pipelock apply flows.
+- Adding background workers to complete HTTP requests after the client returns.
+
+## Design
+
+Introduce a supervise response wait budget, for example
+`SUPERVISE_RESPONSE_TIMEOUT_SECONDS`, with a documented default. The existing
+poll loop should stop after that budget and return a normal tool result such as
+`{"status": "pending", "notes": "operator response timed out; proposal remains queued"}`.
+The exact field names should fit the existing response schema so agents can
+handle success, rejection, and pending with one result parser.
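
A minimal sketch of how the wait budget could be read from the environment. The variable name comes from the example above; the 30-second default, the helper name `response_timeout_seconds`, and the choice to fall back (rather than fail) on invalid values are assumptions for illustration, not decisions made by this PRD.

```python
import math
import os

# Hypothetical default; the PRD leaves the actual value as an open question.
DEFAULT_RESPONSE_TIMEOUT_SECONDS = 30.0


def response_timeout_seconds(env=None):
    """Return the wait budget in seconds, falling back to the default.

    Unset, empty, non-numeric, non-finite, or non-positive values all
    fall back to the default instead of raising.
    """
    env = os.environ if env is None else env
    raw = env.get("SUPERVISE_RESPONSE_TIMEOUT_SECONDS")
    if raw is None or raw.strip() == "":
        return DEFAULT_RESPONSE_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_RESPONSE_TIMEOUT_SECONDS
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_RESPONSE_TIMEOUT_SECONDS
    return value
```

Passing a plain dict as `env` keeps the helper testable without mutating the process environment.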
+
+The proposal file must remain in the queue when the HTTP call times out. The
+operator can still approve or reject it later, but that later response will not
+resume the original HTTP request.
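
The bounded wait described in this section could look roughly like the sketch below. The function name `wait_for_response`, the poll interval, and the assumption that the operator response is a JSON file written atomically are all hypothetical; the real poll loop in `bot_bottle/supervise_server.py` may differ. Note that the timeout path only returns a result and never touches the proposal file.

```python
import json
import time
from pathlib import Path


def wait_for_response(response_path, timeout_seconds, poll_interval=0.05):
    """Poll for the operator response file until a deadline passes.

    Returns the parsed response if it appears in time, otherwise a
    "pending" result. Nothing on disk is deleted on timeout, so the
    queued proposal stays available to the operator.
    """
    response_path = Path(response_path)
    # A monotonic clock keeps the budget stable across wall-clock changes.
    deadline = time.monotonic() + timeout_seconds
    while True:
        if response_path.exists():
            # Assumes the writer creates the file atomically (e.g. rename).
            return json.loads(response_path.read_text())
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return {
                "status": "pending",
                "notes": "operator response timed out; proposal remains queued",
            }
        time.sleep(min(poll_interval, remaining))
```

Because the response file is checked before the deadline on every iteration, a response that lands just before the budget expires is still returned rather than reported as pending.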
+
+Route-listing should continue to use a short HTTP timeout to egress. Errors
+should be returned as tool results or JSON-RPC errors, consistent with the
+existing server behavior; the implementation should avoid an unbounded socket
+wait in the request thread.
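
As an illustration of a bounded route-list call, the sketch below passes an explicit `timeout` to `urllib.request.urlopen` and converts failures into an ordinary result. The function name, the 2-second default, the result shape, and the injectable `opener` parameter are assumptions made for this example; the existing helper may report errors differently.

```python
import json
import urllib.request


def list_egress_routes(url, timeout_seconds=2.0, opener=urllib.request.urlopen):
    """Fetch routes from egress with a bounded socket wait.

    `opener` is injectable so tests can substitute a fake URL opener.
    """
    try:
        # The timeout bounds blocking socket operations on this request.
        with opener(url, timeout=timeout_seconds) as resp:
            routes = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers URLError and socket timeouts; ValueError covers
        # malformed JSON and undecodable bytes.
        return {"status": "error", "notes": f"egress route listing failed: {exc}"}
    return {"status": "ok", "routes": routes}
```

Injecting the opener lets a unit test simulate a timeout without opening a real socket.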
+
+## Testing Strategy
+
+- Unit-test a tool call whose response appears before the timeout.
+- Unit-test a tool call whose response never appears and assert the request
+  returns a pending result while the proposal remains queued.
+- Unit-test that invalid timeout env values fall back to the default or fail
+  clearly.
+- Unit-test `list-egress-routes` timeout/error behavior with a fake URL opener.
+
+Run:
+
+- `python3 -m unittest tests.unit.test_supervise_server`
+- `python3 -m unittest discover -s tests/unit`
+
+## Open Questions
+
+- What default wait budget is best for agent ergonomics? A short timeout keeps
+  worker threads free; a longer timeout gives an operator more time to respond
+  inline before the agent has to retry.