diff --git a/docs/prds/0035-supervise-wait-bounds.md b/docs/prds/0035-supervise-wait-bounds.md
new file mode 100644
index 0000000..ee68781
--- /dev/null
+++ b/docs/prds/0035-supervise-wait-bounds.md
@@ -0,0 +1,101 @@
+# PRD 0035: Supervise Wait Bounds
+
+- **Status:** Draft
+- **Author:** didericis-codex
+- **Created:** 2026-06-02
+- **Issue:** #128
+
+## Summary
+
+Bound the supervise sidecar's request-thread waits so an agent tool call cannot
+hold an HTTP worker forever while waiting for operator action. Preserve the MCP
+tool surface, but make timeout behavior explicit, observable, and tested.
+
+## Problem
+
+`bot_bottle/supervise_server.py` handles MCP over a threaded stdlib HTTP
+server. Tool calls validate a proposal, write it to the supervise queue, and
+then wait for the operator response file. Today that wait can last forever.
+Each outstanding tool call consumes one server thread until the operator acts.
+
+The route-listing helper also performs a live HTTP request to egress inside the
+request thread. It has a short timeout today, but the behavior is not described
+as part of a broader request-thread budget.
+
+This is operationally risky in multi-agent or repeated-call scenarios: a stuck
+or ignored proposal can accumulate blocked threads, and callers do not get a
+clear "still pending" answer they can reason about.
+
+## Goals / Success Criteria
+
+- Add a bounded wait for operator responses to supervise tool calls.
+- Return a clear JSON-RPC tool result when the proposal remains pending after
+  the timeout; do not treat pending operator action as an internal server error.
+- Keep the queued proposal on disk after timeout so the operator can still act.
+- Make the wait duration configurable by environment with a conservative
+  default.
+- Preserve current success and rejection result shapes for completed operator
+  responses.
+- Keep `list-egress-routes` bounded and document its timeout behavior.
+- Add focused tests for approved responses, timed-out pending responses, and
+  route-list timeout/error handling.
+
+## Non-goals
+
+- No asynchronous framework or new runtime dependency.
+- No replacement of the stdlib threaded HTTP server.
+- No change to the host-side supervise queue format.
+- No cancellation protocol between the agent and operator UI.
+- No dashboard or TUI changes.
+
+## Scope
+
+In scope:
+
+- `bot_bottle/supervise_server.py` request handling.
+- Any small helper in `bot_bottle/supervise.py` needed to support a bounded
+  wait cleanly.
+- Unit tests around tool-call response waiting and route-list behavior.
+
+Out of scope:
+
+- Reworking proposal persistence.
+- Changing egress apply or pipelock apply flows.
+- Adding background workers to complete HTTP requests after the client returns.
+
+## Design
+
+Introduce a supervise response wait budget, for example
+`SUPERVISE_RESPONSE_TIMEOUT_SECONDS`, with a documented default. The existing
+poll loop should stop after that budget and return a normal tool result such as
+`{"status": "pending", "notes": "operator response timed out; proposal remains queued"}`.
+The exact field names should fit the existing response schema so agents can
+handle success, rejection, and pending with one result parser.
+
+The proposal file must remain in the queue when the HTTP call times out. The
+operator can still approve or reject it later, but that later response will not
+resume the original HTTP request.
+
+Route-listing should continue to use a short HTTP timeout to egress. Errors
+should be returned as tool results or JSON-RPC errors consistently with the
+existing server behavior; the implementation should avoid an unbounded socket
+wait in the request thread.
+
+## Testing Strategy
+
+- Unit-test a tool call whose response appears before the timeout.
+- Unit-test a tool call whose response never appears and assert the request
+  returns a pending result while the proposal remains queued.
+- Unit-test that invalid timeout env values fall back or fail clearly.
+- Unit-test `list-egress-routes` timeout/error behavior with a fake URL opener.
+
+Run:
+
+- `python3 -m unittest tests.unit.test_supervise_server`
+- `python3 -m unittest discover -s tests/unit`
+
+## Open Questions
+
+- What default wait budget is best for agent ergonomics? A short timeout keeps
+  worker threads free; a longer timeout gives an operator more time to respond
+  inline before the agent has to retry.
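
Review note: the bounded wait described under Design could look roughly like the sketch below. It is not taken from `bot_bottle/supervise_server.py`. The helper names `read_timeout` and `wait_for_response`, the 30-second default, the 0.25-second poll interval, and the assumption that the operator response is a JSON file written atomically are all hypothetical. Only the `SUPERVISE_RESPONSE_TIMEOUT_SECONDS` name and the pending result shape come from the PRD. This sketch takes the "fall back" option for invalid env values; failing loudly at startup would be an equally valid reading of the Testing Strategy.

```python
import json
import os
import time
from pathlib import Path

# Assumed default; the PRD leaves the actual value as an open question.
DEFAULT_TIMEOUT_SECONDS = 30.0


def read_timeout(env=os.environ):
    """Parse the wait budget, falling back to the default on bad values."""
    raw = env.get("SUPERVISE_RESPONSE_TIMEOUT_SECONDS")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    # Reject zero, negative, NaN, and infinite budgets.
    if not (0 < value < float("inf")):
        return DEFAULT_TIMEOUT_SECONDS
    return value


def wait_for_response(response_path, timeout, poll_interval=0.25,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll for the operator response file until the budget is spent.

    Only the response path is read; the queued proposal is never touched,
    so it stays on disk when the wait times out.
    """
    deadline = clock() + timeout
    while True:
        try:
            # Assumes the response file is written atomically as JSON.
            return json.loads(Path(response_path).read_text())
        except FileNotFoundError:
            pass  # operator has not responded yet
        remaining = deadline - clock()
        if remaining <= 0:
            return {
                "status": "pending",
                "notes": "operator response timed out; proposal remains queued",
            }
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

Using a monotonic clock keeps the budget stable if the wall clock is adjusted, and injecting `clock` and `sleep` lets the unit tests listed under Testing Strategy exercise the timeout path without real delays.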