docs(prd): add supervise wait bounds
@@ -0,0 +1,101 @@
# PRD 0035: Supervise Wait Bounds

- **Status:** Draft
- **Author:** didericis-codex
- **Created:** 2026-06-02
- **Issue:** #128

## Summary

Bound the supervise sidecar's request-thread waits so an agent tool call cannot
hold an HTTP worker forever while waiting for operator action. Preserve the MCP
tool surface, but make timeout behavior explicit, observable, and tested.

## Problem

`bot_bottle/supervise_server.py` handles MCP over a threaded stdlib HTTP
server. Tool calls validate a proposal, write it to the supervise queue, and
then wait for the operator response file. Today that wait can last forever.
Each outstanding tool call consumes one server thread until the operator acts.

The route-listing helper also performs a live HTTP request to egress inside the
request thread. It has a short timeout today, but the behavior is not described
as part of a broader request-thread budget.

This is operationally risky in multi-agent or repeated-call scenarios: a stuck
or ignored proposal can accumulate blocked threads, and callers do not get a
clear "still pending" answer they can reason about.

## Goals / Success Criteria

- Add a bounded wait for operator responses to supervise tool calls.
- Return a clear JSON-RPC tool result when the proposal remains pending after
  the timeout; do not treat pending operator action as an internal server error.
- Keep the queued proposal on disk after timeout so the operator can still act.
- Make the wait duration configurable by environment with a conservative
  default.
- Preserve current success and rejection result shapes for completed operator
  responses.
- Keep `list-egress-routes` bounded and document its timeout behavior.
- Add focused tests for approved responses, timed-out pending responses, and
  route-list timeout/error handling.
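The environment-configured wait duration could be read along these lines. This is a minimal sketch, not the implementation: the helper name `response_timeout_seconds` and the 30-second default are assumptions for illustration, and it takes the "fall back" option for invalid values where the PRD also allows failing clearly. The variable name `SUPERVISE_RESPONSE_TIMEOUT_SECONDS` is the example name suggested in the Design section.

```python
import math
import os

# Assumed default for illustration only; the PRD leaves the real value open.
DEFAULT_RESPONSE_TIMEOUT_SECONDS = 30.0


def response_timeout_seconds(environ=None, default=DEFAULT_RESPONSE_TIMEOUT_SECONDS):
    """Return the wait budget in seconds, falling back to the default on bad input.

    Hypothetical helper: unset, empty, non-numeric, non-finite, and
    non-positive values all fall back to `default`.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("SUPERVISE_RESPONSE_TIMEOUT_SECONDS")
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    # float() accepts "nan" and "inf"; neither is a usable budget.
    if not math.isfinite(value) or value <= 0:
        return default
    return value
```

Passing `environ` explicitly keeps the helper easy to unit-test without mutating the process environment.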

## Non-goals

- No asynchronous framework or new runtime dependency.
- No replacement of the stdlib threaded HTTP server.
- No change to the host-side supervise queue format.
- No cancellation protocol between the agent and operator UI.
- No dashboard or TUI changes.

## Scope

In scope:

- `bot_bottle/supervise_server.py` request handling.
- Any small helper in `bot_bottle/supervise.py` needed to support a bounded
  wait cleanly.
- Unit tests around tool-call response waiting and route-list behavior.

Out of scope:

- Reworking proposal persistence.
- Changing egress apply or pipelock apply flows.
- Adding background workers to complete HTTP requests after the client returns.

## Design

Introduce a supervise response wait budget, for example
`SUPERVISE_RESPONSE_TIMEOUT_SECONDS`, with a documented default. The existing
poll loop should stop after that budget and return a normal tool result such as
`{"status": "pending", "notes": "operator response timed out; proposal remains queued"}`.
The exact field names should fit the existing response schema so agents can
handle success, rejection, and pending with one result parser.
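A bounded poll loop of this kind might look like the sketch below. The function name `wait_for_response`, its parameters, and the 0.25-second poll interval are assumptions for illustration; the real loop in `bot_bottle/supervise_server.py` may be structured differently. The sketch also assumes the operator side writes the response file atomically, so a partially written file is never read.

```python
import json
import time
from pathlib import Path


def wait_for_response(response_path, timeout_seconds, poll_interval=0.25,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll for the operator response file until the wait budget is spent.

    Hypothetical helper. Returns the parsed response if it appears in time,
    otherwise a normal "pending" tool result. It never touches the proposal
    file, so the proposal stays queued after a timeout.
    """
    path = Path(response_path)
    # A monotonic clock keeps the deadline immune to wall-clock adjustments.
    deadline = clock() + timeout_seconds
    while True:
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        remaining = deadline - clock()
        if remaining <= 0:
            return {
                "status": "pending",
                "notes": "operator response timed out; proposal remains queued",
            }
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

Checking for the file before checking the deadline means a response that lands just as the budget expires is still returned rather than reported as pending.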

The proposal file must remain in the queue when the HTTP call times out. The
operator can still approve or reject it later, but that later response will not
resume the original HTTP request.

Route-listing should continue to use a short HTTP timeout to egress. Errors
should be returned as tool results or JSON-RPC errors consistently with the
existing server behavior; the implementation should avoid an unbounded socket
wait in the request thread.
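One way to keep the egress call bounded is to always pass an explicit `timeout` to `urllib.request.urlopen` and convert failures into a result the caller can inspect. The sketch below is illustrative only: the function name `list_egress_routes`, the 2-second default, and the `status`/`routes`/`notes` result shape are assumptions, and the real helper should follow whatever error convention the server already uses.

```python
import json
import urllib.request


def list_egress_routes(url, timeout_seconds=2.0, opener=urllib.request.urlopen):
    """Fetch the route list from egress with a bounded socket wait.

    Hypothetical helper. `opener` is injectable so tests can substitute a
    fake without opening a real connection.
    """
    try:
        # An explicit timeout bounds the socket operations for this request.
        with opener(url, timeout=timeout_seconds) as response:
            body = response.read().decode("utf-8")
        return {"status": "ok", "routes": json.loads(body)}
    except (OSError, ValueError) as exc:
        # OSError covers urllib.error.URLError and socket timeouts;
        # ValueError covers malformed JSON and undecodable bytes.
        return {"status": "error", "notes": f"egress route listing failed: {exc}"}
```

The `timeout` argument applies to individual blocking socket operations, not to the request as a whole, so a peer that sends data very slowly could still hold the thread longer than `timeout_seconds`.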

## Testing Strategy

- Unit-test a tool call whose response appears before the timeout.
- Unit-test a tool call whose response never appears and assert the request
  returns a pending result while the proposal remains queued.
- Unit-test that invalid timeout env values fall back or fail clearly.
- Unit-test `list-egress-routes` timeout/error behavior with a fake URL opener.
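The pending case can be tested without real waiting by injecting a fake clock. The sketch below is self-contained and uses a stand-in `wait_for_response` helper; that helper, the `FakeClock` class, and the file names `proposal.json` and `response.json` are assumptions for illustration, not names taken from the codebase.

```python
import json
import tempfile
import unittest
from pathlib import Path


def wait_for_response(response_path, timeout_seconds, clock, sleep, poll_interval=0.25):
    """Hypothetical stand-in for the server's bounded wait helper."""
    deadline = clock() + timeout_seconds
    while True:
        if response_path.exists():
            return json.loads(response_path.read_text(encoding="utf-8"))
        if clock() >= deadline:
            return {"status": "pending"}
        sleep(poll_interval)


class FakeClock:
    """Deterministic clock: sleeping advances time without blocking the test."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class PendingTimeoutTest(unittest.TestCase):
    def test_missing_response_returns_pending_and_keeps_proposal(self):
        with tempfile.TemporaryDirectory() as queue_dir:
            proposal = Path(queue_dir) / "proposal.json"
            proposal.write_text(json.dumps({"tool": "example"}), encoding="utf-8")
            response = Path(queue_dir) / "response.json"  # never written
            clock = FakeClock()

            result = wait_for_response(response, 5.0, clock, clock.sleep)

            self.assertEqual(result["status"], "pending")
            # The timeout must not remove the queued proposal.
            self.assertTrue(proposal.exists())
            # The loop gave up only after the full budget elapsed.
            self.assertGreaterEqual(clock.now, 5.0)
```

Because the fake clock advances only when `sleep` is called, the test runs instantly while still exercising the full five-second budget.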

Run:

- `python3 -m unittest tests.unit.test_supervise_server`
- `python3 -m unittest discover -s tests/unit`

## Open Questions

- What default wait budget is best for agent ergonomics? A short timeout keeps
  worker threads free; a longer timeout gives an operator more time to respond
  inline before the agent has to retry.