# PRD 0035: Supervise Wait Bounds

- **Status:** Active
- **Author:** didericis-codex
- **Created:** 2026-06-02
- **Issue:** #128

## Summary

Bound the supervise sidecar's request-thread waits so an agent tool call cannot
hold an HTTP worker forever while waiting for operator action. Preserve the MCP
tool surface, but make timeout behavior explicit, observable, and tested.

## Problem

`bot_bottle/supervise_server.py` handles MCP over a threaded stdlib HTTP
server. Tool calls validate a proposal, write it to the supervise queue, and
then wait for the operator response file. Today that wait can last forever.
Each outstanding tool call consumes one server thread until the operator acts.

The route-listing helper also performs a live HTTP request to egress inside the
request thread. It has a short timeout today, but the behavior is not described
as part of a broader request-thread budget.

This is operationally risky in multi-agent or repeated-call scenarios: a stuck
or ignored proposal can accumulate blocked threads, and callers do not get a
clear "still pending" answer they can reason about.

## Goals / Success Criteria

- Add a bounded wait for operator responses to supervise tool calls.
- Return a clear JSON-RPC tool result when the proposal remains pending after
  the timeout; do not treat pending operator action as an internal server error.
- Keep the queued proposal on disk after timeout so the operator can still act.
- Make the wait duration configurable via an environment variable, with a
  conservative default.
- Preserve current success and rejection result shapes for completed operator
  responses.
- Keep `list-egress-routes` bounded and document its timeout behavior.
- Add focused tests for approved responses, timed-out pending responses, and
  route-list timeout/error handling.

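As an illustration of the configurable-wait goal, the following is a minimal sketch of reading the budget from the environment. The helper name `response_timeout_seconds` is hypothetical, and falling back to the default on invalid input is an assumption; failing clearly at startup would also satisfy this PRD.

```python
import os

# Default taken from the Design section of this PRD.
DEFAULT_RESPONSE_TIMEOUT_SECONDS = 30.0


def response_timeout_seconds(environ=None):
    """Return the operator-response wait budget in seconds.

    Hypothetical helper. Assumption: unset, empty, non-numeric, or
    non-positive values fall back to the default instead of raising.
    """
    env = os.environ if environ is None else environ
    raw = env.get("SUPERVISE_RESPONSE_TIMEOUT_SECONDS", "").strip()
    if not raw:
        return DEFAULT_RESPONSE_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_RESPONSE_TIMEOUT_SECONDS
    # Zero, negatives, NaN, and infinity are not usable bounds.
    if not (0 < value < float("inf")):
        return DEFAULT_RESPONSE_TIMEOUT_SECONDS
    return value
```

Accepting an `environ` mapping keeps the helper testable without mutating the process environment.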
## Non-goals

- No asynchronous framework or new runtime dependency.
- No replacement of the stdlib threaded HTTP server.
- No change to the host-side supervise queue format.
- No cancellation protocol between the agent and operator UI.
- No dashboard or TUI changes.

## Scope

In scope:

- `bot_bottle/supervise_server.py` request handling.
- Any small helper in `bot_bottle/supervise.py` needed to support a bounded
  wait cleanly.
- Unit tests around tool-call response waiting and route-list behavior.

Out of scope:

- Reworking proposal persistence.
- Changing egress apply or pipelock apply flows.
- Adding background workers to complete HTTP requests after the client returns.

## Design

Introduce a supervise response wait budget,
`SUPERVISE_RESPONSE_TIMEOUT_SECONDS`, with a 30-second default. The existing
poll loop should stop after that budget and return a normal tool result such as
`{"status": "pending", "notes": "operator response timed out; proposal remains queued"}`.
The exact field names should fit the existing response schema so agents can
handle success, rejection, and pending with one result parser.

The proposal file must remain in the queue when the HTTP call times out. The
operator can still approve or reject it later, but that later response will not
resume the original HTTP request.

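The bounded wait described above could look roughly like the sketch below. The function name `wait_for_response`, its parameters, and the 0.25-second poll interval are hypothetical; the sketch also assumes the operator response is a JSON file written atomically, which this PRD does not state.

```python
import json
import time
from pathlib import Path


def wait_for_response(response_path, timeout_seconds, poll_interval=0.25,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll for the operator response file until the budget is spent.

    Hypothetical helper. Returns the parsed response if it appears in
    time, otherwise a pending result. It only reads the response path
    and never deletes anything, so the proposal stays queued on timeout.
    """
    response_path = Path(response_path)
    # A monotonic clock keeps the deadline stable across wall-clock jumps.
    deadline = clock() + timeout_seconds
    while True:
        if response_path.exists():
            # Assumption: the file is complete once it is visible.
            return json.loads(response_path.read_text(encoding="utf-8"))
        remaining = deadline - clock()
        if remaining <= 0:
            return {
                "status": "pending",
                "notes": "operator response timed out; proposal remains queued",
            }
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

The `clock` and `sleep` parameters are injectable so unit tests can exercise the timeout path without real delays.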
Route-listing should continue to use a short HTTP timeout to egress. Errors
should be returned as tool results or JSON-RPC errors, consistent with the
existing server behavior; the implementation should avoid an unbounded socket
wait in the request thread.

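A minimal sketch of a bounded route-listing call follows. The function name `list_egress_routes`, the 2-second timeout, the JSON response body, and the `(routes, error)` return shape are all assumptions for illustration; how an error is surfaced to the agent should follow whatever the existing server already does.

```python
import json
import urllib.request

# Assumed value: the PRD only says the timeout is "short".
ROUTE_LIST_TIMEOUT_SECONDS = 2.0


def list_egress_routes(url, opener=urllib.request.urlopen,
                       timeout=ROUTE_LIST_TIMEOUT_SECONDS):
    """Fetch routes from egress with a bounded socket wait.

    Hypothetical helper. Returns (routes, None) on success or
    (None, error_message) on failure, and never blocks the request
    thread longer than the socket timeout allows.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8")), None
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        return None, f"egress route listing failed: {exc}"
    except ValueError as exc:
        # Covers invalid JSON and undecodable bytes.
        return None, f"egress returned an invalid response: {exc}"
```

Passing `opener` as a parameter lets tests substitute a fake URL opener that raises a timeout or returns canned bytes.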
## Testing Strategy

- Unit-test a tool call whose response appears before the timeout.
- Unit-test a tool call whose response never appears and assert the request
  returns a pending result while the proposal remains queued.
- Unit-test that invalid timeout env values fall back or fail clearly.
- Unit-test `list-egress-routes` timeout/error behavior with a fake URL opener.

Run:

- `python3 -m unittest tests.unit.test_supervise_server`
- `python3 -m unittest discover -s tests/unit`

## Open Questions

None.