bot-bottle/docs/prds/0035-supervise-wait-bounds.md

PRD 0035: Supervise Wait Bounds

  • Status: Active
  • Author: didericis-codex
  • Created: 2026-06-02
  • Issue: #128

Summary

Bound the supervise sidecar's request-thread waits so an agent tool call cannot hold an HTTP worker forever while waiting for operator action. Preserve the MCP tool surface, but make timeout behavior explicit, observable, and tested.

Problem

bot_bottle/supervise_server.py handles MCP over a threaded stdlib HTTP server. Tool calls validate a proposal, write it to the supervise queue, and then wait for the operator response file. Today that wait can last forever. Each outstanding tool call consumes one server thread until the operator acts.

The route-listing helper also performs a live HTTP request to egress inside the request thread. It has a short timeout today, but the behavior is not described as part of a broader request-thread budget.

This is operationally risky in multi-agent or repeated-call scenarios: a stuck or ignored proposal can accumulate blocked threads, and callers do not get a clear "still pending" answer they can reason about.

Goals / Success Criteria

  • Add a bounded wait for operator responses to supervise tool calls.
  • Return a clear JSON-RPC tool result when the proposal remains pending after the timeout; do not treat pending operator action as an internal server error.
  • Keep the queued proposal on disk after timeout so the operator can still act.
  • Make the wait duration configurable through an environment variable, with a conservative default.
  • Preserve current success and rejection result shapes for completed operator responses.
  • Keep list-egress-routes bounded and document its timeout behavior.
  • Add focused tests for approved responses, timed-out pending responses, and route-list timeout/error handling.

Non-goals

  • No asynchronous framework or new runtime dependency.
  • No replacement of the stdlib threaded HTTP server.
  • No change to the host-side supervise queue format.
  • No cancellation protocol between the agent and operator UI.
  • No dashboard or TUI changes.

Scope

In scope:

  • bot_bottle/supervise_server.py request handling.
  • Any small helper in bot_bottle/supervise.py needed to support a bounded wait cleanly.
  • Unit tests around tool-call response waiting and route-list behavior.

Out of scope:

  • Reworking proposal persistence.
  • Changing egress apply or pipelock apply flows.
  • Adding background workers to complete HTTP requests after the client returns.

Design

Introduce a supervise response wait budget, SUPERVISE_RESPONSE_TIMEOUT_SECONDS, with a 30-second default. The existing poll loop should stop once that budget is spent and return a normal tool result such as {"status": "pending", "notes": "operator response timed out; proposal remains queued"}. The exact field names should fit the existing response schema so agents can handle success, rejection, and pending with one result parser.
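As a rough illustration of the bounded poll loop described above, the sketch below waits for a response file until a deadline and then returns a pending result. The function name wait_for_response, the poll_interval parameter, and the JSON-on-disk response file are assumptions made for this example; they are not taken from bot_bottle/supervise_server.py. The sketch also assumes the operator side writes the response file atomically, so a file that exists is complete.

```python
import json
import time
from pathlib import Path

DEFAULT_TIMEOUT_SECONDS = 30.0  # matches the default proposed in this PRD


def wait_for_response(response_path, timeout_seconds=DEFAULT_TIMEOUT_SECONDS,
                      poll_interval=0.25):
    """Poll for the operator response file until the wait budget is spent.

    Hypothetical helper: returns the parsed operator response if it appears
    in time, otherwise a pending tool result. It never touches the proposal
    file, so the proposal stays queued after a timeout.
    """
    response_path = Path(response_path)
    # Use a monotonic clock so wall-clock adjustments cannot extend the wait.
    deadline = time.monotonic() + timeout_seconds
    while True:
        if response_path.exists():
            return json.loads(response_path.read_text(encoding="utf-8"))
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return {
                "status": "pending",
                "notes": "operator response timed out; proposal remains queued",
            }
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

Checking for the file before checking the deadline means a response that arrives during the final sleep is still returned rather than reported as pending.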

The proposal file must remain in the queue when the HTTP call times out. The operator can still approve or reject it later, but that later response will not resume the original HTTP request.

Route-listing should continue to use a short HTTP timeout to egress. Errors should be returned as tool results or JSON-RPC errors, consistent with the existing server behavior; the implementation should avoid an unbounded socket wait in the request thread.
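One way to keep the route-listing call bounded and testable is to pass the timeout explicitly and let tests inject the opener. The sketch below is illustrative only: the function signature, the 2-second timeout value, the JSON response body, and the {"status": "error", ...} result shape are assumptions, not the existing server behavior.

```python
import json
import urllib.request

ROUTE_LIST_TIMEOUT_SECONDS = 2.0  # hypothetical value; the PRD only says "short"


def list_egress_routes(url, opener=urllib.request.urlopen,
                       timeout=ROUTE_LIST_TIMEOUT_SECONDS):
    """Fetch the egress route list with a bounded socket wait.

    Hypothetical helper: `opener` defaults to urllib.request.urlopen and can
    be replaced by a fake in unit tests.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            routes = json.loads(resp.read().decode("utf-8"))
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        return {"status": "error", "notes": f"egress route listing failed: {exc}"}
    except ValueError as exc:
        # Covers invalid JSON and undecodable bytes.
        return {"status": "error", "notes": f"egress returned invalid data: {exc}"}
    return {"status": "ok", "routes": routes}
```

A test can then pass a fake opener that raises TimeoutError or returns an in-memory body such as io.BytesIO, without opening a real socket.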

Testing Strategy

  • Unit-test a tool call whose response appears before the timeout.
  • Unit-test a tool call whose response never appears and assert the request returns a pending result while the proposal remains queued.
  • Unit-test that invalid timeout env values fall back or fail clearly.
  • Unit-test list-egress-routes timeout/error behavior with a fake URL opener.
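The invalid-env-value case could be tested along these lines. This sketch picks the "fall back" option from the list above and treats missing, non-numeric, non-finite, and non-positive values as invalid; the helper name response_timeout_seconds and the test class are hypothetical, and the real implementation may choose to fail clearly instead.

```python
import math
import os
import unittest

ENV_NAME = "SUPERVISE_RESPONSE_TIMEOUT_SECONDS"
DEFAULT_TIMEOUT_SECONDS = 30.0


def response_timeout_seconds(environ=os.environ):
    """Read the wait budget from the environment, falling back to the default.

    Hypothetical helper: accepts any mapping so tests need not mutate
    os.environ.
    """
    raw = environ.get(ENV_NAME, "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    # Reject nan, inf, zero, and negative budgets.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_TIMEOUT_SECONDS
    return value


class ResponseTimeoutEnvTest(unittest.TestCase):
    def test_unset_uses_default(self):
        self.assertEqual(response_timeout_seconds({}), DEFAULT_TIMEOUT_SECONDS)

    def test_valid_value_is_used(self):
        self.assertEqual(response_timeout_seconds({ENV_NAME: "5"}), 5.0)

    def test_invalid_values_fall_back(self):
        for raw in ("abc", "-1", "0", "inf", "nan", ""):
            with self.subTest(raw=raw):
                self.assertEqual(
                    response_timeout_seconds({ENV_NAME: raw}),
                    DEFAULT_TIMEOUT_SECONDS,
                )
```

Taking the environment as a mapping argument keeps the tests independent of process state and of each other.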

Run:

  • python3 -m unittest tests.unit.test_supervise_server
  • python3 -m unittest discover -s tests/unit

Open Questions

None.