bot-bottle/docs/prds/0034-sidecar-restart-shutdown-semantics.md

PRD 0034: Sidecar Restart and Shutdown Semantics

  • Status: Active
  • Author: didericis-codex
  • Created: 2026-06-02
  • Issue: #126

Summary

Make the sidecar bundle supervisor's signal, restart, and exit-code behavior explicit and easier to reason about. In particular, move pipelock restart work out of direct SIGUSR1 handler execution while preserving the caller-visible docker kill --signal USR1 <bundle> contract used by pipelock apply flows.

Problem

bot_bottle/sidecar_init.py is PID 1 for the bundled sidecar container. It starts egress, pipelock, git-gate/git-http, and supervise; forwards shutdown signals; forwards SIGHUP to egress; and restarts pipelock on SIGUSR1 after allowlist changes.

The current SIGUSR1 handler calls sup.restart_daemon("pipelock") directly. That restart path can terminate a child, wait up to a grace timeout, SIGKILL a stubborn child, spawn a replacement with subprocess.Popen, and start a new pump thread. In CPython, signal handlers run between bytecodes in the main thread, so this is not the same hazard as async-signal-unsafe C code under POSIX. It still means that signal handling can block the supervisor loop for the full restart grace window, and it makes stacked signals harder to reason about.

The exit-code contract is also easy to misread. _Supervisor.exit_code() returns the maximum child return code. That preserves a positive crash code when a child failed before graceful shutdown, but the docstring currently frames graceful shutdown as returning zero because signal-killed children have negative return codes. The implementation is reasonable; the contract needs to be deliberate and tested around crash-then-shutdown behavior.
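
To make the ambiguity concrete, the sketch below applies a bare max() to a few sets of child return codes. It assumes POSIX subprocess semantics, where a child killed by signal N reports a return code of -N. max_child_returncode is a hypothetical stand-in, not the real _Supervisor.exit_code(), which may clamp or filter values differently.

```python
import signal


def max_child_returncode(returncodes):
    """Hypothetical stand-in: return the largest child return code."""
    return max(returncodes)


# How subprocess reports a SIGTERM-killed child on POSIX (assumption: -15).
TERM = -int(signal.SIGTERM)

# A child crashed with code 3 before the others were signalled.
crash_then_shutdown = max_child_returncode([3, TERM, TERM])

# One child exited cleanly, the rest were signalled.
mixed_shutdown = max_child_returncode([0, TERM])

# Every child was signalled: a bare max() stays negative in this sketch.
signal_only = max_child_returncode([TERM, TERM])
```

In this sketch the crash code survives and the mixed case yields zero, but the signal-only case is negative. Whether the supervisor reports zero there depends on something beyond max() itself, which is the part of the contract that needs to be stated and tested.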

Goals / Success Criteria

  • Preserve the external signal contract:
    • SIGTERM/SIGINT requests bundle shutdown.
    • SIGHUP still forwards to the live egress child.
    • SIGUSR1 still requests an in-place pipelock restart.
  • Keep signal handlers small: handlers should record intent and return, not perform blocking subprocess lifecycle work directly.
  • Process queued restart requests from the supervisor's main loop so restart behavior is serialized with tick() and shutdown state.
  • Avoid respawning children after shutdown has started.
  • Coalesce or serialize repeated pipelock restart requests in a documented way so stacked SIGUSR1 delivery cannot overlap restarts.
  • Clarify and test aggregate exit-code semantics:
    • Clean unattended exits return zero when every child exits zero.
    • Signal-only shutdown does not invent a positive failure code.
    • A positive child crash before shutdown remains visible on supervisor exit.
  • Keep the implementation stdlib-only.

Non-goals

  • No new process supervisor dependency such as supervisord, s6, or runit.
  • No automatic restart policy for arbitrary unexpected child death.
  • No changes to the bundle's daemon set, daemon argv, env filtering, or Docker compose contract.
  • No changes to egress route reload semantics beyond preserving SIGHUP forwarding.
  • No user-facing CLI changes.

Scope

In scope:

  • Internal signal handling and supervisor event-loop structure in bot_bottle/sidecar_init.py.
  • Tests in tests/unit/test_sidecar_init.py for queued restart behavior, shutdown/restart ordering, repeated restart requests, and exit-code semantics.
  • Docstring/comment updates that describe the concrete signal and exit-code contracts.

Out of scope:

  • Changing pipelock itself to reload config in process.
  • Restarting egress, git-gate, git-http, or supervise on demand.
  • Reporting restart events to the supervise MCP plane.
  • Changing the interim policy that unexpected child death leaves surviving daemons running.

Design

Keep _Supervisor as the owner of child process state, but add an explicit pending-action boundary between signal delivery and subprocess lifecycle work. The exact API can be small, for example:

  • request_shutdown(reason) keeps its existing idempotent behavior.
  • request_restart(daemon_name) records a pending restart request unless shutdown is already in progress.
  • tick() drains pending restart work before or after child-death logging in a documented order.

The SIGUSR1 handler should call only the non-blocking request method. The main loop should continue to call tick() and sleep on _POLL_INTERVAL; tick() then performs the actual restart_daemon("pipelock") work while normal Python control flow is in the supervisor loop.
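
A minimal sketch of that boundary follows. SketchSupervisor, make_sigusr1_handler, and install_handlers are hypothetical names used only for illustration. restart_daemon here merely records the call, where the real method would terminate, wait for, and respawn the child.

```python
import signal


class SketchSupervisor:
    """Hypothetical stand-in for _Supervisor; models only request vs. drain."""

    def __init__(self):
        self.shutting_down = False
        self.pending_restarts = set()
        self.restarted = []  # demo-only record of restarts performed

    def request_restart(self, daemon_name):
        # Cheap and non-blocking, so it is reasonable to call from a handler.
        if not self.shutting_down:
            self.pending_restarts.add(daemon_name)

    def restart_daemon(self, daemon_name):
        # Placeholder for the blocking terminate/wait/respawn work.
        self.restarted.append(daemon_name)

    def tick(self):
        # Called from the main loop, so blocking work happens here.
        while self.pending_restarts and not self.shutting_down:
            self.restart_daemon(self.pending_restarts.pop())


def make_sigusr1_handler(sup):
    def handler(signum, frame):
        sup.request_restart("pipelock")  # record intent and return
    return handler


def install_handlers(sup):
    # Registration as main() might do it (POSIX only; not called in this demo).
    signal.signal(signal.SIGUSR1, make_sigusr1_handler(sup))


sup = SketchSupervisor()
handler = make_sigusr1_handler(sup)
handler(None, None)  # simulate signal delivery without sending a real signal
restarted_before_tick = list(sup.restarted)
sup.tick()  # the main loop turn performs the restart
```

The handler leaves no trace in restarted until tick() runs, which is the property the design asks for: the restart cost is paid in the loop, not inside signal delivery.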

Repeated restart requests should not overlap. Restart requests coalesce by daemon name: if three SIGUSR1 signals arrive before the next loop turn, one pipelock restart is enough because each restart rereads the latest pipelock.yaml from disk. This treats SIGUSR1 as "make pipelock reflect the current config" rather than "run exactly one restart per signal."
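
The difference between the two readings of SIGUSR1 can be sketched with two tiny drain functions. Both are hypothetical illustrations, not code from sidecar_init.py: one keeps a queue entry per signal, the other keys pending work by daemon name.

```python
from collections import deque


def drain_per_signal(requests):
    """One restart per delivered signal (the behavior this PRD rejects)."""
    queue = deque(requests)
    restarts = []
    while queue:
        restarts.append(queue.popleft())
    return restarts


def drain_coalesced(requests):
    """One restart per daemon name, however many signals arrived."""
    pending = dict.fromkeys(requests)  # de-duplicates, keeps first-seen order
    return list(pending)


three_signals = ["pipelock", "pipelock", "pipelock"]
per_signal = drain_per_signal(three_signals)
coalesced = drain_coalesced(three_signals)
```

With three signals before the next loop turn, the per-signal drain would restart pipelock three times, while the coalesced drain restarts it once. A set would serve equally well as the pending structure, since only one daemon name is in scope.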

Shutdown wins over restart. If SIGTERM/SIGINT is received while a restart is pending, the supervisor should drop the pending restart and terminate live children. If shutdown starts while restart_daemon is already executing in the main loop, the existing restart operation may finish, but no additional queued restart should start after shutdown state is set. A simpler implementation may check shutdown only before each queued restart: a handler that fires during restart_daemon only records shutdown state, so the in-flight restart runs to completion and the next check sees that state before any further restart begins.
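
The ordering rule can be sketched as follows, assuming a hypothetical RestartQueue that checks shutdown state before each queued restart. The callback simulates a stacked SIGUSR1 and a SIGTERM arriving while the first restart is still in flight.

```python
class RestartQueue:
    """Hypothetical sketch of 'shutdown wins over restart'."""

    def __init__(self):
        self.shutdown_reason = None
        self.pending = set()
        self.restarted = []

    def request_shutdown(self, reason):
        if self.shutdown_reason is None:  # idempotent: first reason sticks
            self.shutdown_reason = reason
        self.pending.clear()  # drop restarts that have not started

    def request_restart(self, name):
        if self.shutdown_reason is None:
            self.pending.add(name)

    def drain(self, restart):
        while self.pending:
            if self.shutdown_reason is not None:
                break  # checked before each queued restart
            restart(self.pending.pop())


queue = RestartQueue()


def restart_with_signals_mid_flight(name):
    queue.restarted.append(name)
    queue.request_restart("pipelock")  # stacked SIGUSR1 during the restart
    queue.request_shutdown("SIGTERM")  # shutdown arrives before it finishes


queue.request_restart("pipelock")
queue.drain(restart_with_signals_mid_flight)
queue.request_restart("pipelock")  # ignored: shutdown already in progress
```

The in-flight restart completes, the stacked request is dropped, and later requests are ignored, so no child is respawned after shutdown has started.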

Exit-code behavior should be documented as "positive failures win, otherwise return zero." Positive process failures remain visible, while a clean shutdown of only zero-exit or signal-terminated children returns zero instead of leaking platform-specific negative signal return codes to the container exit status.
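
One way to express that rule is sketched below. aggregate_exit_code is a hypothetical helper, and picking the largest positive code when several children fail is an assumption carried over from the existing maximum-based behavior, not something this document specifies.

```python
def aggregate_exit_code(returncodes):
    """Positive failures win; otherwise return zero.

    Negative values (children terminated by a signal, as subprocess
    reports them on POSIX) never reach the container exit status.
    """
    positive = [rc for rc in returncodes if rc > 0]
    # Assumption: with several failures, report the largest positive code.
    return max(positive, default=0)


clean_exit = aggregate_exit_code([0, 0, 0])
signal_only_shutdown = aggregate_exit_code([-15, -15, -9])
crash_then_shutdown = aggregate_exit_code([3, -15, -15])
```

The three cases line up with the exit-code goals: clean exits and signal-only shutdown both yield zero, and a positive crash code recorded before shutdown stays visible.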

Implementation Chunks

  1. Add characterization tests for SIGUSR1 queuing, repeated restart coalescing, shutdown dropping pending restarts, and crash-then-shutdown exit codes.
  2. Add a pending restart request structure to _Supervisor and a non-blocking request method.
  3. Change the SIGUSR1 handler in main() to enqueue the pipelock restart instead of calling restart_daemon directly.
  4. Drain pending restarts from tick() with shutdown checks and documented ordering.
  5. Update docstrings and comments around signal handling and exit_code().

Testing Strategy

Run the existing sidecar unit tests:

  • python3 -m unittest tests.unit.test_sidecar_init

Add focused unit tests that avoid process-wide signal handler races where possible by driving _Supervisor directly. End-to-end signal tests can remain limited to main() behavior that cannot be exercised otherwise.
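
A sketch of what such tests might look like is below. It drives a hypothetical _StandInSupervisor rather than the real _Supervisor, and patches restart_daemon with unittest.mock so no subprocess or process-wide signal handler is involved. The method names mirror the Design section but are assumptions about the final API.

```python
import unittest
from unittest import mock


class _StandInSupervisor:
    """Hypothetical minimal stand-in for the real _Supervisor."""

    def __init__(self):
        self._pending = set()
        self._shutdown = False

    def request_shutdown(self, reason):
        self._shutdown = True
        self._pending.clear()

    def request_restart(self, name):
        if not self._shutdown:
            self._pending.add(name)

    def restart_daemon(self, name):
        raise NotImplementedError("patched out in tests")

    def tick(self):
        while self._pending and not self._shutdown:
            self.restart_daemon(self._pending.pop())


class QueuedRestartTests(unittest.TestCase):
    def setUp(self):
        self.sup = _StandInSupervisor()
        patcher = mock.patch.object(self.sup, "restart_daemon")
        self.restart = patcher.start()
        self.addCleanup(patcher.stop)

    def test_request_does_not_restart_until_tick(self):
        self.sup.request_restart("pipelock")
        self.restart.assert_not_called()
        self.sup.tick()
        self.restart.assert_called_once_with("pipelock")

    def test_repeated_requests_coalesce(self):
        for _ in range(3):
            self.sup.request_restart("pipelock")
        self.sup.tick()
        self.assertEqual(self.restart.call_count, 1)

    def test_shutdown_drops_pending_restart(self):
        self.sup.request_restart("pipelock")
        self.sup.request_shutdown("SIGTERM")
        self.sup.tick()
        self.restart.assert_not_called()
```

Because the tests call request_restart, request_shutdown, and tick() directly, they stay deterministic and do not depend on signal delivery timing.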

Also run the full unit suite before merge:

  • python3 -m unittest discover -s tests/unit

Open Questions

None.