# PRD 0034: Sidecar Restart and Shutdown Semantics

- **Status:** Active
- **Author:** didericis-codex
- **Created:** 2026-06-02
- **Issue:** #126

## Summary

Make the sidecar bundle supervisor's signal, restart, and exit-code behavior explicit and easier to reason about. In particular, move pipelock restart work out of direct SIGUSR1 handler execution while preserving the caller-visible `docker kill --signal USR1 ` contract used by pipelock apply flows.

## Problem

`bot_bottle/sidecar_init.py` is PID 1 for the bundled sidecar container. It starts egress, pipelock, git-gate/git-http, and supervise; forwards shutdown signals; forwards SIGHUP to egress; and restarts pipelock on SIGUSR1 after allowlist changes.

The current SIGUSR1 handler calls `sup.restart_daemon("pipelock")` directly. That restart path can terminate a child, wait up to a grace timeout, SIGKILL a stubborn child, spawn a replacement with `subprocess.Popen`, and start a new pump thread. In CPython, signal handlers run between bytecodes in the main thread, so this is not the same as POSIX async-signal-unsafe C code, but it still means signal handling can block the supervisor loop for the restart grace window and makes stacked signals harder to reason about.

The exit-code contract is also easy to misread. `_Supervisor.exit_code()` returns the maximum child return code. That preserves a positive crash code when a child failed before graceful shutdown, but the docstring currently frames graceful shutdown as returning zero because signal-killed children have negative return codes. The implementation is reasonable; the contract needs to be deliberate and tested around crash-then-shutdown behavior.

## Goals / Success Criteria

- Preserve the external signal contract:
  - SIGTERM/SIGINT requests bundle shutdown.
  - SIGHUP still forwards to the live egress child.
  - SIGUSR1 still requests an in-place pipelock restart.
- Keep signal handlers small: handlers should record intent and return, not perform blocking subprocess lifecycle work directly.
- Process queued restart requests from the supervisor's main loop so restart behavior is serialized with `tick()` and shutdown state.
- Avoid respawning children after shutdown has started.
- Coalesce or serialize repeated pipelock restart requests in a documented way so stacked SIGUSR1 delivery cannot overlap restarts.
- Clarify and test aggregate exit-code semantics:
  - clean unattended exits return zero when every child exits zero.
  - signal-only shutdown does not invent a positive failure code.
  - a positive child crash before shutdown remains visible on supervisor exit.
- Keep the implementation stdlib-only.

## Non-goals

- No new process supervisor dependency such as supervisord, s6, or runit.
- No automatic restart policy for arbitrary unexpected child death.
- No changes to the bundle's daemon set, daemon argv, env filtering, or Docker compose contract.
- No changes to egress route reload semantics beyond preserving SIGHUP forwarding.
- No user-facing CLI changes.

## Scope

In scope:

- Internal signal handling and supervisor event-loop structure in `bot_bottle/sidecar_init.py`.
- Tests in `tests/unit/test_sidecar_init.py` for queued restart behavior, shutdown/restart ordering, repeated restart requests, and exit-code semantics.
- Docstring/comment updates that describe the concrete signal and exit-code contracts.

Out of scope:

- Changing pipelock itself to reload config in process.
- Restarting egress, git-gate, git-http, or supervise on demand.
- Reporting restart events to the supervise MCP plane.
- Changing the interim policy that unexpected child death leaves surviving daemons running.

## Design

Keep `_Supervisor` as the owner of child process state, but add an explicit pending-action boundary between signal delivery and subprocess lifecycle work.
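As a rough illustration of that boundary, the sketch below shows one possible shape for it. It is not the real `_Supervisor`: the class name `SupervisorSketch`, the attributes `_pending_restarts` and `_shutting_down`, and the injected `restart_daemon` callable are all assumptions made for the example, so the queueing logic can be exercised without spawning processes. It coalesces requests by daemon name and lets shutdown win over any queued restart.

```python
from typing import Callable, List, Set


class SupervisorSketch:
    """Hypothetical stand-in for `_Supervisor`; illustrates queueing only."""

    def __init__(self, restart_daemon: Callable[[str], None]) -> None:
        # Injected callable standing in for the blocking restart work.
        self._restart_daemon = restart_daemon
        # A set coalesces repeated requests for the same daemon name.
        self._pending_restarts: Set[str] = set()
        self._shutting_down = False

    def request_shutdown(self, reason: str) -> None:
        """Idempotent. Shutdown wins, so queued restarts are dropped."""
        if self._shutting_down:
            return
        self._shutting_down = True
        self._pending_restarts.clear()

    def request_restart(self, daemon_name: str) -> bool:
        """Record intent and return; cheap enough for a signal handler."""
        if self._shutting_down:
            return False
        self._pending_restarts.add(daemon_name)
        return True

    def tick(self) -> List[str]:
        """Drain queued restarts from the main loop; return what ran."""
        # Swap in a fresh set so requests arriving mid-drain are kept
        # for the next tick instead of mutating the set being iterated.
        pending, self._pending_restarts = self._pending_restarts, set()
        restarted: List[str] = []
        for name in sorted(pending):
            # Re-check before each restart: shutdown may have started
            # while an earlier restart in this drain was running.
            if self._shutting_down:
                break
            self._restart_daemon(name)
            restarted.append(name)
        return restarted


if __name__ == "__main__":
    calls: List[str] = []
    sup = SupervisorSketch(calls.append)
    for _ in range(3):  # three stacked SIGUSR1 deliveries
        sup.request_restart("pipelock")
    sup.tick()
    print(calls)  # ['pipelock']
```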
The exact API can be small, for example:

- `request_shutdown(reason)` keeps its existing idempotent behavior.
- `request_restart(daemon_name)` records a pending restart request unless shutdown is already in progress.
- `tick()` drains pending restart work before or after child-death logging in a documented order.

The SIGUSR1 handler should call only the non-blocking request method. The main loop should continue to call `tick()` and sleep on `_POLL_INTERVAL`; `tick()` then performs the actual `restart_daemon("pipelock")` work while normal Python control flow is in the supervisor loop.

Repeated restart requests should not overlap. Restart requests coalesce by daemon name: if three SIGUSR1 signals arrive before the next loop turn, one pipelock restart is enough because each restart rereads the latest `pipelock.yaml` from disk. This treats SIGUSR1 as "make pipelock reflect the current config" rather than "run exactly one restart per signal."

Shutdown wins over restart. If SIGTERM/SIGINT is received while a restart is pending, the supervisor should drop the pending restart and terminate live children. If shutdown starts while `restart_daemon` is already executing in the main loop, the existing restart operation may finish, but no additional queued restart should start after shutdown state is set. A simpler implementation may check shutdown only before each queued restart, because signal handlers execute between bytecodes and cannot interrupt a single blocking `wait()` until control returns to Python.

Exit-code behavior should be documented as "positive failures win, otherwise return zero." Positive process failures remain visible, while a clean shutdown of only zero-exit or signal-terminated children returns zero instead of leaking platform-specific negative signal return codes to the container exit status.

## Implementation Chunks

1. Add characterization tests for SIGUSR1 queuing, repeated restart coalescing, shutdown dropping pending restarts, and crash-then-shutdown exit codes.
2. Add a pending restart request structure to `_Supervisor` and a non-blocking request method.
3. Change the SIGUSR1 handler in `main()` to enqueue the pipelock restart instead of calling `restart_daemon` directly.
4. Drain pending restarts from `tick()` with shutdown checks and documented ordering.
5. Update docstrings and comments around signal handling and `exit_code()`.

## Testing Strategy

Run the existing sidecar unit tests:

- `python3 -m unittest tests.unit.test_sidecar_init`

Add focused unit tests that avoid process-wide signal handler races where possible by driving `_Supervisor` directly. End-to-end signal tests can remain limited to `main()` behavior that cannot be exercised otherwise.

Also run the full unit suite before merge:

- `python3 -m unittest discover -s tests/unit`

## Open Questions

None.