diff --git a/docs/prds/0034-sidecar-restart-shutdown-semantics.md b/docs/prds/0034-sidecar-restart-shutdown-semantics.md new file mode 100644 index 0000000..3ac997c --- /dev/null +++ b/docs/prds/0034-sidecar-restart-shutdown-semantics.md @@ -0,0 +1,157 @@ +# PRD 0034: Sidecar Restart and Shutdown Semantics + +- **Status:** Draft +- **Author:** didericis-codex +- **Created:** 2026-06-02 +- **Issue:** #126 + +## Summary + +Make the sidecar bundle supervisor's signal, restart, and exit-code behavior +explicit and easier to reason about. In particular, move pipelock restart work +out of direct SIGUSR1 handler execution while preserving the caller-visible +`docker kill --signal USR1 ` contract used by pipelock apply flows. + +## Problem + +`bot_bottle/sidecar_init.py` is PID 1 for the bundled sidecar container. It +starts egress, pipelock, git-gate/git-http, and supervise; forwards shutdown +signals; forwards SIGHUP to egress; and restarts pipelock on SIGUSR1 after +allowlist changes. + +The current SIGUSR1 handler calls `sup.restart_daemon("pipelock")` directly. +That restart path can terminate a child, wait up to a grace timeout, SIGKILL a +stubborn child, spawn a replacement with `subprocess.Popen`, and start a new +pump thread. In CPython signal handlers run between bytecodes in the main +thread, so this is not the same as POSIX async-signal-unsafe C code, but it +still means signal handling can block the supervisor loop for the restart grace +window and makes stacked signals harder to reason about. + +The exit-code contract is also easy to misread. `_Supervisor.exit_code()` +returns the maximum child return code. That preserves a positive crash code +when a child failed before graceful shutdown, but the docstring currently +frames graceful shutdown as returning zero because signal-killed children have +negative return codes. The implementation is reasonable; the contract needs to +be deliberate and tested around crash-then-shutdown behavior. + +## Goals / Success Criteria + +- Preserve the external signal contract: + - SIGTERM/SIGINT requests bundle shutdown. + - SIGHUP still forwards to the live egress child. + - SIGUSR1 still requests an in-place pipelock restart. +- Keep signal handlers small: handlers should record intent and return, not + perform blocking subprocess lifecycle work directly. +- Process queued restart requests from the supervisor's main loop so restart + behavior is serialized with `tick()` and shutdown state. +- Avoid respawning children after shutdown has started. +- Coalesce or serialize repeated pipelock restart requests in a documented way + so stacked SIGUSR1 delivery cannot overlap restarts. +- Clarify and test aggregate exit-code semantics: + - clean unattended exits return zero when every child exits zero. + - signal-only shutdown does not invent a positive failure code. + - a positive child crash before shutdown remains visible on supervisor exit. +- Keep the implementation stdlib-only. + +## Non-goals + +- No new process supervisor dependency such as supervisord, s6, or runit. +- No automatic restart policy for arbitrary unexpected child death. +- No changes to the bundle's daemon set, daemon argv, env filtering, or Docker + compose contract. +- No changes to egress route reload semantics beyond preserving SIGHUP + forwarding. +- No user-facing CLI changes. + +## Scope + +In scope: + +- Internal signal handling and supervisor event-loop structure in + `bot_bottle/sidecar_init.py`. +- Tests in `tests/unit/test_sidecar_init.py` for queued restart behavior, + shutdown/restart ordering, repeated restart requests, and exit-code + semantics. +- Docstring/comment updates that describe the concrete signal and exit-code + contracts. + +Out of scope: + +- Changing pipelock itself to reload config in process. +- Restarting egress, git-gate, git-http, or supervise on demand. +- Reporting restart events to the supervise MCP plane. +- Changing the interim policy that unexpected child death leaves surviving + daemons running. + +## Design + +Keep `_Supervisor` as the owner of child process state, but add an explicit +pending-action boundary between signal delivery and subprocess lifecycle work. +The exact API can be small, for example: + +- `request_shutdown(reason)` keeps its existing idempotent behavior. +- `request_restart(daemon_name)` records a pending restart request unless + shutdown is already in progress. +- `tick()` drains pending restart work before or after child-death logging in a + documented order. + +The SIGUSR1 handler should call only the non-blocking request method. The main +loop should continue to call `tick()` and sleep on `_POLL_INTERVAL`; `tick()` +then performs the actual `restart_daemon("pipelock")` work while normal Python +control flow is in the supervisor loop. + +Repeated restart requests should not overlap. Either coalescing or FIFO +serialization is acceptable, but the PRD prefers coalescing by daemon name: if +three SIGUSR1 signals arrive before the next loop turn, one pipelock restart is +enough because each restart rereads the latest `pipelock.yaml` from disk. +Document this because it is a semantic choice. + +Shutdown wins over restart. If SIGTERM/SIGINT is received while a restart is +pending, the supervisor should drop the pending restart and terminate live +children. If shutdown starts while `restart_daemon` is already executing in the +main loop, the existing restart operation may finish, but no additional queued +restart should start after shutdown state is set. A simpler implementation may +check shutdown only before each queued restart, because signal handlers execute +between bytecodes and cannot interrupt a single blocking `wait()` until control +returns to Python. + +Exit-code behavior should be documented as "positive failures win, otherwise +return the maximum observed child return code." That matches the current intent: +positive process failures remain visible, while a clean shutdown of only +signal-terminated children does not hide an earlier crash. + +## Implementation Chunks + +1. Add characterization tests for SIGUSR1 queuing, repeated restart coalescing, + shutdown dropping pending restarts, and crash-then-shutdown exit codes. +2. Add a pending restart request structure to `_Supervisor` and a + non-blocking request method. +3. Change the SIGUSR1 handler in `main()` to enqueue the pipelock restart + instead of calling `restart_daemon` directly. +4. Drain pending restarts from `tick()` with shutdown checks and documented + ordering. +5. Update docstrings and comments around signal handling and `exit_code()`. + +## Testing Strategy + +Run the existing sidecar unit tests: + +- `python3 -m unittest tests.unit.test_sidecar_init` + +Add focused unit tests that avoid process-wide signal handler races where +possible by driving `_Supervisor` directly. End-to-end signal tests can remain +limited to `main()` behavior that cannot be exercised otherwise. + +Also run the full unit suite before merge: + +- `python3 -m unittest discover -s tests/unit` + +## Open Questions + +- Should repeated restart requests be coalesced by daemon name, or should the + supervisor preserve every queued request? Coalescing is simpler and appears + sufficient because pipelock rereads the latest config on restart. +- Should exit-code handling clamp all negative signal return codes to zero + when no positive child failure occurred, or should it continue returning the + maximum raw child return code? The current tests tolerate platform-specific + negative signal codes; tightening this would be a behavior change.