PRD 0034: Sidecar Restart and Shutdown Semantics
- Status: Active
- Author: didericis-codex
- Created: 2026-06-02
- Issue: #126
Summary
Make the sidecar bundle supervisor's signal, restart, and exit-code behavior
explicit and easier to reason about. In particular, move pipelock restart work
out of direct SIGUSR1 handler execution while preserving the caller-visible
docker kill --signal USR1 <bundle> contract used by pipelock apply flows.
Problem
bot_bottle/sidecar_init.py is PID 1 for the bundled sidecar container. It
starts egress, pipelock, git-gate/git-http, and supervise; forwards shutdown
signals; forwards SIGHUP to egress; and restarts pipelock on SIGUSR1 after
allowlist changes.
The current SIGUSR1 handler calls sup.restart_daemon("pipelock") directly.
That restart path can terminate a child, wait up to a grace timeout, SIGKILL a
stubborn child, spawn a replacement with subprocess.Popen, and start a new
pump thread. In CPython, signal handlers run between bytecodes in the main
thread, so this is not the same hazard as running async-signal-unsafe C code
in a POSIX handler. It still means a signal handler can block the supervisor
loop for the whole restart grace window, and it makes stacked signals harder
to reason about.
The exit-code contract is also easy to misread. _Supervisor.exit_code()
returns the maximum child return code. That preserves a positive crash code
when a child failed before graceful shutdown, but the docstring currently
frames graceful shutdown as returning zero because signal-killed children have
negative return codes. The implementation is reasonable; the contract needs to
be deliberate and tested around crash-then-shutdown behavior.
Goals / Success Criteria
- Preserve the external signal contract:
  - SIGTERM/SIGINT requests bundle shutdown.
  - SIGHUP still forwards to the live egress child.
  - SIGUSR1 still requests an in-place pipelock restart.
- Keep signal handlers small: handlers should record intent and return, not perform blocking subprocess lifecycle work directly.
- Process queued restart requests from the supervisor's main loop so restart behavior is serialized with tick() and shutdown state.
- Avoid respawning children after shutdown has started.
- Coalesce or serialize repeated pipelock restart requests in a documented way so stacked SIGUSR1 delivery cannot overlap restarts.
- Clarify and test aggregate exit-code semantics:
  - Clean unattended exits return zero when every child exits zero.
  - Signal-only shutdown does not invent a positive failure code.
  - A positive child crash before shutdown remains visible on supervisor exit.
- Keep the implementation stdlib-only.
Non-goals
- No new process supervisor dependency such as supervisord, s6, or runit.
- No automatic restart policy for arbitrary unexpected child death.
- No changes to the bundle's daemon set, daemon argv, env filtering, or Docker compose contract.
- No changes to egress route reload semantics beyond preserving SIGHUP forwarding.
- No user-facing CLI changes.
Scope
In scope:
- Internal signal handling and supervisor event-loop structure in bot_bottle/sidecar_init.py.
- Tests in tests/unit/test_sidecar_init.py for queued restart behavior, shutdown/restart ordering, repeated restart requests, and exit-code semantics.
- Docstring/comment updates that describe the concrete signal and exit-code contracts.
Out of scope:
- Changing pipelock itself to reload config in process.
- Restarting egress, git-gate, git-http, or supervise on demand.
- Reporting restart events to the supervise MCP plane.
- Changing the interim policy that unexpected child death leaves surviving daemons running.
Design
Keep _Supervisor as the owner of child process state, but add an explicit
pending-action boundary between signal delivery and subprocess lifecycle work.
The exact API can be small, for example:
- request_shutdown(reason) keeps its existing idempotent behavior.
- request_restart(daemon_name) records a pending restart request unless shutdown is already in progress.
- tick() drains pending restart work before or after child-death logging in a documented order.
The SIGUSR1 handler should call only the non-blocking request method. The main
loop should continue to call tick() and sleep on _POLL_INTERVAL; tick()
then performs the actual restart_daemon("pipelock") work while normal Python
control flow is in the supervisor loop.
Repeated restart requests should not overlap. Restart requests coalesce by
daemon name: if three SIGUSR1 signals arrive before the next loop turn, one
pipelock restart is enough because each restart rereads the latest
pipelock.yaml from disk. This treats SIGUSR1 as "make pipelock reflect the
current config" rather than "run exactly one restart per signal."
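The deferral and coalescing described above can be sketched in a few lines. This is a minimal illustration, not the actual _Supervisor code: the class name PendingRestarts, the injected restart callable, and install_sigusr1 are hypothetical names chosen for the example, and the real request method would live on _Supervisor.

```python
import signal
from typing import Callable, List, Set


class PendingRestarts:
    """Hypothetical stand-in for the restart queue inside _Supervisor."""

    def __init__(self, restart: Callable[[str], None]) -> None:
        # In the real supervisor this would be restart_daemon (assumption).
        self._restart = restart
        # A set coalesces repeated requests for the same daemon name.
        self._pending: Set[str] = set()

    def request_restart(self, daemon_name: str) -> None:
        # Cheap enough for a signal handler: one set insertion, no blocking.
        self._pending.add(daemon_name)

    def tick(self) -> List[str]:
        # Swap instead of copy-then-clear, so a request that arrives while
        # draining lands in either the old or the new set and is never lost.
        drained, self._pending = self._pending, set()
        names = sorted(drained)
        for name in names:
            self._restart(name)  # blocking work runs in the main loop
        return names


def install_sigusr1(queue: PendingRestarts) -> None:
    # signal.signal must be called from the main thread.
    signal.signal(
        signal.SIGUSR1,
        lambda signum, frame: queue.request_restart("pipelock"),
    )
```

With this shape, three SIGUSR1 deliveries before the next loop turn leave a single "pipelock" entry in the set, so the following tick() performs exactly one restart.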
Shutdown wins over restart. If SIGTERM/SIGINT is received while a restart is
pending, the supervisor should drop the pending restart and terminate live
children. If shutdown starts while restart_daemon is already executing in the
main loop, the existing restart operation may finish, but no additional queued
restart should start after shutdown state is set. A simpler implementation may
check shutdown only before each queued restart, because signal handlers execute
between bytecodes and cannot interrupt a single blocking wait() until control
returns to Python.
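One way to express "shutdown wins" is to drop pending work when shutdown is requested and to re-check shutdown state before each queued restart. The sketch below assumes a hypothetical ShutdownAwareQueue class with a shutdown_reason attribute and a drain method; none of these names come from sidecar_init.py.

```python
from typing import Callable, Optional, Set


class ShutdownAwareQueue:
    """Hypothetical sketch of the shutdown-over-restart rule."""

    def __init__(self) -> None:
        self.shutdown_reason: Optional[str] = None
        self._pending: Set[str] = set()

    def request_shutdown(self, reason: str) -> None:
        # Idempotent: the first reason is kept on repeated calls.
        if self.shutdown_reason is None:
            self.shutdown_reason = reason
        # Any restart still waiting is dropped.
        self._pending.clear()

    def request_restart(self, daemon_name: str) -> None:
        # Requests that arrive after shutdown has started are ignored.
        if self.shutdown_reason is None:
            self._pending.add(daemon_name)

    def drain(self, restart: Callable[[str], None]) -> None:
        drained, self._pending = self._pending, set()
        for name in sorted(drained):
            # Checked before each restart only: a restart that is already
            # running is allowed to finish.
            if self.shutdown_reason is not None:
                return
            restart(name)
```

The check sits before each restart rather than inside it, which matches the simpler option described above: an in-flight restart completes, and nothing new starts once shutdown state is set.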
Exit-code behavior should be documented as "positive failures win, otherwise return zero." Positive process failures remain visible, while a clean shutdown of only zero-exit or signal-terminated children returns zero instead of leaking platform-specific negative signal return codes to the container exit status.
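The rule can be written as a small pure function, which also makes it easy to test in isolation. This is a sketch under assumptions: aggregate_exit_code is a hypothetical name, the inputs follow subprocess.Popen.returncode conventions, and picking the largest positive code when several children failed is an assumption carried over from the current maximum-based behavior.

```python
from typing import Iterable, Optional


def aggregate_exit_code(returncodes: Iterable[Optional[int]]) -> int:
    """Positive failures win; otherwise return zero.

    Inputs follow subprocess.Popen.returncode: None means the child has
    not exited, and a negative value -N means it was killed by signal N.
    """
    positives = [rc for rc in returncodes if rc is not None and rc > 0]
    # Assumption: with several positive codes, report the largest one.
    return max(positives, default=0)


# Clean exit, signal-only shutdown, and crash-then-shutdown respectively.
assert aggregate_exit_code([0, 0, 0]) == 0
assert aggregate_exit_code([-15, -15, 0]) == 0
assert aggregate_exit_code([3, -15, 0]) == 3
```

Filtering to positive codes before taking the maximum is what keeps a shutdown where every child was signal-terminated from surfacing a negative value as the container exit status.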
Implementation Chunks
- Add characterization tests for SIGUSR1 queuing, repeated restart coalescing, shutdown dropping pending restarts, and crash-then-shutdown exit codes.
- Add a pending restart request structure to _Supervisor and a non-blocking request method.
- Change the SIGUSR1 handler in main() to enqueue the pipelock restart instead of calling restart_daemon directly.
- Drain pending restarts from tick() with shutdown checks and documented ordering.
- Update docstrings and comments around signal handling and exit_code().
Testing Strategy
Run the existing sidecar unit tests:
python3 -m unittest tests.unit.test_sidecar_init
Add focused unit tests that avoid process-wide signal handler races where
possible by driving _Supervisor directly. End-to-end signal tests can remain
limited to main() behavior that cannot be exercised otherwise.
Also run the full unit suite before merge:
python3 -m unittest discover -s tests/unit
Open Questions
None.