bot-bottle/docs/prds/0034-sidecar-restart-shutdown-semantics.md

# PRD 0034: Sidecar Restart and Shutdown Semantics

- **Status:** Active
- **Author:** didericis-codex
- **Created:** 2026-06-02
- **Issue:** #126

## Summary

Make the sidecar bundle supervisor's signal, restart, and exit-code behavior
explicit and easier to reason about. In particular, move pipelock restart work
out of the SIGUSR1 handler itself while preserving the caller-visible
`docker kill --signal USR1 <bundle>` contract used by pipelock apply flows.

## Problem

`bot_bottle/sidecar_init.py` is PID 1 for the bundled sidecar container. It
starts egress, pipelock, git-gate/git-http, and supervise; forwards shutdown
signals; forwards SIGHUP to egress; and restarts pipelock on SIGUSR1 after
allowlist changes.

The current SIGUSR1 handler calls `sup.restart_daemon("pipelock")` directly.
That restart path can terminate a child, wait up to a grace timeout, SIGKILL a
stubborn child, spawn a replacement with `subprocess.Popen`, and start a new
pump thread. In CPython, signal handlers run between bytecodes in the main
thread, so this is not the same as running async-signal-unsafe C code in a
POSIX handler. It still means a single signal can block the supervisor loop for
the whole restart grace window, and it makes stacked signals harder to reason
about.

The exit-code contract is also easy to misread. `_Supervisor.exit_code()`
returns the maximum child return code. That preserves a positive crash code
when a child failed before graceful shutdown, but the docstring currently
frames graceful shutdown as returning zero because signal-killed children have
negative return codes. The implementation is reasonable; the contract needs to
be deliberate and tested around crash-then-shutdown behavior.
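
The return-code facts this paragraph relies on can be checked directly. The
sketch below is illustrative only and is not taken from `sidecar_init.py`:
`child_returncodes` is a hypothetical helper that spawns two throwaway Python
children on a POSIX host, one that exits with status 3 and one that is stopped
with SIGTERM, and returns their `Popen.returncode` values.

```python
import signal
import subprocess
import sys


def child_returncodes():
    """Return (crash_rc, terminated_rc) for two short-lived children (POSIX)."""
    # Child 1 exits on its own with a positive status, like a daemon crash.
    crashed = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    crashed.wait()

    # Child 2 announces readiness, then sleeps until it is signalled.
    sleeper = subprocess.Popen(
        [sys.executable, "-c",
         "print('ready', flush=True); import time; time.sleep(60)"],
        stdout=subprocess.PIPE,
    )
    sleeper.stdout.readline()            # block until the child is running
    sleeper.send_signal(signal.SIGTERM)  # default action terminates the child
    sleeper.wait()
    sleeper.stdout.close()
    return crashed.returncode, sleeper.returncode
```

On POSIX the terminated child reports the negated signal number, so a plain
`max()` across both children surfaces the crash code, while a `max()` across
only signal-terminated children stays negative.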

## Goals / Success Criteria

- Preserve the external signal contract:
  - SIGTERM/SIGINT requests bundle shutdown.
  - SIGHUP still forwards to the live egress child.
  - SIGUSR1 still requests an in-place pipelock restart.
- Keep signal handlers small: handlers should record intent and return, not
  perform blocking subprocess lifecycle work directly.
- Process queued restart requests from the supervisor's main loop so restart
  behavior is serialized with `tick()` and shutdown state.
- Avoid respawning children after shutdown has started.
- Coalesce or serialize repeated pipelock restart requests in a documented way
  so stacked SIGUSR1 delivery cannot overlap restarts.
- Clarify and test aggregate exit-code semantics:
  - Clean unattended exits return zero when every child exits zero.
  - Signal-only shutdown does not invent a positive failure code.
  - A positive child crash before shutdown remains visible on supervisor exit.
- Keep the implementation stdlib-only.

## Non-goals

- No new process supervisor dependency such as supervisord, s6, or runit.
- No automatic restart policy for arbitrary unexpected child death.
- No changes to the bundle's daemon set, daemon argv, env filtering, or Docker
  compose contract.
- No changes to egress route reload semantics beyond preserving SIGHUP
  forwarding.
- No user-facing CLI changes.

## Scope

In scope:

- Internal signal handling and supervisor event-loop structure in
  `bot_bottle/sidecar_init.py`.
- Tests in `tests/unit/test_sidecar_init.py` for queued restart behavior,
  shutdown/restart ordering, repeated restart requests, and exit-code
  semantics.
- Docstring/comment updates that describe the concrete signal and exit-code
  contracts.

Out of scope:

- Changing pipelock itself to reload config in process.
- Restarting egress, git-gate, git-http, or supervise on demand.
- Reporting restart events to the supervise MCP plane.
- Changing the interim policy that unexpected child death leaves surviving
  daemons running.

## Design

Keep `_Supervisor` as the owner of child process state, but add an explicit
pending-action boundary between signal delivery and subprocess lifecycle work.
The exact API can be small, for example:

- `request_shutdown(reason)` keeps its existing idempotent behavior.
- `request_restart(daemon_name)` records a pending restart request unless
  shutdown is already in progress.
- `tick()` drains pending restart work either before or after child-death
  logging; whichever order is chosen must be documented.

The SIGUSR1 handler should call only the non-blocking request method. The main
loop should continue to call `tick()` and sleep on `_POLL_INTERVAL`; `tick()`
then performs the actual `restart_daemon("pipelock")` work while normal Python
control flow is in the supervisor loop.

Repeated restart requests should not overlap. Restart requests coalesce by
daemon name: if three SIGUSR1 signals arrive before the next loop turn, one
pipelock restart is enough because each restart rereads the latest
`pipelock.yaml` from disk. This treats SIGUSR1 as "make pipelock reflect the
current config" rather than "run exactly one restart per signal."
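
The handler/loop split and the coalescing rule can be sketched together. This
is a minimal illustration under stated assumptions, not the actual
`_Supervisor`: `SketchSupervisor`, `make_usr1_handler`, and `run` are
hypothetical names, the `restart_daemon` callable is injected so nothing is
spawned, and the `_POLL_INTERVAL` value is made up for the sketch.

```python
import signal
import time

_POLL_INTERVAL = 0.05  # assumed value for the sketch only


class SketchSupervisor:
    """Hypothetical stand-in for `_Supervisor`: request/drain boundary only."""

    def __init__(self, restart_daemon):
        self._restart_daemon = restart_daemon  # injected; nothing is spawned
        self._pending = {}  # dict keeps insertion order and coalesces by name
        self.shutting_down = False

    def request_restart(self, daemon_name):
        # Handler-safe: a single dict store, no waiting, no subprocess calls.
        self._pending[daemon_name] = True

    def tick(self):
        # Main-loop side: swap the queue out first so a signal that lands
        # during a restart is kept for the next turn instead of being lost.
        pending, self._pending = self._pending, {}
        for name in pending:
            self._restart_daemon(name)
        return list(pending)


def make_usr1_handler(sup):
    def _on_usr1(signum, frame):
        sup.request_restart("pipelock")
    return _on_usr1


def run(sup):
    # POSIX-only wiring; SIGUSR1 is unavailable on Windows.
    signal.signal(signal.SIGUSR1, make_usr1_handler(sup))
    while not sup.shutting_down:
        sup.tick()
        time.sleep(_POLL_INTERVAL)
```

Keying the pending structure by daemon name is what makes three stacked
SIGUSR1 deliveries collapse into one restart on the next `tick()`.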

Shutdown wins over restart. If SIGTERM/SIGINT is received while a restart is
pending, the supervisor should drop the pending restart and terminate live
children. If shutdown starts while `restart_daemon` is already executing in the
main loop, the existing restart operation may finish, but no additional queued
restart should start after shutdown state is set. A simpler implementation may
check shutdown only before each queued restart: Python-level handlers run
between bytecodes in the main thread, so a handler that only records intent
does not abort a `restart_daemon` call that is already in progress.
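
One way to express "shutdown wins" is to let the shutdown handler set a flag
and have the drain step check that flag before every queued restart. The sketch
below assumes a hypothetical `SketchSupervisor` with an injected
`restart_daemon` callable; the attribute names and return values are
illustrative, not the existing `_Supervisor` API.

```python
class SketchSupervisor:
    """Hypothetical stand-in for `_Supervisor`: shutdown/restart ordering only."""

    def __init__(self, restart_daemon):
        self._restart_daemon = restart_daemon  # injected; nothing is spawned
        self._pending = []
        self.shutdown_reason = None

    def request_shutdown(self, reason):
        # Idempotent: the first reason wins, later calls are no-ops.
        if self.shutdown_reason is None:
            self.shutdown_reason = reason

    def request_restart(self, daemon_name):
        # Refuse new restart work once shutdown has started.
        if self.shutdown_reason is not None:
            return False
        if daemon_name not in self._pending:
            self._pending.append(daemon_name)
        return True

    def tick(self):
        restarted = []
        while self._pending:
            # Checked before each restart: a shutdown recorded while the
            # previous restart was running stops the remaining ones.
            if self.shutdown_reason is not None:
                self._pending.clear()
                break
            name = self._pending.pop(0)
            self._restart_daemon(name)
            restarted.append(name)
        return restarted
```

A restart that is already executing is allowed to finish; only restarts that
have not started yet are dropped.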

Exit-code behavior should be documented as "positive failures win, otherwise
return zero." Positive process failures remain visible, while a clean shutdown
of only zero-exit or signal-terminated children returns zero instead of leaking
platform-specific negative signal return codes to the container exit status.
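
The "positive failures win, otherwise return zero" rule fits in a few lines.
This is a sketch of the documented contract rather than the current
`exit_code()` body; `aggregate_exit_code` is a hypothetical name, and treating
`None` (a child that has not exited) as "no failure" is an assumption.

```python
def aggregate_exit_code(returncodes):
    """Positive failures win; otherwise return zero.

    `returncodes` holds `Popen.returncode`-style values: 0 for a clean exit,
    a positive status for a failure, a negative value -N when the child was
    terminated by signal N (POSIX), or None for a child that has not exited
    (assumed here to count as "no failure").
    """
    positive = [rc for rc in returncodes if rc is not None and rc > 0]
    return max(positive, default=0)
```

For example, `aggregate_exit_code([3, -15, 0])` keeps the crash code 3, while
`aggregate_exit_code([-15, -15])` returns 0 instead of a negative signal code.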

## Implementation Chunks

1. Add characterization tests for SIGUSR1 queuing, repeated restart coalescing,
   shutdown dropping pending restarts, and crash-then-shutdown exit codes.
2. Add a pending restart request structure to `_Supervisor` and a
   non-blocking request method.
3. Change the SIGUSR1 handler in `main()` to enqueue the pipelock restart
   instead of calling `restart_daemon` directly.
4. Drain pending restarts from `tick()` with shutdown checks and documented
   ordering.
5. Update docstrings and comments around signal handling and `exit_code()`.

## Testing Strategy

Run the existing sidecar unit tests:

- `python3 -m unittest tests.unit.test_sidecar_init`

Add focused unit tests that avoid process-wide signal handler races where
possible by driving `_Supervisor` directly. End-to-end signal tests can remain
limited to `main()` behavior that cannot be exercised otherwise.

Also run the full unit suite before merge:

- `python3 -m unittest discover -s tests/unit`

## Open Questions

None.