fix(sidecars): child death no longer tears down the bundle

Reverses chunk 1's "any unexpected child death tears down the rest" policy.

New behavior: a daemon dying is logged but does NOT initiate shutdown. The surviving daemons keep running, and whatever the dead one served starts failing visibly on the agent side. The supervisor exits only when (a) it receives SIGTERM/SIGINT, or (b) every child has died on its own.

Eventual design is restart-the-dead-daemon plus a notification to the supervise sidecar so the operator sees the event explicitly; this commit ships only the "log and leave alone" half. PRD 0024 open question 1 updated to reflect the new intent.

Tests updated: replaced "crash propagates exit code via auto-teardown" with three cases that exercise the new policy:

- crash without shutdown leaves survivors up
- crash-then-signal surfaces the nonzero code
- all-children-die-unattended still converges the loop

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
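For readers who want the policy in concrete form, here is a minimal Python sketch of the loop the message describes. It is not the code in this commit: `supervise`, `bundle_exit_code`, and `stop_requested` are hypothetical names, children are assumed to be `subprocess.Popen` objects on a POSIX host (where a signal-killed child reports a negative `returncode`), and flooring the result at 0 is an assumption made so that a clean signal-driven shutdown reports 0, as the PRD text states.

```python
import logging
import subprocess
import sys
import time

log = logging.getLogger("supervisor-sketch")


def bundle_exit_code(returncodes):
    """Worst nonzero code wins; the floor of 0 is an assumption so that
    signal-killed children (negative returncode) count as a clean exit."""
    return max([0, *returncodes])


def supervise(children, stop_requested, poll_interval=0.05):
    """Hypothetical supervisor loop.

    children: dict mapping a daemon name to its subprocess.Popen.
    stop_requested: callable returning True once SIGTERM/SIGINT was seen.
    """
    reported = set()
    while True:
        # Log each death exactly once; do NOT touch the survivors.
        for name, proc in children.items():
            if name not in reported and proc.poll() is not None:
                reported.add(name)
                log.warning("daemon %s exited with %s; survivors left running",
                            name, proc.returncode)
        if len(reported) == len(children):
            break  # case (b): every child died on its own
        if stop_requested():
            # case (a): signal-driven shutdown, stop whoever is still alive
            for proc in children.values():
                if proc.poll() is None:
                    proc.terminate()
            for proc in children.values():
                proc.wait()
            break
        time.sleep(poll_interval)
    return bundle_exit_code([p.returncode for p in children.values()])
```

In a real supervisor, `stop_requested` would typically read a flag set by a `signal.signal` handler; the sketch takes it as a parameter so the loop can be exercised without sending signals to the test process.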
2026-05-27 00:19:50 -04:00
parent fa9b754d77
commit 62109a1caf
3 changed files with 117 additions and 58 deletions
@@ -390,17 +390,21 @@ rewrite.
 ## Open questions

 1. **Init failure semantics.** When one daemon crashes mid-run,
-   should the bundle exit (killing the bottle) or restart just
-   that daemon? Today, with four separate containers, docker
-   restarts the crashed one and the bottle stays up. Default
-   for this PRD: bundle exits on any child death; the bottle
-   tears down. Restart logic can land later if operators hit
-   it in practice.
-2. **Exit-code propagation.** If multiple daemons die in quick
-   succession (likely under SIGTERM), which exit code wins?
-   First-to-die is simplest. Worst-case (highest nonzero
-   exit code) gives clearest signal in logs. Default to
-   first-to-die unless an operator scenario disagrees.
+   the bundle does NOT tear down the survivors — the failure is
+   logged, the surviving daemons keep running, and whatever the
+   dead one served starts failing in a way the agent surfaces.
+   The eventual design is restart-the-dead-daemon plus a
+   notification to the supervise sidecar so the operator sees
+   the event explicitly; chunk 1 ships only the "log and leave
+   alone" half. Tear-down-the-bundle was considered and
+   rejected: one sick daemon shouldn't take the bottle offline.
+2. **Exit-code propagation.** When the supervisor finally exits
+   (signal-driven shutdown, or every child having died on its
+   own), the container exits with `max(child returncodes)` —
+   the worst nonzero code wins. On graceful shutdown every child
+   is signal-killed (negative returncode) so the max is 0; a
+   crashed-before-signal daemon's nonzero code wins and reaches
+   the operator on container exit.
 3. **Image pin policy.** Pin `claude-bottle-sidecars` by tag
   (`:latest` rebuilt per-release) or by digest written into a
   `CLAUDE_BOTTLE_SIDECAR_IMAGE` env var like the existing