fix(sidecars): child death no longer tears down the bundle
test / unit (pull_request) Successful in 20s
test / integration (pull_request) Successful in 1m8s

Reverses chunk 1's "any unexpected child death tears down the
rest" policy. New behavior: a daemon dying is logged but does
NOT initiate shutdown — the surviving daemons keep running and
whatever the dead one served starts failing visibly on the
agent side. The supervisor exits only when (a) it receives
SIGTERM/SIGINT, or (b) every child has died on its own.
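
A minimal sketch of that exit rule, for illustration only. The
names `Child`, `note_death` and `should_exit` are hypothetical and
are not taken from the supervisor's actual code; the sketch only
assumes that a child's returncode stays `None` while it is running.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Child:
    """Hypothetical record for one supervised daemon."""
    name: str
    returncode: Optional[int] = None  # None while the daemon is still running


def note_death(child: Child, returncode: int, log: List[str]) -> None:
    """Record a child death: log it and leave the survivors alone."""
    child.returncode = returncode
    log.append(f"{child.name} exited with {returncode}; survivors keep running")


def should_exit(children: List[Child], signal_received: bool) -> bool:
    """Exit only on (a) SIGTERM/SIGINT or (b) every child having died."""
    if signal_received:
        return True
    return all(c.returncode is not None for c in children)
```

With two children, one death leaves `should_exit(children, False)`
false; it turns true only once a signal arrives or the second child
has also died.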

Eventual design is restart-the-dead-daemon plus a notification
to the supervise sidecar so the operator sees the event
explicitly; this commit ships only the "log and leave alone"
half. PRD 0024 open question 1 updated to reflect the new
intent.

Tests updated: replaced "crash propagates exit code via
auto-teardown" with three cases that exercise the new policy
(crash without shutdown leaves survivors up, crash-then-signal
surfaces the nonzero code, all-children-die-unattended still
converges the loop).
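
The exit code those cases check can be sketched as follows. This
assumes the Python `subprocess` convention that a child killed by
signal N reports returncode -N, and it assumes the result is
clamped at 0, which is one reading of the PRD's "so the max is 0"
(a literal `max` over all-negative codes would be negative). The
function name `bundle_exit_code` is hypothetical.

```python
from typing import Iterable


def bundle_exit_code(returncodes: Iterable[int]) -> int:
    """Worst nonzero child code wins; signal-killed children count as 0.

    Assumption: negative returncodes (killed by a signal) are clamped
    to 0 so that a graceful shutdown exits cleanly.
    """
    return max([0, *returncodes])


# Graceful shutdown: every child was killed by SIGTERM (returncode -15).
graceful = bundle_exit_code([-15, -15, -15])

# Crash-then-signal: one daemon exited 3 before the signal arrived.
crash_then_signal = bundle_exit_code([3, -15, -15])

# All children die unattended with their own codes.
all_died = bundle_exit_code([1, 2, 0])
```

Under these assumptions the three values are 0, 3 and 2.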

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 00:19:50 -04:00
parent fa9b754d77
commit 62109a1caf
3 changed files with 117 additions and 58 deletions
+15 -11
@@ -390,17 +390,21 @@ rewrite.
 ## Open questions
 1. **Init failure semantics.** When one daemon crashes mid-run,
-should the bundle exit (killing the bottle) or restart just
-that daemon? Today, with four separate containers, docker
-restarts the crashed one and the bottle stays up. Default
-for this PRD: bundle exits on any child death; the bottle
-tears down. Restart logic can land later if operators hit
-it in practice.
-2. **Exit-code propagation.** If multiple daemons die in quick
-succession (likely under SIGTERM), which exit code wins?
-First-to-die is simplest. Worst-case (highest nonzero
-exit code) gives clearest signal in logs. Default to
-first-to-die unless an operator scenario disagrees.
+the bundle does NOT tear down the survivors — the failure is
+logged, the surviving daemons keep running, and whatever the
+dead one served starts failing in a way the agent surfaces.
+The eventual design is restart-the-dead-daemon plus a
+notification to the supervise sidecar so the operator sees
+the event explicitly; chunk 1 ships only the "log and leave
+alone" half. Tear-down-the-bundle was considered and
+rejected: one sick daemon shouldn't take the bottle offline.
+2. **Exit-code propagation.** When the supervisor finally exits
+(signal-driven shutdown, or every child having died on its
+own), the container exits with `max(child returncodes)` —
+the worst nonzero code wins. On graceful shutdown every child
+is signal-killed (negative returncode) so the max is 0; a
+crashed-before-signal daemon's nonzero code wins and reaches
+the operator on container exit.
 3. **Image pin policy.** Pin `claude-bottle-sidecars` by tag
 (`:latest` rebuilt per-release) or by digest written into a
 `CLAUDE_BOTTLE_SIDECAR_IMAGE` env var like the existing