fix(sidecars): child death no longer tears down the bundle
Reverses chunk 1's "any unexpected child death tears down the rest" policy. New behavior: a daemon dying is logged but does NOT initiate shutdown; the surviving daemons keep running and whatever the dead one served starts failing visibly on the agent side. The supervisor exits only when (a) it receives SIGTERM/SIGINT, or (b) every child has died on its own.

Eventual design is restart-the-dead-daemon plus a notification to the supervise sidecar so the operator sees the event explicitly; this commit ships only the "log and leave alone" half. PRD 0024 open question 1 updated to reflect the new intent.

Tests updated: replaced "crash propagates exit code via auto-teardown" with three cases that exercise the new policy:

- crash without shutdown leaves survivors up
- crash-then-signal surfaces the nonzero code
- all-children-die-unattended still converges the loop

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
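The policy in this message can be sketched as two small Python functions. This is an illustration under stated assumptions, not the supervisor's actual code: the names `note_deaths` and `should_exit`, the logger name, and the dict-of-returncodes shape are all hypothetical. A returncode of `None` is taken to mean "still running", as with `subprocess.Popen.poll()`.

```python
import logging
from typing import Dict, Optional, Set

log = logging.getLogger("supervisor-sketch")  # hypothetical logger name


def note_deaths(returncodes: Dict[str, Optional[int]], already_logged: Set[str]) -> None:
    """Log each newly dead child once. Deliberately does NOT request shutdown."""
    for name, code in returncodes.items():
        if code is not None and name not in already_logged:
            log.warning("daemon %s exited with code %s; survivors left running", name, code)
            already_logged.add(name)


def should_exit(stop_requested: bool, returncodes: Dict[str, Optional[int]]) -> bool:
    """Exit only on (a) SIGTERM/SIGINT having been received, or
    (b) every child having died on its own."""
    all_dead = all(code is not None for code in returncodes.values())
    return stop_requested or all_dead
```

In a polling loop, `note_deaths` would run on every tick and `should_exit` would decide whether the loop converges; a single crashed daemon changes neither the signal flag nor the "all dead" condition, so the survivors stay up.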
@@ -390,17 +390,21 @@ rewrite.
 ## Open questions
 
 1. **Init failure semantics.** When one daemon crashes mid-run,
-should the bundle exit (killing the bottle) or restart just
-that daemon? Today, with four separate containers, docker
-restarts the crashed one and the bottle stays up. Default
-for this PRD: bundle exits on any child death; the bottle
-tears down. Restart logic can land later if operators hit
-it in practice.
-2. **Exit-code propagation.** If multiple daemons die in quick
-succession (likely under SIGTERM), which exit code wins?
-First-to-die is simplest. Worst-case (highest nonzero
-exit code) gives clearest signal in logs. Default to
-first-to-die unless an operator scenario disagrees.
+the bundle does NOT tear down the survivors — the failure is
+logged, the surviving daemons keep running, and whatever the
+dead one served starts failing in a way the agent surfaces.
+The eventual design is restart-the-dead-daemon plus a
+notification to the supervise sidecar so the operator sees
+the event explicitly; chunk 1 ships only the "log and leave
+alone" half. Tear-down-the-bundle was considered and
+rejected: one sick daemon shouldn't take the bottle offline.
+2. **Exit-code propagation.** When the supervisor finally exits
+(signal-driven shutdown, or every child having died on its
+own), the container exits with `max(child returncodes)` —
+the worst nonzero code wins. On graceful shutdown every child
+is signal-killed (negative returncode) so the max is 0; a
+crashed-before-signal daemon's nonzero code wins and reaches
+the operator on container exit.
 3. **Image pin policy.** Pin `claude-bottle-sidecars` by tag
 (`:latest` rebuilt per-release) or by digest written into a
 `CLAUDE_BOTTLE_SIDECAR_IMAGE` env var like the existing
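The exit-code rule in the new open question 2 can be sketched as below. The function name `bundle_exit_code` is hypothetical, and the floor of 0 is an assumption: a plain `max` over all-negative returncodes would itself be negative, so the sketch adds the floor to match the stated result that a graceful shutdown yields 0. It relies on the `subprocess.Popen.returncode` convention that a child killed by signal N reports `-N` on POSIX.

```python
from typing import Iterable


def bundle_exit_code(returncodes: Iterable[int]) -> int:
    """Worst nonzero code wins; signal-killed children (negative codes) count as clean.

    The 0 floor is an assumption of this sketch, not confirmed by the diff.
    """
    # Building a list keeps max() well-defined even when there are no children.
    return max([0, *returncodes])
```

For example, three children all terminated by SIGTERM (`[-15, -15, -15]`) give 0, while a daemon that crashed with code 3 before the signal arrived (`[3, -15, -15]`) gives 3, which is the code the operator would see on container exit.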