fix(sidecars): per-daemon pipelock restart keeps supervise socket alive

`apply_allowlist_change` used `docker restart <bundle>` to make pipelock reload, which bounced ALL four daemons — including supervise, whose MCP socket the agent's claude-code client had open. That dropped the connection. A second apply works because supervise has come back up by then.

Fix: per-daemon restart via SIGUSR1.

- New `_Supervisor.restart_daemon(name)` terminates one named child and spawns a replacement in place. Other daemons keep running.
- main() wires SIGUSR1 → `restart_daemon("pipelock")`. Pipelock has no in-process reload, so this is its analog of egress's SIGHUP-reload-addon path. Pipelock is the only daemon that currently needs hot-config reload via restart; if others acquire the need, add a new signal.
- `apply_allowlist_change` now `docker kill --signal USR1 <bundle>` instead of `docker restart`. Supervise / egress / git-gate keep running across the apply.

Tests:

- New `_Supervisor.restart_daemon` cases: replaces in place (different pid post-restart, sibling daemon unchanged), unknown name is a no-op, restart-during-shutdown is a no-op.
- `test_pipelock_apply` rewritten to bring up the bundle image with `CLAUDE_BOTTLE_SIDECAR_DAEMONS=pipelock` so the supervisor is PID 1 and handles SIGUSR1. The previous standalone-pipelock setup wouldn't survive SIGUSR1 (pipelock default disposition is terminate). Test builds the bundle image in setUpClass (cached layers make repeat runs fast).

531 tests passing locally (unit + integration).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
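
For readers without the supervisor source at hand, the per-daemon restart described above can be sketched roughly as follows. This is a minimal illustration and not the project's implementation: `SketchSupervisor`, `DaemonSpec`, `on_sigusr1`, `service_pending`, and `install_sigusr1` are hypothetical names. Only the behaviour named in the commit message and tests is mirrored: one child is replaced in place, an unknown name is a no-op, and a restart during shutdown is refused. The sketch assumes the signal handler only records the request and the main loop performs the restart; the real `main()` may wire it differently.

```python
import signal
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class DaemonSpec:
    """Hypothetical stand-in for the project's _DaemonSpec."""
    name: str
    argv: tuple


class SketchSupervisor:
    """Hypothetical stand-in for _Supervisor; illustrative only."""

    def __init__(self, specs):
        self.specs = list(specs)
        self.procs = []            # [spec, Popen] pairs, in spec order
        self.shutting_down = False
        self.pending = None        # daemon name queued by the signal handler

    def start_all(self):
        for spec in self.specs:
            self.procs.append([spec, subprocess.Popen(spec.argv)])

    def restart_daemon(self, name, grace=5.0):
        # Never respawn a daemon once teardown has begun.
        if self.shutting_down:
            return False
        for slot in self.procs:
            spec, proc = slot
            if spec.name != name:
                continue
            proc.terminate()                    # polite SIGTERM first
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()                     # escalate after the grace period
                proc.wait()
            slot[1] = subprocess.Popen(spec.argv)  # same slot, fresh PID
            return True
        return False                            # unknown name: no-op

    def on_sigusr1(self, signum=None, frame=None):
        # Assumption: the handler only queues the request.
        self.pending = "pipelock"

    def service_pending(self):
        # Called from the main loop; performs any queued restart.
        name, self.pending = self.pending, None
        return self.restart_daemon(name) if name else False

    def request_shutdown(self, reason=""):
        self.shutting_down = True
        for _spec, proc in self.procs:
            if proc.poll() is None:
                proc.terminate()
        for _spec, proc in self.procs:
            proc.wait()


def install_sigusr1(sup):
    # Must be called from the main thread (a Python signal-module rule).
    signal.signal(signal.SIGUSR1, sup.on_sigusr1)
```

Keeping the handler down to a single assignment avoids running `terminate()` and `wait()` inside signal context. The sibling daemons are never touched, which is the property that keeps the supervise socket open across an apply.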
2026-05-27 02:12:37 -04:00
parent c48f791d7d
commit 5b9ceaaaee
4 changed files with 140 additions and 42 deletions
@@ -227,6 +227,53 @@ class TestSupervisor(unittest.TestCase):
        sup.request_shutdown(reason="cleanup")
        self._drive(sup)

+    def test_restart_daemon_replaces_in_place(self):
+        # pipelock_apply.py sends SIGUSR1 to the bundle, supervisor
+        # restarts the pipelock daemon, supervise (the other
+        # daemon's MCP server in production) stays up.
+        specs = [
+            _DaemonSpec("pipelock", ("/bin/sleep", "30")),
+            _DaemonSpec("supervise", ("/bin/sleep", "30")),
+        ]
+        sup = _Supervisor(specs)
+        sup.start_all()
+        time.sleep(0.1)
+        old_pipelock_pid = sup.procs[0][1].pid
+        supervise_pid = sup.procs[1][1].pid
+
+        ok = sup.restart_daemon("pipelock", grace=2.0)
+        self.assertTrue(ok)
+
+        # Pipelock got a fresh PID — different process.
+        new_pipelock_pid = sup.procs[0][1].pid
+        self.assertNotEqual(old_pipelock_pid, new_pipelock_pid)
+        # Supervise's PID is unchanged — it was NOT restarted.
+        self.assertEqual(supervise_pid, sup.procs[1][1].pid)
+        self.assertIsNone(sup.procs[1][1].poll(),
+                          "supervise should still be running")
+
+        sup.request_shutdown(reason="cleanup")
+        self._drive(sup)
+
+    def test_restart_unknown_daemon_no_op(self):
+        specs = [_DaemonSpec("a", ("/bin/sleep", "30"))]
+        sup = _Supervisor(specs)
+        sup.start_all()
+        ok = sup.restart_daemon("ghost")
+        self.assertFalse(ok)
+        sup.request_shutdown(reason="cleanup")
+        self._drive(sup)
+
+    def test_restart_during_shutdown_is_no_op(self):
+        specs = [_DaemonSpec("pipelock", ("/bin/sleep", "30"))]
+        sup = _Supervisor(specs)
+        sup.start_all()
+        sup.request_shutdown(reason="test")
+        ok = sup.restart_daemon("pipelock")
+        self.assertFalse(ok,
+                         "must not respawn a daemon during teardown")
+        self._drive(sup)
+
    def test_shutdown_after_start_terminates_children(self):
        # Two long-running children. Caller requests shutdown;
        # both should receive SIGTERM and exit. exit_code() is