docs(prd-0020): record answers to open questions, switch to no-teardown-on-quit

2026-05-26 03:10:26 -04:00
parent ec20293c0a
commit 26322bdfd5
1 changed files with 154 additions and 94 deletions
@@ -21,11 +21,12 @@ shape from `docs/research/claude-code-pane-in-dashboard.md`).
 When the operator exits claude, the dashboard re-renders with
 the now-running bottle visible in the agents pane.

-Crucially, the bottle's lifetime is owned by the *dashboard
-process*, not by the individual claude session. Exit claude →
-back to dashboard, bottle still running. Start another agent →
-two bottles up at once. Quit the dashboard → all dashboard-
-launched bottles tear down.
+Crucially, the bottle's lifetime is decoupled from both the
+claude session AND the dashboard process. Exit claude → back to
+dashboard, bottle still running. Start another agent → two
+bottles up at once. Quit the dashboard → bottles continue
+running. Teardown is **always explicit**: the operator presses
+`x` on an agent, or runs `./cli.py cleanup` later.

 ## Problem

@@ -75,10 +76,12 @@ captures full-merged logs per bottle (PRD 0018). It already
 6. Pressing `x` (or similar — keybinding decided in design)
   on a selected agent stops just that bottle (compose down +
   state cleanup) without quitting the dashboard.
-7. Quitting the dashboard (`q`) tears down every bottle the
-   dashboard started, unless something has explicitly preserved
-   the state (capability-block, crash). Matches today's
-   start.py teardown semantics.
+7. Quitting the dashboard (`q`) leaves every running bottle
+   running. Bottle teardown is always explicit (per-bottle `x`
+   or `./cli.py cleanup`). The next `./cli.py dashboard`
+   invocation re-discovers them via `list_active_slugs()` and
+   surfaces re-attach for any it can reconstruct context for
+   (see "Cross-dashboard re-attach" below).

 ## Non-goals

@@ -91,10 +94,18 @@ captures full-merged logs per bottle (PRD 0018). It already
  the dashboard treats them as read-only-watch (already does
  today). Re-attach only applies to bottles the *current
  dashboard process* started.
- **Persisting a "bottle pool" across dashboard runs.** When
-  the dashboard quits, its bottles go. Resume across dashboard
-  invocations is `./cli.py resume <identity>`, which is
-  unchanged.
+- **Resurrecting an out-of-process bottle into a new dashboard
+  with full re-attach.** A bottle started by `./cli.py start`
+  in another terminal — or by a previous dashboard run, now
+  exited — appears in the agents pane (already does, PRD 0019)
+  and can be re-attached via `docker exec -it claude` because
+  the agent container is still running `sleep infinity`. That's
+  in scope. What's *out* is anything that requires the launch-
+  context object to drive teardown — e.g., the
+  ExitStack-tracked CA + state cleanup `_settle_state` performs
+  today. Cross-dashboard re-attach uses the existing
+  `./cli.py cleanup` for teardown, not an `x` keypress (see
+  open questions).
 - **Multi-window UI.** Single curses window, two existing
  panes (proposals + agents); the agent picker is a modal, not
  a third pane.
@@ -147,31 +158,69 @@ Today's flow:
     # context exits → compose down → state cleanup
 ```

-The proposed dashboard-owned flow:
+The proposed dashboard-driven flow:

 ```
 ./cli.py dashboard
-  └─ stack = ExitStack()
-     bottles: dict[str, DockerBottle] = {}
+  └─ bottles: dict[str, tuple[ContextManager, DockerBottle]] = {}

     # operator presses `n`, picks agent
-     ctx = backend.launch(plan)
-     bottle = stack.enter_context(ctx)           ← bottle stays alive
-     bottles[plan.slug] = bottle
+     cm = backend.launch(plan)
+     bottle = cm.__enter__()                     ← enter but don't bind to a `with`
+     bottles[plan.slug] = (cm, bottle)

     # operator interacts via:
     curses.endwin()
     bottle.exec_claude([...], tty=True)         ← blocks; returns on Ctrl-D
     stdscr.refresh()
-     # bottle is STILL ALIVE here — only the claude process exited
+     # bottle is STILL ALIVE — only the claude process exited

-     # ... operator does other things, eventually `q`:
-     stack.close()                               ← tears down every bottle
+     # ... operator presses `x` on selected agent:
+     cm, _ = bottles.pop(slug)
+     cm.__exit__(None, None, None)               ← tears down just that one
+
+     # ... operator presses `q`:
+     return  # bottles dict still populated; no teardown
 ```

-The shift is one line of code semantically but the change in
-operator experience is real: bottles outlive any single claude
-session.
+Two shifts:
+
+1. Bottles outlive any single claude session — the dashboard
+   manages enter/exit per bottle, not per attach. Exit claude
+   → still in the dashboard with the bottle running.
+2. Bottles outlive the dashboard process itself. Quitting the
+   dashboard does NOT close the context managers; the docker
+   compose project keeps running with the agent container in
+   `sleep infinity`. A subsequent dashboard invocation
+   re-discovers it via `docker compose ls` (PRD 0019's
+   `list_active_slugs`) and surfaces re-attach.
+
+   The trade-off: state cleanup that today runs in
+   `_settle_state` (transcript snapshot, preserve-marker
+   evaluation, state-dir reap) doesn't fire on a quit-while-
+   running bottle. It DOES fire when the operator explicitly
+   stops via `x`, because that calls `cm.__exit__`. For
+   bottles a previous dashboard quit on, `./cli.py cleanup`
+   is the path — its compose-down + state-reap logic
+   already covers the case.
+
+### Cross-dashboard re-attach
+
+When the dashboard discovers a bottle in `discover_active_agents`
+that it didn't itself start (a previous-dashboard or external
+`./cli.py start` bottle), Enter still attaches via `docker exec
+-it … claude` — the agent container is running `sleep infinity`
+exactly the same way regardless of who started it. The only
+thing the current dashboard lacks for those bottles is the
+launch-context object needed to drive a clean teardown via
+`x`.
+
+For v1 we surface this honestly: pressing `x` on a non-owned
+agent shows a status hint pointing at `./cli.py cleanup` (or
+`./cli.py cleanup` targeted at the slug if we add that flag
+later). The agent stays alive; the operator handles teardown
+out-of-band. Enter (re-attach) works for both owned and
+non-owned bottles.

 ### Agent picker

@@ -213,40 +262,43 @@ if modal proves fiddly.

 ### Re-attach (Enter on agent)

-Same handoff pattern the new-agent flow uses. The dashboard
-already holds the `DockerBottle` for any slug it started —
-`bottle.exec_claude([...], tty=True)` does the right `docker
-exec -it claude …` and returns on session exit. Re-attach is
-"already-running" + the same exec call; the agent picker isn't
-involved.
-
-For agents the dashboard didn't start (read-only watch), Enter
-is a no-op with a status hint ("dashboard didn't start this
-bottle; resume with `./cli.py resume <identity>` outside the
-dashboard"). PRD-0019's selection model already differentiates
-focus; this layer just gates the action.
+Same handoff pattern the new-agent flow uses. For an agent the
+dashboard started this session, the dashboard holds the
+`DockerBottle` handle in its `bottles` dict and calls
+`bottle.exec_claude(...)`. For an agent it discovered via
+`list_active_slugs` (previous-dashboard or external start),
+the dashboard synthesizes a one-shot `DockerBottle` from the
+slug — container name is `claude-bottle-<slug>`, no prompt
+path because the agent's claude config already has `--append-
+system-prompt-file` baked in from the original launch —
+and runs the same exec. Either way, Enter drops to
+full-screen claude; on exit the dashboard re-renders.

 ### Explicit per-bottle stop

-`x` on a selected dashboard-owned agent invokes
-`stack.pop_callback`-style targeted teardown: take that bottle
-out of the map, call its `close()` to tear down compose + state,
-update the agents pane on the next refresh. Bottles the
-dashboard didn't start (`x` on a read-only-watch row) → no-op
-with a status hint.
+`x` on a dashboard-owned agent: pop the `(cm, bottle)` from
+the dict, call `cm.__exit__(None, None, None)` which drives
+the existing compose-down + state-settle logic. Refresh the
+agents pane.
+
+`x` on a non-owned agent (discovered via `list_active_slugs`
+but not in `bottles` dict): no-op with status hint pointing
+at `./cli.py cleanup` (the existing path that tears down
+ANY claude-bottle compose project plus reaps state dirs).

 ### Dashboard quit

-`q` (existing) calls `stack.close()` before exit; every
-dashboard-launched bottle goes through its normal teardown
-(`compose down` + state settle). Preserve markers (capability-
-block, crash) still keep state across teardown. The dashboard
-process itself returns 0.
+`q` returns the dashboard process to 0 without touching any
+running bottles. The `bottles` dict goes out of scope but
+because the context managers' `__exit__` is never invoked,
+the `docker compose` project keeps running. The next dashboard
+invocation discovers the bottles via `list_active_slugs` and
+surfaces re-attach.

-If the operator wants to keep bottles alive past dashboard
-exit, the existing path is unchanged: launch them via
-`./cli.py start` in a separate terminal. That ownership stays
-out-of-band.
+This is a real departure from today's `./cli.py start`
+semantics (which couples bottle lifetime to the process via
+ExitStack). It's intentional: the dashboard is a watching +
+acting surface, not a lifetime owner.

 ## Implementation chunks

@@ -274,54 +326,62 @@ Sized for one PR each.
   down every bottle it started" contract in `dashboard.py`'s
   module docstring.

+## Resolved questions
+
+1. **Modal vs. drop-and-resume for preflight Y/N.** Resolved:
+   **modal.** Render the preflight lines centered in a curses
+   sub-window with `[y/N]` at the bottom; capture the next
+   keypress. If geometry proves fiddly during implementation
+   we'll fall back to drop-and-resume, but modal is the target.
+
+2. **Agent picker: text-filter typing.** Resolved: **yes,
+   include filter typing.** As the operator types, the list
+   filters to agents whose name matches (substring,
+   case-insensitive). j/k still navigates within the filtered
+   set; Esc clears the filter on first press, exits the picker
+   on the second.
+
+3. **Container-died-during-claude handling.** Keep the design
+   as drafted: transcript snapshot (`snapshot_transcript`) +
+   `mark_preserved` if exit code is non-zero + remove from
+   the `bottles` dict + status line `"claude session for
+   [slug] ended with exit N; preserved for resume"`. The
+   bottle's `cm.__exit__` would normally run on stop; here it
+   runs as part of the death-handling (the container is
+   already gone, but compose-down + state-settle still
+   sequence the network removal + state cleanup correctly).
+
+4. **Double-start of the same agent.** Allowed. The picker
+   surfaces a `(N running)` annotation next to any agent name
+   that already has live bottles in this dashboard's `bottles`
+   dict OR in `list_active_slugs()`, so the operator sees the
+   running-count before picking. Selecting an already-running
+   agent name mints a fresh slug for the new bottle as
+   normal.
+
+5. **Quit behavior.** Resolved: **`q` does NOT tear down any
+   bottles.** Dashboard exit is purely a UI exit; the
+   bottles dict goes out of scope without invoking `__exit__`,
+   so the `docker compose` projects keep running. Bottle
+   teardown is always explicit: per-bottle `x` (for
+   dashboard-owned), or `./cli.py cleanup` (for everything).
+
 ## Open questions

-1. **Modal vs. drop-and-resume for preflight Y/N.** Both work;
-   modal is nicer if the curses geometry handling is
-   straightforward. Pick during chunk 2 by prototyping the
-   modal in ~30 lines and seeing if it looks right.
-
-2. **Agent picker: text-filter typing?** v1 is j/k navigation
-   only. If the manifest has 20+ agents the picker gets noisy;
-   add fzf-style filter input later if needed.
-
-3. **What happens if `attach_claude` exits because the
-   container died** (not a clean claude exit — e.g., OOM,
-   panic)? Today's `_settle_state` marks the bottle preserved
-   for non-zero exit codes. The dashboard's re-render needs to
-   notice the bottle is gone (compose down or container-not-
-   running state) and surface a status line. Probably:
-   transcript snapshot + mark preserved + remove from
-   `bottles` map + status line "claude session for [slug]
-   ended with exit N; preserved for resume".
-
-4. **Double-start of the same agent.** Allowed by design — slugs
-   are unique per launch — but the picker should make it clear
-   this is a "start a SECOND bottle" decision, not a "re-enter
-   the first." Probably handled by showing the running-count in
-   the picker row.
-
-5. **Should `q` confirm before tearing down N running
-   bottles?** A 5-bottle dashboard with 5 in-flight sessions
-   loses non-trivial state on accidental `q`. Probably yes:
-   curses modal "quit and tear down N bottles? [y/N]". Skip
-   confirmation when there are zero owned bottles.
-
 6. **Race between handoff and 1s refresh tick.** While the
   dashboard's `stdscr.timeout` is set, a key press fires the
   handoff and the dashboard sits in `docker exec` for minutes.
   `discover_active_agents` / `discover_pending` don't poll
-   during that window, which is fine — the moment we
-   `stdscr.refresh()` after exec returns, the next loop iter
-   runs discovery and the panes reflect reality. Worth calling
-   out in the design but no special handling needed.
-
-7. **Multi-bottle resource use.** Five bottles up means five
-   compose projects: 5×(agent + pipelock + egress optional +
-   git-gate optional + supervise optional) containers, plus 5×2
-   networks. On a 16-GiB host this is fine; on something
-   smaller the operator might want a soft cap or a warning.
-   Out of v1; flag for follow-up if it bites.
+   during that window — that's harmless on its own (the moment
+   we `stdscr.refresh()` after exec returns, the next loop
+   iter runs discovery and the panes reflect reality), but
+   it does mean: (a) proposals queued during the claude
+   session won't fire any operator notification until the
+   handoff ends, and (b) a bottle that died mid-claude won't
+   be detectable until the operator exits back to the
+   dashboard. Not blocking v1 — flagging as a known limitation
+   to revisit alongside the option-2 embedded-emulator path
+   from the research doc.

 ## References