From 26322bdfd58ea5d12313adaeea7ccade2ef294fd Mon Sep 17 00:00:00 2001 From: didericis Date: Tue, 26 May 2026 03:10:26 -0400 Subject: [PATCH] docs(prd-0020): record answers to open questions, switch to no-teardown-on-quit --- .../0020-start-and-attach-from-dashboard.md | 248 +++++++++++------- 1 file changed, 154 insertions(+), 94 deletions(-) diff --git a/docs/prds/0020-start-and-attach-from-dashboard.md b/docs/prds/0020-start-and-attach-from-dashboard.md index 251cde0..235f201 100644 --- a/docs/prds/0020-start-and-attach-from-dashboard.md +++ b/docs/prds/0020-start-and-attach-from-dashboard.md @@ -21,11 +21,12 @@ shape from `docs/research/claude-code-pane-in-dashboard.md`). When the operator exits claude, the dashboard re-renders with the now-running bottle visible in the agents pane. -Crucially, the bottle's lifetime is owned by the *dashboard -process*, not by the individual claude session. Exit claude → -back to dashboard, bottle still running. Start another agent → -two bottles up at once. Quit the dashboard → all dashboard- -launched bottles tear down. +Crucially, the bottle's lifetime is decoupled from both the +claude session AND the dashboard process. Exit claude → back to +dashboard, bottle still running. Start another agent → two +bottles up at once. Quit the dashboard → bottles continue +running. Teardown is **always explicit**: the operator presses +`x` on an agent, or runs `./cli.py cleanup` later. ## Problem @@ -75,10 +76,12 @@ captures full-merged logs per bottle (PRD 0018). It already 6. Pressing `x` (or similar — keybinding decided in design) on a selected agent stops just that bottle (compose down + state cleanup) without quitting the dashboard. -7. Quitting the dashboard (`q`) tears down every bottle the - dashboard started, unless something has explicitly preserved - the state (capability-block, crash). Matches today's - start.py teardown semantics. +7. Quitting the dashboard (`q`) leaves every running bottle + running. Bottle teardown is always explicit (per-bottle `x` + or `./cli.py cleanup`). The next `./cli.py dashboard` + invocation re-discovers them via `list_active_slugs()` and + surfaces re-attach for any it can reconstruct context for + (see "Cross-dashboard re-attach" below). ## Non-goals @@ -91,10 +94,18 @@ captures full-merged logs per bottle (PRD 0018). It already the dashboard treats them as read-only-watch (already does today). Re-attach only applies to bottles the *current dashboard process* started. -- **Persisting a "bottle pool" across dashboard runs.** When - the dashboard quits, its bottles go. Resume across dashboard - invocations is `./cli.py resume `, which is - unchanged. +- **Resurrecting an out-of-process bottle into a new dashboard + with full re-attach.** A bottle started by `./cli.py start` + in another terminal — or by a previous dashboard run, now + exited — appears in the agents pane (already does, PRD 0019) + and can be re-attached via `docker exec -it claude` because + the agent container is still running `sleep infinity`. That's + in scope. What's *out* is anything that requires the launch- + context object to drive teardown — e.g., the + ExitStack-tracked CA + state cleanup `_settle_state` performs + today. Cross-dashboard re-attach uses the existing + `./cli.py cleanup` for teardown, not an `x` keypress (see + open questions). - **Multi-window UI.** Single curses window, two existing panes (proposals + agents); the agent picker is a modal, not a third pane. @@ -147,31 +158,69 @@ Today's flow: # context exits → compose down → state cleanup ``` -The proposed dashboard-owned flow: +The proposed dashboard-driven flow: ``` ./cli.py dashboard - └─ stack = ExitStack() - bottles: dict[str, DockerBottle] = {} + └─ bottles: dict[str, tuple[ContextManager, DockerBottle]] = {} # operator presses `n`, picks agent - ctx = backend.launch(plan) - bottle = stack.enter_context(ctx) ← bottle stays alive - bottles[plan.slug] = bottle + cm = backend.launch(plan) + bottle = cm.__enter__() ← enter but don't bind to a `with` + bottles[plan.slug] = (cm, bottle) # operator interacts via: curses.endwin() bottle.exec_claude([...], tty=True) ← blocks; returns on Ctrl-D stdscr.refresh() - # bottle is STILL ALIVE here — only the claude process exited + # bottle is STILL ALIVE — only the claude process exited - # ... operator does other things, eventually `q`: - stack.close() ← tears down every bottle + # ... operator presses `x` on selected agent: + cm, _ = bottles.pop(slug) + cm.__exit__(None, None, None) ← tears down just that one + + # ... operator presses `q`: + return # bottles dict still populated; no teardown ``` -The shift is one line of code semantically but the change in -operator experience is real: bottles outlive any single claude -session. +Two shifts: + +1. Bottles outlive any single claude session — the dashboard + manages enter/exit per bottle, not per attach. Exit claude + → still in the dashboard with the bottle running. +2. Bottles outlive the dashboard process itself. Quitting the + dashboard does NOT close the context managers; the docker + compose project keeps running with the agent container in + `sleep infinity`. A subsequent dashboard invocation + re-discovers it via `docker compose ls` (PRD 0019's + `list_active_slugs`) and surfaces re-attach. + + The trade-off: state cleanup that today runs in + `_settle_state` (transcript snapshot, preserve-marker + evaluation, state-dir reap) doesn't fire on a quit-while- + running bottle. It DOES fire when the operator explicitly + stops via `x`, because that calls `cm.__exit__`. For + bottles a previous dashboard quit on, `./cli.py cleanup` + is the path — its compose-down + state-reap logic + already covers the case. + +### Cross-dashboard re-attach + +When the dashboard discovers a bottle in `discover_active_agents` +that it didn't itself start (a previous-dashboard or external +`./cli.py start` bottle), Enter still attaches via `docker exec +-it … claude` — the agent container is running `sleep infinity` +exactly the same way regardless of who started it. The only +thing the current dashboard lacks for those bottles is the +launch-context object needed to drive a clean teardown via +`x`. + +For v1 we surface this honestly: pressing `x` on a non-owned +agent shows a status hint pointing at `./cli.py cleanup` (or +`./cli.py cleanup` targeted at the slug if we add that flag +later). The agent stays alive; the operator handles teardown +out-of-band. Enter (re-attach) works for both owned and +non-owned bottles. ### Agent picker @@ -213,40 +262,43 @@ if modal proves fiddly. ### Re-attach (Enter on agent) -Same handoff pattern the new-agent flow uses. The dashboard -already holds the `DockerBottle` for any slug it started — -`bottle.exec_claude([...], tty=True)` does the right `docker -exec -it claude …` and returns on session exit. Re-attach is -"already-running" + the same exec call; the agent picker isn't -involved. - -For agents the dashboard didn't start (read-only watch), Enter -is a no-op with a status hint ("dashboard didn't start this -bottle; resume with `./cli.py resume ` outside the -dashboard"). PRD-0019's selection model already differentiates -focus; this layer just gates the action. +Same handoff pattern the new-agent flow uses. For an agent the +dashboard started this session, the dashboard holds the +`DockerBottle` handle in its `bottles` dict and calls +`bottle.exec_claude(...)`. For an agent it discovered via +`list_active_slugs` (previous-dashboard or external start), +the dashboard synthesizes a one-shot `DockerBottle` from the +slug — container name is `claude-bottle-`, no prompt +path because the agent's claude config already has `--append- +system-prompt-file` baked in from the original launch — +and runs the same exec. Either way, Enter drops to +full-screen claude; on exit the dashboard re-renders. ### Explicit per-bottle stop -`x` on a selected dashboard-owned agent invokes -`stack.pop_callback`-style targeted teardown: take that bottle -out of the map, call its `close()` to tear down compose + state, -update the agents pane on the next refresh. Bottles the -dashboard didn't start (`x` on a read-only-watch row) → no-op -with a status hint. +`x` on a dashboard-owned agent: pop the `(cm, bottle)` from +the dict, call `cm.__exit__(None, None, None)` which drives +the existing compose-down + state-settle logic. Refresh the +agents pane. + +`x` on a non-owned agent (discovered via `list_active_slugs` +but not in `bottles` dict): no-op with status hint pointing +at `./cli.py cleanup` (the existing path that tears down +ANY claude-bottle compose project plus reaps state dirs). ### Dashboard quit -`q` (existing) calls `stack.close()` before exit; every -dashboard-launched bottle goes through its normal teardown -(`compose down` + state settle). Preserve markers (capability- -block, crash) still keep state across teardown. The dashboard -process itself returns 0. +`q` returns the dashboard process to 0 without touching any +running bottles. The `bottles` dict goes out of scope but +because the context managers' `__exit__` is never invoked, +the `docker compose` project keeps running. The next dashboard +invocation discovers the bottles via `list_active_slugs` and +surfaces re-attach. -If the operator wants to keep bottles alive past dashboard -exit, the existing path is unchanged: launch them via -`./cli.py start` in a separate terminal. That ownership stays -out-of-band. +This is a real departure from today's `./cli.py start` +semantics (which couples bottle lifetime to the process via +ExitStack). It's intentional: the dashboard is a watching + +acting surface, not a lifetime owner. ## Implementation chunks @@ -274,54 +326,62 @@ Sized for one PR each. down every bottle it started" contract in `dashboard.py`'s module docstring. +## Resolved questions + +1. **Modal vs. drop-and-resume for preflight Y/N.** Resolved: + **modal.** Render the preflight lines centered in a curses + sub-window with `[y/N]` at the bottom; capture the next + keypress. If geometry proves fiddly during implementation + we'll fall back to drop-and-resume, but modal is the target. + +2. **Agent picker: text-filter typing.** Resolved: **yes, + include filter typing.** As the operator types, the list + filters to agents whose name matches (substring, + case-insensitive). j/k still navigates within the filtered + set; Esc clears the filter on first press, exits the picker + on the second. + +3. **Container-died-during-claude handling.** Keep the design + as drafted: transcript snapshot (`snapshot_transcript`) + + `mark_preserved` if exit code is non-zero + remove from + the `bottles` dict + status line `"claude session for + [slug] ended with exit N; preserved for resume"`. The + bottle's `cm.__exit__` would normally run on stop; here it + runs as part of the death-handling (the container is + already gone, but compose-down + state-settle still + sequence the network removal + state cleanup correctly). + +4. **Double-start of the same agent.** Allowed. The picker + surfaces a `(N running)` annotation next to any agent name + that already has live bottles in this dashboard's `bottles` + dict OR in `list_active_slugs()`, so the operator sees the + running-count before picking. Selecting an already-running + agent name mints a fresh slug for the new bottle as + normal. + +5. **Quit behavior.** Resolved: **`q` does NOT tear down any + bottles.** Dashboard exit is purely a UI exit; the + bottles dict goes out of scope without invoking `__exit__`, + so the `docker compose` projects keep running. Bottle + teardown is always explicit: per-bottle `x` (for + dashboard-owned), or `./cli.py cleanup` (for everything). + ## Open questions -1. **Modal vs. drop-and-resume for preflight Y/N.** Both work; - modal is nicer if the curses geometry handling is - straightforward. Pick during chunk 2 by prototyping the - modal in ~30 lines and seeing if it looks right. - -2. **Agent picker: text-filter typing?** v1 is j/k navigation - only. If the manifest has 20+ agents the picker gets noisy; - add fzf-style filter input later if needed. - -3. **What happens if `attach_claude` exits because the - container died** (not a clean claude exit — e.g., OOM, - panic)? Today's `_settle_state` marks the bottle preserved - for non-zero exit codes. The dashboard's re-render needs to - notice the bottle is gone (compose down or container-not- - running state) and surface a status line. Probably: - transcript snapshot + mark preserved + remove from - `bottles` map + status line "claude session for [slug] - ended with exit N; preserved for resume". - -4. **Double-start of the same agent.** Allowed by design — slugs - are unique per launch — but the picker should make it clear - this is a "start a SECOND bottle" decision, not a "re-enter - the first." Probably handled by showing the running-count in - the picker row. - -5. **Should `q` confirm before tearing down N running - bottles?** A 5-bottle dashboard with 5 in-flight sessions - loses non-trivial state on accidental `q`. Probably yes: - curses modal "quit and tear down N bottles? [y/N]". Skip - confirmation when there are zero owned bottles. - 6. **Race between handoff and 1s refresh tick.** While the dashboard's `stdscr.timeout` is set, a key press fires the handoff and the dashboard sits in `docker exec` for minutes. `discover_active_agents` / `discover_pending` don't poll - during that window, which is fine — the moment we - `stdscr.refresh()` after exec returns, the next loop iter - runs discovery and the panes reflect reality. Worth calling - out in the design but no special handling needed. - -7. **Multi-bottle resource use.** Five bottles up means five - compose projects: 5×(agent + pipelock + egress optional + - git-gate optional + supervise optional) containers, plus 5×2 - networks. On a 16-GiB host this is fine; on something - smaller the operator might want a soft cap or a warning. - Out of v1; flag for follow-up if it bites. + during that window — that's harmless on its own (the moment + we `stdscr.refresh()` after exec returns, the next loop + iter runs discovery and the panes reflect reality), but + it does mean: (a) proposals queued during the claude + session won't fire any operator notification until the + handoff ends, and (b) a bottle that died mid-claude won't + be detectable until the operator exits back to the + dashboard. Not blocking v1 — flagging as a known limitation + to revisit alongside the option-2 embedded-emulator path + from the research doc. ## References