fix(dashboard): surface launch/crash failures (#100) #102

Merged
didericis merged 3 commits from fix/dashboard-crash-logging into main 2026-05-29 00:27:22 -04:00
Collaborator

Closes #100.

Tracking PR — root-cause writeup below; implementation to follow on this branch.

The bug

The dashboard (./cli.py dashboard) runs its TUI under curses.wrapper(_main_loop), and cmd_dashboard only catches KeyboardInterrupt. When something fails — most often a malformed manifest / config — the failure is invisible:

  • Manifest.resolve(...) reports config problems via die(), which prints bot-bottle: error: <reason> to stderr and raises Die. But that print happens while curses owns the terminal, so the text lands on the alternate screen and is wiped when curses restores the terminal on the way out. The dashboard just vanishes with exit code 1 and no reason.
  • Die subclasses SystemExit, not Exception. So the one place that guards manifest loading — the n (new-agent) handler's except Exception — does not catch config errors either; they propagate and crash the TUI.
  • The startup manifest probe (_get_manifest() in _main_loop) is unguarded, so a config typo kills the dashboard before it even renders.
  • For any other unexpected exception, the traceback is the only signal — and it, too, is painted over by the alternate screen.

Net: "dashboard crashed, don't know why" — exactly the report in #100.
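
The second bullet is the least obvious part, so here is a minimal sketch of it. The `Die` class below is only a stand-in for `log.Die`, and `load_manifest`, `guarded_load`, and `probe` are hypothetical names for illustration, not the project's code. It shows that an `except Exception` guard does not stop a `SystemExit` subclass:

```python
class Die(SystemExit):
    """Stand-in for log.Die: a SystemExit subclass, as the writeup describes."""


def load_manifest():
    # Hypothetical loader that fails the way die() does: by raising Die.
    raise Die(1)


def guarded_load():
    """Mimics a guard that only uses `except Exception`."""
    try:
        load_manifest()
    except Exception:
        # Never reached for Die: SystemExit derives from BaseException,
        # not from Exception.
        return "caught"
    return "ok"


def probe():
    """Shows where the Die actually ends up."""
    try:
        return guarded_load()
    except SystemExit as exc:
        return f"escaped with exit code {exc.code}"
```

Calling `probe()` reports that the error escaped the inner guard, which is the same path that takes the TUI down.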

The fix (planned)

  1. log.Die carries its message, and a new log.error() helper is added, so a caller whose stderr output was swallowed can re-surface the reason.
  2. Boundary safety-net in cmd_dashboard: catch Die (re-print the reason now that the terminal is restored) and any other Exception (write a full traceback to ~/.bot-bottle/logs/dashboard-crash.log and point the operator at it); return non-zero.
  3. Guard the startup manifest probe so a malformed config degrades to a status-line warning instead of crashing — the operator can still watch running bottles.
  4. Fix the n-flow guard to catch Die (a SystemExit, which except Exception misses today).

Plus unit tests for the new error paths.
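
Items 1 and 2 could look roughly like the sketch below. It is an illustration under assumptions, not the code on this branch: the shape of `Die`, the `run_guarded` name, its parameters, the wording of the crash message, and the choice to return 0 on Ctrl-C are all hypothetical. `run_tui` stands in for the `curses.wrapper(_main_loop)` call, which has already restored the terminal by the time an exception reaches the handlers.

```python
import sys
import traceback
from pathlib import Path


class Die(SystemExit):
    """Stand-in for log.Die that keeps its message (assumed shape)."""

    def __init__(self, message, code=1):
        super().__init__(code)
        self.message = message


def run_guarded(run_tui, crash_log, err=None):
    """Run the TUI and surface any failure once the terminal is usable again."""
    err = err if err is not None else sys.stderr
    try:
        run_tui()
    except KeyboardInterrupt:
        return 0  # assumption: Ctrl-C counts as a normal exit
    except Die as exc:
        # Re-print the reason; the original print was lost on the alternate screen.
        print(f"bot-bottle: error: {exc.message}", file=err)
        return 1
    except Exception:
        # Any other crash: keep the full traceback on disk and point at it.
        crash_log = Path(crash_log)
        crash_log.parent.mkdir(parents=True, exist_ok=True)
        crash_log.write_text(traceback.format_exc())
        print(f"bot-bottle: dashboard crashed; traceback written to {crash_log}", file=err)
        return 1
    return 0
```

Passing the log path in as a parameter keeps the sketch testable; the real handler would presumably point at `~/.bot-bottle/logs/dashboard-crash.log` as described in item 2.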

didericis added 1 commit 2026-05-28 23:49:26 -04:00
fix(dashboard): surface launch/crash failures (#100)
test / unit (pull_request) Successful in 29s
test / integration (pull_request) Successful in 43s
99ec267c74
The dashboard runs under curses.wrapper and cmd_dashboard only caught
KeyboardInterrupt, so failures vanished:
- die() prints to stderr, but under curses that lands on the alternate
  screen and is wiped on exit, so config errors gave no reason.
- Die is a SystemExit, so the new-agent flow's `except Exception` never
  caught config errors; they crashed the TUI.
- the startup manifest probe was unguarded.

Now: Die carries its message (+ log.error()); cmd_dashboard re-surfaces
a Die's reason once the terminal is restored and writes any other
crash's traceback to ~/.bot-bottle/logs/dashboard-crash.log; the startup
probe and the new-agent flow degrade a bad config to a status-line
warning instead of crashing.

Closes #100

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
didericis force-pushed fix/dashboard-crash-logging from ba7ebf2f20 to 99ec267c74 2026-05-28 23:49:26 -04:00
didericis reviewed 2026-05-28 23:51:28 -04:00
@@ -1411,1 +1459,3 @@
status_line = "warning: no bot-bottle config/agents found; new-agent picker is empty"
# A malformed manifest must not take the whole dashboard down — the
# operator may just be watching running bottles. Degrade to a
# status-line warning. Die is a SystemExit (not an Exception), so it
Owner

Is there a good reason to die when a manifest can't be read? Seems like that should be an exception.

didericis added 1 commit 2026-05-29 00:15:22 -04:00
refactor(manifest): raise ManifestError instead of die()
test / unit (pull_request) Successful in 36s
test / integration (pull_request) Successful in 59s
847baa84be
Review feedback on #102: a manifest that can't be read should raise an
exception, not call die() (a SystemExit). That SystemExit was the whole
reason the dashboard had to special-case Die.

manifest.py now raises ManifestError (a plain Exception) for every
validation failure. The CLI dispatcher catches it and prints+exits 1
(same UX as before); the dashboard catches it with a normal
`except ManifestError` and degrades to a status-line warning. Manifest
tests assert on ManifestError + its message.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Collaborator

Good call — addressed in 847baa8. manifest.py now raises a ManifestError (a plain Exception) for every validation failure instead of calling die(). The CLI dispatcher catches it and prints + exits 1 (same UX as before), and the dashboard catches it with a normal except ManifestError and degrades to a status-line warning — so the Die/SystemExit special-casing is gone (that awkwardness was the only reason it existed). Manifest tests now assert on ManifestError and its message. Full suite green (746).
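
For readers skimming the thread, the shape of that change is roughly the following. This is a hedged sketch only: `resolve_manifest`, the `agents` check, `cli_entry`, and `dashboard_status` are hypothetical stand-ins, since the real validation rules and call sites are not shown in this conversation.

```python
import sys


class ManifestError(Exception):
    """A plain Exception raised for any manifest validation failure."""


def resolve_manifest(raw):
    # Hypothetical validation rule, used only to trigger the error path.
    if "agents" not in raw:
        raise ManifestError("missing 'agents' section")
    return raw


def cli_entry(raw, err=None):
    """CLI-style caller: print the reason and report exit status 1."""
    err = err if err is not None else sys.stderr
    try:
        resolve_manifest(raw)
    except ManifestError as exc:
        print(f"bot-bottle: error: {exc}", file=err)
        return 1
    return 0


def dashboard_status(raw):
    """Dashboard-style caller: degrade to a status-line warning."""
    try:
        resolve_manifest(raw)
    except ManifestError as exc:
        return f"warning: {exc}"
    return ""
```

Because `ManifestError` is an ordinary `Exception`, each caller decides how to react, and neither needs to know about `SystemExit`.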

didericis approved these changes 2026-05-29 00:18:41 -04:00
didericis added 1 commit 2026-05-29 00:24:02 -04:00
refactor(manifest): drop dead _load_json_or_die helper
test / unit (pull_request) Successful in 32s
test / integration (pull_request) Successful in 47s
test / unit (push) Successful in 28s
test / integration (push) Successful in 41s
7b2474a5d3
It had no callers — a leftover from the pre-PRD-0011 bot-bottle.json
loader (the manifest is per-file Markdown now). Removing it also drops
the now-unused `json` import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
didericis merged commit 7b2474a5d3 into main 2026-05-29 00:27:22 -04:00
didericis deleted branch fix/dashboard-crash-logging 2026-05-29 00:27:22 -04:00