fix(dashboard): surface launch/crash failures (#100) #102
Closes #100.
Tracking PR — root-cause writeup below; implementation to follow on this branch.
The bug
The dashboard (`./cli.py dashboard`) runs its TUI under `curses.wrapper(_main_loop)`, and `cmd_dashboard` only catches `KeyboardInterrupt`. When something fails (most often a malformed manifest / config), the failure is invisible:

- `Manifest.resolve(...)` reports config problems via `die()`, which prints `bot-bottle: error: <reason>` to stderr and raises `Die`. But that print happens while curses owns the terminal, so the text lands on the alternate screen and is wiped when curses restores the terminal on the way out. The dashboard just vanishes with exit code 1 and no reason.
- `Die` subclasses `SystemExit`, not `Exception`. So the one place that guards manifest loading, the `n` (new-agent) handler's `except Exception`, does not catch config errors either; they propagate and crash the TUI.
- The `_get_manifest()` call in `_main_loop` is unguarded, so a config typo kills the dashboard before it even renders.

Net: "dashboard crashed, don't know why", exactly the report in #100.
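The `SystemExit` point is easy to miss, so here is a minimal, standalone sketch of it. The `Die` and `die()` below are stand-ins written for illustration (assumed to behave as described above), not the project's actual `log` module, and `guarded_load` / `probe` are hypothetical names.

```python
import sys


class Die(SystemExit):
    """Stand-in for the project's Die; assumed to subclass SystemExit."""


def die(reason: str) -> None:
    # Print the reason to stderr, then raise, mirroring the described behaviour.
    print(f"bot-bottle: error: {reason}", file=sys.stderr)
    raise Die(1)


def guarded_load() -> str:
    # Mimics a handler that guards manifest loading with `except Exception`.
    try:
        die("malformed manifest")
    except Exception:
        return "caught by except Exception"
    return "loaded"


def probe() -> str:
    # SystemExit derives from BaseException, so the guard above never sees it.
    try:
        return guarded_load()
    except SystemExit as exc:
        return f"escaped as {type(exc).__name__} with code {exc.code}"


if __name__ == "__main__":
    print(probe())
```

Because `SystemExit` derives from `BaseException` rather than `Exception`, the inner guard is bypassed and the exception reaches the outer caller.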
The fix (planned)
- `log.Die` carries its message, plus a new `log.error()`, so a caller that swallowed stderr can re-surface the reason.
- `cmd_dashboard`: catch `Die` (re-print the reason now that the terminal is restored) and any other `Exception` (write a full traceback to `~/.bot-bottle/logs/dashboard-crash.log` and point the operator at it); return non-zero.
- `n`-flow guard: catch `Die` (a `SystemExit`, which `except Exception` misses today).

Plus unit tests for the new error paths.
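A rough sketch of the planned `cmd_dashboard` handling, under stated assumptions: `run_dashboard_guarded`, the `message` attribute on `Die`, and the wording of the printed messages are hypothetical, and the TUI entry point is passed in as a callable so the sketch runs without curses. In the real command the callable would presumably wrap `curses.wrapper(_main_loop)`.

```python
import sys
import traceback
from pathlib import Path
from typing import Callable


class Die(SystemExit):
    """Stand-in for log.Die that keeps its message (attribute name assumed)."""

    def __init__(self, message: str, code: int = 1) -> None:
        super().__init__(code)
        self.message = message


def run_dashboard_guarded(run_tui: Callable[[], None], crash_log: Path) -> int:
    """Run the TUI and surface any failure after the terminal is restored."""
    try:
        run_tui()
    except KeyboardInterrupt:
        return 0  # return value on Ctrl-C is an assumption
    except Die as exc:
        # The terminal is back to normal here, so the reason is visible.
        print(f"bot-bottle: error: {exc.message}", file=sys.stderr)
        return 1
    except Exception:
        # Unexpected crash: keep the full traceback and point the operator at it.
        crash_log.parent.mkdir(parents=True, exist_ok=True)
        crash_log.write_text(traceback.format_exc())
        print(f"bot-bottle: dashboard crashed; see {crash_log}", file=sys.stderr)
        return 1
    return 0
```

The `Die` branch has to come before the `Exception` branch only for readability; the two never overlap, since `Die` is not an `Exception`.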
ba7ebf2f20 to 99ec267c74
@@ -1411,1 +1459,3 @@
status_line = "warning: no bot-bottle config/agents found; new-agent picker is empty"
# A malformed manifest must not take the whole dashboard down — the
# operator may just be watching running bottles. Degrade to a
# status-line warning. Die is a SystemExit (not an Exception), so it
Is there a good reason to die when a manifest can't be read? Seems like that should be an exception.
Good call, addressed in 847baa8. `manifest.py` now raises a `ManifestError` (a plain `Exception`) for every validation failure instead of calling `die()`. The CLI dispatcher catches it and prints + exits 1 (same UX as before), and the dashboard catches it with a normal `except ManifestError` and degrades to a status-line warning, so the `Die`/`SystemExit` special-casing is gone (that awkwardness was the only reason it existed). Manifest tests now assert on `ManifestError` and its message. Full suite green (746).
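For readers skimming the thread, the shape of that change might look roughly like the sketch below. Only `ManifestError` is named in the comment above; `load_manifest`, `cli_main`, `dashboard_status_line`, the validation rule, and the message strings are all hypothetical and stand in for the real `manifest.py`, dispatcher, and dashboard code.

```python
import sys


class ManifestError(Exception):
    """Raised for any manifest validation failure (a plain Exception)."""


def load_manifest(raw: dict) -> dict:
    # Hypothetical validation rule, for illustration only.
    if "agents" not in raw:
        raise ManifestError("missing required key 'agents'")
    return raw


def cli_main(raw: dict) -> int:
    # CLI path: print the reason and return a non-zero exit status.
    try:
        load_manifest(raw)
    except ManifestError as exc:
        print(f"bot-bottle: error: {exc}", file=sys.stderr)
        return 1
    return 0


def dashboard_status_line(raw: dict) -> str:
    # Dashboard path: keep running and degrade to a status-line warning.
    try:
        load_manifest(raw)
    except ManifestError as exc:
        return f"warning: manifest could not be loaded: {exc}"
    return ""
```

With a regular exception type, each caller picks its own policy (exit or degrade), and an ordinary `except` clause is enough in both places.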