Surface the metric ADR 0004 says matters — the critical security/logic core, currently 95% — as a README badge, distinct from the informational global `coverage` badge. - scripts/critical-modules.txt: single source of truth for the core module list. scripts/coverage.sh now reads it (instead of a hardcoded string) and update-badges.yml reads the same file, so the badge and the `critical` report cannot drift. - update-badges.yml: a `core coverage` step reuses the unit-coverage data (every core module is unit-tested, so unit-only is accurate for it) and sed-updates the new badge, like the existing ones. - README: `core coverage 95%` badge linking to ADR 0004 so a reader can find out what "core" means. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
4.4 KiB
ADR 0004: Risk-weighted coverage, not a single global target
- Status: Accepted
- Date: 2026-06-25
- Deciders: didericis
Context
bot-bottle is a security tool: it sandboxes agents, scans egress for secret exfiltration, strips credentials, and gates git pushes. A latent bug in that logic is expensive, so test coverage there genuinely matters. But the repo also contains code where coverage is a poor signal:
- Interactive entry-point shells —
cli/init.py(aread_tty_line()prompt loop) andcli/tui.py(a curses picker). Their bodies are I/O; a unit test has to fake the entire terminal conversation, so it inflates the number without asserting behaviour that would otherwise go unchecked. - Subprocess / backend orchestration — the docker / smolmachines /
macos-container backends shell out to
docker,container,smolvm. Mock-heavy unit tests here mostly re-assert the argv you already wrote (the test passes whether or not the real teardown works), while many of the missed branches are failure paths you cannot provoke against a real daemon on cue.
Chasing a single global percentage (e.g. 90%) pushes the most test effort onto the least safety-relevant code — exactly backwards — and invites performative tests written to colour a line rather than to catch a regression (Goodhart's law).
Decision
Coverage is risk-weighted, measured over the combined unit + integration suites, with three rules:
-
Critical modules target ≥ 90%. The security/logic core —
egress_addon{,_core}.py,dlp_detectors.py,egress.py,manifest*.py,git_gate.py,git_http_backend.py,supervise.py,yaml_subset.py,bottle_state.py— is Docker-independent and unit-testable, so it carries the high bar. We ratchet toward 90% as these modules are touched; new gaps in them are not acceptable. -
Subprocess/backend orchestration is covered by the integration suite, not omitted.
scripts/coverage.shruns unit + integration under one coverage measurement so these modules are scored where they are actually exercised. They stay visible — hiding the code that tears down sandboxes and wires networks is the one place we will not omit. -
Interactive entry-point shells are omitted (
.coveragerc), with a rationale comment. This is the only sanctioned use ofomitbesidestests/*.
The forward-looking guard is a diff-coverage gate
(scripts/diff_coverage.py): new/changed executable lines on a branch
must be ≥ 90% covered. This catches regressions where they are
introduced without forcing a back-fill crusade through legacy glue. The
gate skips lines in omitted files (there is no coverage data for them),
so the omit list cannot launder new logic into the dark: anything that
needs real testing must live outside the interactive shells to be
scored at all.
The global percentage is informational, not a CI gate — it would otherwise be hostage to the CI runner's Docker availability and to the omit list.
Consequences
- The number we report (
scripts/coverage.sh) means "coverage of the code we consider testable, across both suites" — a dip is a real regression in code we control, not noise from added CLI glue. - No incentive to write mock-the-mock tests for orchestration to defend a global figure.
- The omit list needs governance: an entry must be a genuinely
interactive shell, justified in the
.coveragerccomment and here.cli/init.pyandcli/tui.pyqualify; backend orchestration does not. - CI must run the integration suite under coverage to score the orchestration modules; where the runner lacks Docker those tests skip and their modules read low — accepted, because the enforced gates (critical-module standard + diff coverage) are Docker-independent.
- "We're at N%" is now a curated figure; outsiders should read the policy, not just the badge.
Links
- PRs #290 (cover the egress adapter), and the coverage-policy PR that introduces this record.
.coveragerc,scripts/coverage.sh,scripts/diff_coverage.py.scripts/critical-modules.txt— the single source of truth for the core-module list; read by bothscripts/coverage.shand theupdate-badges.yml"core coverage" badge so they cannot drift.- The README carries a
core coveragebadge (auto-updated from that list) — the headline number, distinct from the informational globalcoveragebadge.