d89d389bef
Surface the metric ADR 0004 says matters — the critical security/logic core, currently 95% — as a README badge, distinct from the informational global `coverage` badge. - scripts/critical-modules.txt: single source of truth for the core module list. scripts/coverage.sh now reads it (instead of a hardcoded string) and update-badges.yml reads the same file, so the badge and the `critical` report cannot drift. - update-badges.yml: a `core coverage` step reuses the unit-coverage data (every core module is unit-tested, so unit-only is accurate for it) and sed-updates the new badge, like the existing ones. - README: `core coverage 95%` badge linking to ADR 0004 so a reader can find out what "core" means. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
97 lines
4.4 KiB
Markdown
97 lines
4.4 KiB
Markdown
# ADR 0004: Risk-weighted coverage, not a single global target
|
|
|
|
- **Status:** Accepted
|
|
- **Date:** 2026-06-25
|
|
- **Deciders:** didericis
|
|
|
|
## Context
|
|
|
|
bot-bottle is a security tool: it sandboxes agents, scans egress for
|
|
secret exfiltration, strips credentials, and gates git pushes. A latent
|
|
bug in that logic is expensive, so test coverage there genuinely
|
|
matters. But the repo also contains code where coverage is a poor
|
|
signal:
|
|
|
|
- **Interactive entry-point shells** — `cli/init.py` (a `read_tty_line()`
|
|
prompt loop) and `cli/tui.py` (a curses picker). Their bodies are I/O;
|
|
a unit test has to fake the entire terminal conversation, so it
|
|
inflates the number without asserting behaviour that would otherwise
|
|
go unchecked.
|
|
- **Subprocess / backend orchestration** — the docker / smolmachines /
|
|
macos-container backends shell out to `docker`, `container`, `smolvm`.
|
|
Mock-heavy unit tests here mostly re-assert the argv you already
|
|
wrote (the test passes whether or not the real teardown works), while
|
|
many of the missed *branches* are failure paths you cannot provoke
|
|
against a real daemon on cue.
|
|
|
|
Chasing a single global percentage (e.g. 90%) pushes the most test
|
|
effort onto the least safety-relevant code — exactly backwards — and
|
|
invites performative tests written to colour a line rather than to catch
|
|
a regression (Goodhart's law).
|
|
|
|
## Decision
|
|
|
|
Coverage is **risk-weighted**, measured over the **combined unit +
|
|
integration** suites, with three rules:
|
|
|
|
1. **Critical modules target ≥ 90%.** The security/logic core —
|
|
`egress_addon{,_core}.py`, `dlp_detectors.py`, `egress.py`,
|
|
`manifest*.py`, `git_gate.py`, `git_http_backend.py`, `supervise.py`,
|
|
`yaml_subset.py`, `bottle_state.py` — is Docker-independent and
|
|
unit-testable, so it carries the high bar. We ratchet toward 90% as
|
|
these modules are touched; new gaps in them are not acceptable.
|
|
|
|
2. **Subprocess/backend orchestration is covered by the integration
|
|
suite, not omitted.** `scripts/coverage.sh` runs unit + integration
|
|
under one coverage measurement so these modules are scored where they
|
|
are actually exercised. They stay *visible* — hiding the code that
|
|
tears down sandboxes and wires networks is the one place we will not
|
|
omit.
|
|
|
|
3. **Interactive entry-point shells are omitted** (`.coveragerc`), with a
|
|
rationale comment. This is the only sanctioned use of `omit` besides
|
|
`tests/*`.
|
|
|
|
The forward-looking guard is a **diff-coverage gate**
|
|
(`scripts/diff_coverage.py`): new/changed executable lines on a branch
|
|
must be ≥ 90% covered. This catches regressions where they are
|
|
introduced without forcing a back-fill crusade through legacy glue. The
|
|
gate skips lines in omitted files (there is no coverage data for them),
|
|
so the omit list cannot launder *new* logic into the dark: anything that
|
|
needs real testing must live outside the interactive shells to be
|
|
scored at all.
|
|
|
|
The **global percentage is informational**, not a CI gate — it would
|
|
otherwise be hostage to the CI runner's Docker availability and to the
|
|
omit list.
|
|
|
|
## Consequences
|
|
|
|
- The number we report (`scripts/coverage.sh`) means "coverage of the
|
|
code we consider testable, across both suites" — a dip is a real
|
|
regression in code we control, not noise from added CLI glue.
|
|
- No incentive to write mock-the-mock tests for orchestration to defend
|
|
a global figure.
|
|
- The omit list needs governance: an entry must be a genuinely
|
|
interactive shell, justified in the `.coveragerc` comment and here.
|
|
`cli/init.py` and `cli/tui.py` qualify; backend orchestration does
|
|
not.
|
|
- CI must run the integration suite under coverage to score the
|
|
orchestration modules; where the runner lacks Docker those tests skip
|
|
and their modules read low — accepted, because the *enforced* gates
|
|
(critical-module standard + diff coverage) are Docker-independent.
|
|
- "We're at N%" is now a curated figure; outsiders should read the
|
|
policy, not just the badge.
|
|
|
|
## Links
|
|
|
|
- PRs #290 (cover the egress adapter), and the coverage-policy PR that
|
|
introduces this record.
|
|
- `.coveragerc`, `scripts/coverage.sh`, `scripts/diff_coverage.py`.
|
|
- `scripts/critical-modules.txt` — the single source of truth for the
|
|
core-module list; read by both `scripts/coverage.sh` and the
|
|
`update-badges.yml` "core coverage" badge so they cannot drift.
|
|
- The README carries a `core coverage` badge (auto-updated from that
|
|
list) — the headline number, distinct from the informational global
|
|
`coverage` badge.
|