632ab002ed
Adopt ADR 0004: stop chasing a single global coverage number and measure what matters instead. - Omit the genuinely-interactive `cli/init.py` shell (read_tty_line prompt loops) alongside the existing `cli/tui.py`, with a rationale comment in .coveragerc. Subprocess/backend orchestration is NOT omitted — it stays visible and is scored via the integration suite. - scripts/coverage.sh runs unit + integration under one coverage measurement (the policy's yardstick) and can report the critical security/logic core held to the >=90% target. - scripts/diff_coverage.py is a stdlib-only gate (no diff-cover dep): new/changed executable lines must be >=90% covered. This is the enforced regression guard; the global number is informational. - CI gains a `coverage` job: combined report + the diff-coverage gate. - Unit-test `cli/__init__.py` dispatch/exit-code mapping (it's logic, not I/O, so it earns tests rather than an omit). Combined unit+integration coverage now reports 83% global / 87% across the critical modules; per-module ratcheting toward 90% is the ongoing work this policy frames. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
91 lines
4.0 KiB
Markdown
91 lines
4.0 KiB
Markdown
# ADR 0004: Risk-weighted coverage, not a single global target
|
|
|
|
- **Status:** Accepted
|
|
- **Date:** 2026-06-25
|
|
- **Deciders:** didericis
|
|
|
|
## Context
|
|
|
|
bot-bottle is a security tool: it sandboxes agents, scans egress for
|
|
secret exfiltration, strips credentials, and gates git pushes. A latent
|
|
bug in that logic is expensive, so test coverage there genuinely
|
|
matters. But the repo also contains code where coverage is a poor
|
|
signal:
|
|
|
|
- **Interactive entry-point shells** — `cli/init.py` (a `read_tty_line()`
|
|
prompt loop) and `cli/tui.py` (a curses picker). Their bodies are I/O;
|
|
a unit test has to fake the entire terminal conversation, so it
|
|
inflates the number without asserting behaviour that would otherwise
|
|
go unchecked.
|
|
- **Subprocess / backend orchestration** — the docker / smolmachines /
|
|
macos-container backends shell out to `docker`, `container`, `smolvm`.
|
|
Mock-heavy unit tests here mostly re-assert the argv you already
|
|
wrote (the test passes whether or not the real teardown works), while
|
|
many of the missed *branches* are failure paths you cannot provoke
|
|
against a real daemon on cue.
|
|
|
|
Chasing a single global percentage (e.g. 90%) pushes the most test
|
|
effort onto the least safety-relevant code — exactly backwards — and
|
|
invites performative tests written to colour a line rather than to catch
|
|
a regression (Goodhart's law).
|
|
|
|
## Decision
|
|
|
|
Coverage is **risk-weighted**, measured over the **combined unit +
|
|
integration** suites, with three rules:
|
|
|
|
1. **Critical modules target ≥ 90%.** The security/logic core —
|
|
`egress_addon{,_core}.py`, `dlp_detectors.py`, `egress.py`,
|
|
`manifest*.py`, `git_gate.py`, `git_http_backend.py`, `supervise.py`,
|
|
`yaml_subset.py`, `bottle_state.py` — is Docker-independent and
|
|
unit-testable, so it carries the high bar. We ratchet toward 90% as
|
|
these modules are touched; new gaps in them are not acceptable.
|
|
|
|
2. **Subprocess/backend orchestration is covered by the integration
|
|
suite, not omitted.** `scripts/coverage.sh` runs unit + integration
|
|
under one coverage measurement so these modules are scored where they
|
|
are actually exercised. They stay *visible* — hiding the code that
|
|
tears down sandboxes and wires networks is the one place we will not
|
|
omit.
|
|
|
|
3. **Interactive entry-point shells are omitted** (`.coveragerc`), with a
|
|
rationale comment. This is the only sanctioned use of `omit` besides
|
|
`tests/*`.
|
|
|
|
The forward-looking guard is a **diff-coverage gate**
|
|
(`scripts/diff_coverage.py`): new/changed executable lines on a branch
|
|
must be ≥ 90% covered. This catches regressions where they are
|
|
introduced without forcing a back-fill crusade through legacy glue. The
|
|
gate skips lines in omitted files (there is no coverage data for them),
|
|
so the omit list cannot launder *new* logic into the dark: anything that
|
|
needs real testing must live outside the interactive shells to be
|
|
scored at all.
|
|
|
|
The **global percentage is informational**, not a CI gate — it would
|
|
otherwise be hostage to the CI runner's Docker availability and to the
|
|
omit list.
|
|
|
|
## Consequences
|
|
|
|
- The number we report (`scripts/coverage.sh`) means "coverage of the
|
|
code we consider testable, across both suites" — a dip is a real
|
|
regression in code we control, not noise from added CLI glue.
|
|
- No incentive to write mock-the-mock tests for orchestration to defend
|
|
a global figure.
|
|
- The omit list needs governance: an entry must be a genuinely
|
|
interactive shell, justified in the `.coveragerc` comment and here.
|
|
`cli/init.py` and `cli/tui.py` qualify; backend orchestration does
|
|
not.
|
|
- CI must run the integration suite under coverage to score the
|
|
orchestration modules; where the runner lacks Docker those tests skip
|
|
and their modules read low — accepted, because the *enforced* gates
|
|
(critical-module standard + diff coverage) are Docker-independent.
|
|
- "We're at N%" is now a curated figure; outsiders should read the
|
|
policy, not just the badge.
|
|
|
|
## Links
|
|
|
|
- PRs #290 (cover the egress adapter), and the coverage-policy PR that
|
|
introduces this record.
|
|
- `.coveragerc`, `scripts/coverage.sh`, `scripts/diff_coverage.py`.
|