Files
bot-bottle/docs/decisions/0004-coverage-policy.md
T
didericis 632ab002ed
lint / lint (push) Successful in 1m52s
test / unit (pull_request) Successful in 46s
test / integration (pull_request) Successful in 16s
test / coverage (pull_request) Successful in 1m2s
ci(coverage): risk-weighted coverage policy + diff-coverage gate
Adopt ADR 0004: stop chasing a single global coverage number and
measure what matters instead.

- Omit the genuinely-interactive `cli/init.py` shell (read_tty_line
  prompt loops) alongside the existing `cli/tui.py`, with a rationale
  comment in .coveragerc. Subprocess/backend orchestration is NOT
  omitted — it stays visible and is scored via the integration suite.
- scripts/coverage.sh runs unit + integration under one coverage
  measurement (the policy's yardstick) and can report the critical
  security/logic core held to the >=90% target.
- scripts/diff_coverage.py is a stdlib-only gate (no diff-cover dep):
  new/changed executable lines must be >=90% covered. This is the
  enforced regression guard; the global number is informational.
- CI gains a `coverage` job: combined report + the diff-coverage gate.
- Unit-test `cli/__init__.py` dispatch/exit-code mapping (it's logic,
  not I/O, so it earns tests rather than an omit).

Combined unit+integration coverage now reports 83% global / 87% across
the critical modules; per-module ratcheting toward 90% is the ongoing
work this policy frames.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
2026-06-25 21:29:08 -04:00

91 lines
4.0 KiB
Markdown

# ADR 0004: Risk-weighted coverage, not a single global target
- **Status:** Accepted
- **Date:** 2026-06-25
- **Deciders:** didericis
## Context
bot-bottle is a security tool: it sandboxes agents, scans egress for
secret exfiltration, strips credentials, and gates git pushes. A latent
bug in that logic is expensive, so test coverage there genuinely
matters. But the repo also contains code where coverage is a poor
signal:
- **Interactive entry-point shells** — `cli/init.py` (a `read_tty_line()`
prompt loop) and `cli/tui.py` (a curses picker). Their bodies are I/O;
a unit test has to fake the entire terminal conversation, so it
inflates the number without asserting behaviour that would otherwise
go unchecked.
- **Subprocess / backend orchestration** — the docker / smolmachines /
macos-container backends shell out to `docker`, `container`, `smolvm`.
Mock-heavy unit tests here mostly re-assert the argv you already
wrote (the test passes whether or not the real teardown works), while
many of the missed *branches* are failure paths you cannot provoke
against a real daemon on cue.
Chasing a single global percentage (e.g. 90%) pushes the most test
effort onto the least safety-relevant code — exactly backwards — and
invites performative tests written to colour a line rather than to catch
a regression (Goodhart's law).
## Decision
Coverage is **risk-weighted**, measured over the **combined unit +
integration** suites, with three rules:
1. **Critical modules target ≥ 90%.** The security/logic core —
`egress_addon{,_core}.py`, `dlp_detectors.py`, `egress.py`,
`manifest*.py`, `git_gate.py`, `git_http_backend.py`, `supervise.py`,
`yaml_subset.py`, `bottle_state.py` — is Docker-independent and
unit-testable, so it carries the high bar. We ratchet toward 90% as
these modules are touched; new gaps in them are not acceptable.
2. **Subprocess/backend orchestration is covered by the integration
suite, not omitted.** `scripts/coverage.sh` runs unit + integration
under one coverage measurement so these modules are scored where they
are actually exercised. They stay *visible* — hiding the code that
tears down sandboxes and wires networks is the one place we will not
omit.
3. **Interactive entry-point shells are omitted** (`.coveragerc`), with a
rationale comment. This is the only sanctioned use of `omit` besides
`tests/*`.
The forward-looking guard is a **diff-coverage gate**
(`scripts/diff_coverage.py`): new/changed executable lines on a branch
must be ≥ 90% covered. This catches regressions where they are
introduced without forcing a back-fill crusade through legacy glue. The
gate skips lines in omitted files (there is no coverage data for them),
so the omit list cannot launder *new* logic into the dark: anything that
needs real testing must live outside the interactive shells to be
scored at all.
The **global percentage is informational**, not a CI gate — it would
otherwise be hostage to the CI runner's Docker availability and to the
omit list.
## Consequences
- The number we report (`scripts/coverage.sh`) means "coverage of the
code we consider testable, across both suites" — a dip is a real
regression in code we control, not noise from added CLI glue.
- No incentive to write mock-the-mock tests for orchestration to defend
a global figure.
- The omit list needs governance: an entry must be a genuinely
interactive shell, justified in the `.coveragerc` comment and here.
`cli/init.py` and `cli/tui.py` qualify; backend orchestration does
not.
- CI must run the integration suite under coverage to score the
orchestration modules; where the runner lacks Docker those tests skip
and their modules read low — accepted, because the *enforced* gates
(critical-module standard + diff coverage) are Docker-independent.
- "We're at N%" is now a curated figure; outsiders should read the
policy, not just the badge.
## Links
- PRs #290 (cover the egress adapter), and the coverage-policy PR that
introduces this record.
- `.coveragerc`, `scripts/coverage.sh`, `scripts/diff_coverage.py`.