Files
bot-bottle/docs/decisions/0004-coverage-policy.md
T
didericis 632ab002ed
lint / lint (push) Successful in 1m52s
test / unit (pull_request) Successful in 46s
test / integration (pull_request) Successful in 16s
test / coverage (pull_request) Successful in 1m2s
ci(coverage): risk-weighted coverage policy + diff-coverage gate
Adopt ADR 0004: stop chasing a single global coverage number and
measure what matters instead.

- Omit the genuinely-interactive `cli/init.py` shell (read_tty_line
  prompt loops) alongside the existing `cli/tui.py`, with a rationale
  comment in .coveragerc. Subprocess/backend orchestration is NOT
  omitted — it stays visible and is scored via the integration suite.
- scripts/coverage.sh runs unit + integration under one coverage
  measurement (the policy's yardstick) and can report the critical
  security/logic core held to the >=90% target.
- scripts/diff_coverage.py is a stdlib-only gate (no diff-cover dep):
  new/changed executable lines must be >=90% covered. This is the
  enforced regression guard; the global number is informational.
- CI gains a `coverage` job: combined report + the diff-coverage gate.
- Unit-test `cli/__init__.py` dispatch/exit-code mapping (it's logic,
  not I/O, so it earns tests rather than an omit).

Combined unit+integration coverage now reports 83% global / 87% across
the critical modules; per-module ratcheting toward 90% is the ongoing
work this policy frames.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
2026-06-25 21:29:08 -04:00

4.0 KiB

ADR 0004: Risk-weighted coverage, not a single global target

  • Status: Accepted
  • Date: 2026-06-25
  • Deciders: didericis

Context

bot-bottle is a security tool: it sandboxes agents, scans egress for secret exfiltration, strips credentials, and gates git pushes. A latent bug in that logic is expensive, so test coverage there genuinely matters. But the repo also contains code where coverage is a poor signal:

  • Interactive entry-point shellscli/init.py (a read_tty_line() prompt loop) and cli/tui.py (a curses picker). Their bodies are I/O; a unit test has to fake the entire terminal conversation, so it inflates the number without asserting behaviour that would otherwise go unchecked.
  • Subprocess / backend orchestration — the docker / smolmachines / macos-container backends shell out to docker, container, smolvm. Mock-heavy unit tests here mostly re-assert the argv you already wrote (the test passes whether or not the real teardown works), while many of the missed branches are failure paths you cannot provoke against a real daemon on cue.

Chasing a single global percentage (e.g. 90%) pushes the most test effort onto the least safety-relevant code — exactly backwards — and invites performative tests written to colour a line rather than to catch a regression (Goodhart's law).

Decision

Coverage is risk-weighted, measured over the combined unit + integration suites, with three rules:

  1. Critical modules target ≥ 90%. The security/logic core — egress_addon{,_core}.py, dlp_detectors.py, egress.py, manifest*.py, git_gate.py, git_http_backend.py, supervise.py, yaml_subset.py, bottle_state.py — is Docker-independent and unit-testable, so it carries the high bar. We ratchet toward 90% as these modules are touched; new gaps in them are not acceptable.

  2. Subprocess/backend orchestration is covered by the integration suite, not omitted. scripts/coverage.sh runs unit + integration under one coverage measurement so these modules are scored where they are actually exercised. They stay visible — hiding the code that tears down sandboxes and wires networks is the one place we will not omit.

  3. Interactive entry-point shells are omitted (.coveragerc), with a rationale comment. This is the only sanctioned use of omit besides tests/*.

The forward-looking guard is a diff-coverage gate (scripts/diff_coverage.py): new/changed executable lines on a branch must be ≥ 90% covered. This catches regressions where they are introduced without forcing a back-fill crusade through legacy glue. The gate skips lines in omitted files (there is no coverage data for them), so the omit list cannot launder new logic into the dark: anything that needs real testing must live outside the interactive shells to be scored at all.

The global percentage is informational, not a CI gate — it would otherwise be hostage to the CI runner's Docker availability and to the omit list.

Consequences

  • The number we report (scripts/coverage.sh) means "coverage of the code we consider testable, across both suites" — a dip is a real regression in code we control, not noise from added CLI glue.
  • No incentive to write mock-the-mock tests for orchestration to defend a global figure.
  • The omit list needs governance: an entry must be a genuinely interactive shell, justified in the .coveragerc comment and here. cli/init.py and cli/tui.py qualify; backend orchestration does not.
  • CI must run the integration suite under coverage to score the orchestration modules; where the runner lacks Docker those tests skip and their modules read low — accepted, because the enforced gates (critical-module standard + diff coverage) are Docker-independent.
  • "We're at N%" is now a curated figure; outsiders should read the policy, not just the badge.
  • PRs #290 (cover the egress adapter), and the coverage-policy PR that introduces this record.
  • .coveragerc, scripts/coverage.sh, scripts/diff_coverage.py.