Files
bot-bottle/docs/decisions/0004-coverage-policy.md
T
didericis d89d389bef ci(coverage): add auto-updated "core coverage" badge
Surface the metric ADR 0004 says matters — the critical security/logic
core, currently 95% — as a README badge, distinct from the
informational global `coverage` badge.

- scripts/critical-modules.txt: single source of truth for the core
  module list. scripts/coverage.sh now reads it (instead of a hardcoded
  string) and update-badges.yml reads the same file, so the badge and
  the `critical` report cannot drift.
- update-badges.yml: a `core coverage` step reuses the unit-coverage
  data (every core module is unit-tested, so unit-only is accurate for
  it) and sed-updates the new badge, like the existing ones.
- README: `core coverage 95%` badge linking to ADR 0004 so a reader can
  find out what "core" means.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
2026-06-26 02:19:35 -04:00

4.4 KiB

ADR 0004: Risk-weighted coverage, not a single global target

  • Status: Accepted
  • Date: 2026-06-25
  • Deciders: didericis

Context

bot-bottle is a security tool: it sandboxes agents, scans egress for secret exfiltration, strips credentials, and gates git pushes. A latent bug in that logic is expensive, so test coverage there genuinely matters. But the repo also contains code where coverage is a poor signal:

  • Interactive entry-point shellscli/init.py (a read_tty_line() prompt loop) and cli/tui.py (a curses picker). Their bodies are I/O; a unit test has to fake the entire terminal conversation, so it inflates the number without asserting behaviour that would otherwise go unchecked.
  • Subprocess / backend orchestration — the docker / smolmachines / macos-container backends shell out to docker, container, smolvm. Mock-heavy unit tests here mostly re-assert the argv you already wrote (the test passes whether or not the real teardown works), while many of the missed branches are failure paths you cannot provoke against a real daemon on cue.

Chasing a single global percentage (e.g. 90%) pushes the most test effort onto the least safety-relevant code — exactly backwards — and invites performative tests written to colour a line rather than to catch a regression (Goodhart's law).

Decision

Coverage is risk-weighted, measured over the combined unit + integration suites, with three rules:

  1. Critical modules target ≥ 90%. The security/logic core — egress_addon{,_core}.py, dlp_detectors.py, egress.py, manifest*.py, git_gate.py, git_http_backend.py, supervise.py, yaml_subset.py, bottle_state.py — is Docker-independent and unit-testable, so it carries the high bar. We ratchet toward 90% as these modules are touched; new gaps in them are not acceptable.

  2. Subprocess/backend orchestration is covered by the integration suite, not omitted. scripts/coverage.sh runs unit + integration under one coverage measurement so these modules are scored where they are actually exercised. They stay visible — hiding the code that tears down sandboxes and wires networks is the one place we will not omit.

  3. Interactive entry-point shells are omitted (.coveragerc), with a rationale comment. This is the only sanctioned use of omit besides tests/*.

The forward-looking guard is a diff-coverage gate (scripts/diff_coverage.py): new/changed executable lines on a branch must be ≥ 90% covered. This catches regressions where they are introduced without forcing a back-fill crusade through legacy glue. The gate skips lines in omitted files (there is no coverage data for them), so the omit list cannot launder new logic into the dark: anything that needs real testing must live outside the interactive shells to be scored at all.

The global percentage is informational, not a CI gate — it would otherwise be hostage to the CI runner's Docker availability and to the omit list.

Consequences

  • The number we report (scripts/coverage.sh) means "coverage of the code we consider testable, across both suites" — a dip is a real regression in code we control, not noise from added CLI glue.
  • No incentive to write mock-the-mock tests for orchestration to defend a global figure.
  • The omit list needs governance: an entry must be a genuinely interactive shell, justified in the .coveragerc comment and here. cli/init.py and cli/tui.py qualify; backend orchestration does not.
  • CI must run the integration suite under coverage to score the orchestration modules; where the runner lacks Docker those tests skip and their modules read low — accepted, because the enforced gates (critical-module standard + diff coverage) are Docker-independent.
  • "We're at N%" is now a curated figure; outsiders should read the policy, not just the badge.
  • PRs #290 (cover the egress adapter), and the coverage-policy PR that introduces this record.
  • .coveragerc, scripts/coverage.sh, scripts/diff_coverage.py.
  • scripts/critical-modules.txt — the single source of truth for the core-module list; read by both scripts/coverage.sh and the update-badges.yml "core coverage" badge so they cannot drift.
  • The README carries a core coverage badge (auto-updated from that list) — the headline number, distinct from the informational global coverage badge.