Files

T

didericis d89d389bef ci(coverage): add auto-updated "core coverage" badge

Surface the metric ADR 0004 says matters — the critical security/logic
core, currently 95% — as a README badge, distinct from the
informational global `coverage` badge.

- scripts/critical-modules.txt: single source of truth for the core
  module list. scripts/coverage.sh now reads it (instead of a hardcoded
  string) and update-badges.yml reads the same file, so the badge and
  the `critical` report cannot drift.
- update-badges.yml: a `core coverage` step reuses the unit-coverage
  data (every core module is unit-tested, so unit-only is accurate for
  it) and sed-updates the new badge, like the existing ones.
- README: `core coverage 95%` badge linking to ADR 0004 so a reader can
  find out what "core" means.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9

2026-06-26 02:19:35 -04:00

4.4 KiB

Raw Blame History

ADR 0004: Risk-weighted coverage, not a single global target

Status: Accepted
Date: 2026-06-25
Deciders: didericis

Context

bot-bottle is a security tool: it sandboxes agents, scans egress for secret exfiltration, strips credentials, and gates git pushes. A latent bug in that logic is expensive, so test coverage there genuinely matters. But the repo also contains code where coverage is a poor signal:

Interactive entry-point shells — cli/init.py (a read_tty_line() prompt loop) and cli/tui.py (a curses picker). Their bodies are I/O; a unit test has to fake the entire terminal conversation, so it inflates the number without asserting behaviour that would otherwise go unchecked.
Subprocess / backend orchestration — the docker / smolmachines / macos-container backends shell out to docker, container, smolvm. Mock-heavy unit tests here mostly re-assert the argv you already wrote (the test passes whether or not the real teardown works), while many of the missed branches are failure paths you cannot provoke against a real daemon on cue.

Chasing a single global percentage (e.g. 90%) pushes the most test effort onto the least safety-relevant code — exactly backwards — and invites performative tests written to colour a line rather than to catch a regression (Goodhart's law).

Decision

Coverage is risk-weighted, measured over the combined unit + integration suites, with three rules:

Critical modules target ≥ 90%. The security/logic core — egress_addon{,_core}.py, dlp_detectors.py, egress.py, manifest*.py, git_gate.py, git_http_backend.py, supervise.py, yaml_subset.py, bottle_state.py — is Docker-independent and unit-testable, so it carries the high bar. We ratchet toward 90% as these modules are touched; new gaps in them are not acceptable.
Subprocess/backend orchestration is covered by the integration suite, not omitted. scripts/coverage.sh runs unit + integration under one coverage measurement so these modules are scored where they are actually exercised. They stay visible — hiding the code that tears down sandboxes and wires networks is the one place we will not omit.
Interactive entry-point shells are omitted (.coveragerc), with a rationale comment. This is the only sanctioned use of omit besides tests/*.

The forward-looking guard is a diff-coverage gate (scripts/diff_coverage.py): new/changed executable lines on a branch must be ≥ 90% covered. This catches regressions where they are introduced without forcing a back-fill crusade through legacy glue. The gate skips lines in omitted files (there is no coverage data for them), so the omit list cannot launder new logic into the dark: anything that needs real testing must live outside the interactive shells to be scored at all.

The global percentage is informational, not a CI gate — it would otherwise be hostage to the CI runner's Docker availability and to the omit list.

Consequences

The number we report (scripts/coverage.sh) means "coverage of the code we consider testable, across both suites" — a dip is a real regression in code we control, not noise from added CLI glue.
No incentive to write mock-the-mock tests for orchestration to defend a global figure.
The omit list needs governance: an entry must be a genuinely interactive shell, justified in the .coveragerc comment and here. cli/init.py and cli/tui.py qualify; backend orchestration does not.
CI must run the integration suite under coverage to score the orchestration modules; where the runner lacks Docker those tests skip and their modules read low — accepted, because the enforced gates (critical-module standard + diff coverage) are Docker-independent.
"We're at N%" is now a curated figure; outsiders should read the policy, not just the badge.

4.4 KiB Raw Blame History

ADR 0004: Risk-weighted coverage, not a single global target

Context

Decision

Consequences

Links

4.4 KiB

Raw Blame History