docs(research): add note on git secret-scanning as defense-in-depth
Threat-models the case where a credential ends up in a tracked file and is git-pushed to a public remote: the secret is compromised the instant the push lands (events API, scrapers), not at merge time. Recommends gitleaks as the smallest-blast-radius layer to add: Go binary, MIT, offline, scans full history, hookable from the existing .githooks/. No code or workflow change; just the research note.
@@ -0,0 +1,246 @@
# Git secret scanning as further hardening

Research into whether claude-bottle should add a secret-scanning step to
its git workflow, both on the host repo and (potentially) inside
bottles, and what tools exist for it. Motivated by the threat model
below: a secret accidentally `git push`ed to a public remote is
*already compromised* the moment the push hits the wire, well before
anyone clicks "merge."

## Summary

A pre-commit / pre-push secret scanner is a cheap, high-leverage layer
of defense-in-depth. It doesn't replace any existing control
(`.gitignore`, environment-variable hygiene, network egress guards), but
it catches the one case where everything else fails: a credential ending
up in a tracked file or commit message and being pushed to a public
remote. For claude-bottle specifically, `gitleaks` is the clearest fit:
Go binary, MIT, scans full history including commit messages, runs
fully offline, and integrates with the existing `.githooks/` directory
without adding a new runtime.

## Attack vector: a secret pushed to a public GitHub repo

The naive mental model is "if I notice the leak before the PR is
merged, I can force-push and the secret is gone." That mental model is
wrong in two distinct ways, and both apply *the instant* `git push`
completes, not at merge time.

### 1. The push itself publishes the secret

The moment a commit lands on a public GitHub repo (on any branch, in
any open PR, on any fork) it is publicly fetchable by URL and
broadcast by GitHub's event firehose:

- The commit is reachable at `github.com/<owner>/<repo>/commit/<sha>`
  and via the API. No login or merge required.
- GitHub's public events API streams every push as it happens. The
  GH Archive project mirrors that stream and republishes it; multiple
  Common-Crawl-style datasets ingest from it on a continuous basis.
- Independent scrapers (and there are *many*) sit on the events API
  watching specifically for commits whose diffs match high-value
  credential patterns: AWS keys, GitHub PATs, Stripe keys, OAuth
  tokens, OpenAI keys, etc. Empirically, observed time-to-abuse for a
  leaked AWS key on a public push is on the order of *seconds to
  minutes*. The window in which "I'll just force-push" still works
  effectively does not exist.
- Even after a force-push, the orphaned commit remains reachable by
  SHA until GitHub's garbage collection runs (and the SHA itself
  leaks via the events API). Treat any secret that has touched a
  public remote as burned: rotate it.

This is the dominant risk, and on its own it is enough to justify
pre-push secret scanning.

### 2. Render-time callbacks fire even on un-merged PRs

A separate, sneakier vector: a public PR doesn't have to be *merged*
to cause an outbound request to a server the pusher controls. Opening
the PR is enough. Rendering surfaces along the review path will fetch
remote resources embedded in the diff or in PR metadata:

- **Markdown files in the diff** are rendered with image and link
  preview support. An image reference pointing at an attacker-controlled
  URL in a new README pings the attacker the moment a reviewer (or
  GitHub's own renderer warming a cache) views the file. GitHub
  proxies most image fetches through `camo.githubusercontent.com`,
  which mitigates direct IP capture but *does not stop the fetch
  itself*: the attacker still gets the request, and the URL path
  can carry exfiltrated bytes.
- **PR body, commit message, and issue body** all render markdown
  with the same image/link semantics. A poisoned commit message
  with an embedded image URL pings home when the PR is opened in the
  web UI, regardless of merge state.
- **Link unfurls** in PR descriptions, Slack notifications wired to
  the repo, and various CI bots all dereference URLs at preview time.
  Each is an independent "ping home" opportunity.
- **CI on PR open** runs on the fork's HEAD by default for many
  workflow types. Any `curl` in a build step is an exfil channel
  before a human ever looks at the PR.

The composite point: by the time a reviewer thinks "hmm, this looks
suspicious, let me close without merging," the bytes that mattered are
already on the attacker's box. Detection has to be at *commit* time
(or *push* time at the latest), not at review time.

### Why this matters for claude-bottle

Two surfaces are exposed:

1. **The claude-bottle repo itself.** Development happens on a host
   with `CLAUDE_BOTTLE_OAUTH_TOKEN`, Gitea tokens, and other
   credentials in the environment. A fixture, test snapshot, log
   capture, or pasted-in debug output could carry one of them into a
   tracked file. The repo's Gitea remote is private, but mirrors or
   GitHub forks may not be.
2. **Bottles that push to external remotes.** An agent inside a bottle
   that has been granted git push credentials to a public repo is one
   prompt injection away from being induced to commit a captured token
   and push it. The egress guard catches *most* exfiltration channels
   (`statsig`, `npm`, arbitrary HTTPS), but `git push` to an
   allowlisted code host is by design an outbound channel that *works*.
   That is the whole point of letting the agent push code. A secret
   scanner on the git push path is the only layer that can
   intercept content before it leaves on an otherwise-legitimate
   channel.

The first surface is the primary one for this note. The second
*looks* harder (anything inside the bottle is by assumption
adversary-controlled, so an in-bottle hook can be bypassed with
`--no-verify`, removed outright, or simply not configured), but it
yields to the same trick already used for the OAuth token and HTTPS
inspection: move the check across a kernel-enforced process boundary
that the agent cannot reach.

### Solution for the bottle surface: a git gate

Route bottle-side `git push` through an intermediary git endpoint,
a "git gate," that runs gitleaks on the incoming objects before
forwarding to the real remote:

- The bottle's git remote is rewritten to point at the gate (e.g.
  `git@gate.bottle.local:owner/repo.git`, or an HTTPS endpoint on a
  loopback / sidecar address). The agent inside the bottle sees only
  the gate URL.
- The gate runs as a separate process / container that the bottle's
  unprivileged agent cannot read or modify. It holds the real push
  credential (deploy key, PAT, ssh agent socket); the bottle never
  sees it, same as the auth-injecting proxy for `ANTHROPIC_BASE_URL`
  in `oauth-token-exposure-to-claude.md`.
- On receive, the gate runs `gitleaks detect` against the incoming
  refs (and their message text) in a temporary working tree. Clean
  pushes are forwarded to the real remote. Findings cause the push to
  be rejected; the agent sees a normal-looking `remote: rejected`
  message.
- Network egress out of the bottle is locked to the gate only.
  Direct pushes to `github.com`, `gitlab.com`, etc., fail at the
  egress layer, so the agent cannot bypass the gate by reconfiguring
  the remote URL. The egress guard already in `network-egress-guard.md`
  is the load-bearing control here.
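
The receive-side check in the list above can be sketched roughly as follows. This is an illustration, not the gate's design: it assumes the gate is a bare repository whose `pre-receive` hook passes its stdin to a hypothetical `gate()` function, that object IDs are 40-character SHA-1 hashes, and that the installed gitleaks v8 release still accepts `detect` with `--log-opts` and forwards those options to `git log`. The names `parse_updates`, `log_opts_for`, `scan`, and `gate` are invented for this example.

```python
import subprocess
import sys

ZERO_SHA = "0" * 40  # all-zero id marks ref creation or deletion (SHA-1 repos)


def parse_updates(stdin_text):
    """Parse pre-receive input: one '<old> <new> <ref>' triple per line."""
    updates = []
    for line in stdin_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            updates.append(tuple(parts))
    return updates


def log_opts_for(old, new):
    """Return git-log options selecting only the commits this update adds."""
    if new == ZERO_SHA:
        return None  # ref deletion: no new content to scan
    if old == ZERO_SHA:
        # New ref: scan commits not already reachable from existing refs.
        return f"{new} --not --all"
    return f"{old}..{new}"


def scan(repo_dir, log_opts):
    """Run gitleaks over the selected commits; True means no findings."""
    cmd = ["gitleaks", "detect", "--source", repo_dir,
           f"--log-opts={log_opts}", "--redact"]
    # gitleaks exits non-zero when it reports findings (or fails to run).
    return subprocess.run(cmd).returncode == 0


def gate(repo_dir, stdin_text):
    """Return 0 to accept the push, 1 to reject it."""
    for old, new, ref in parse_updates(stdin_text):
        opts = log_opts_for(old, new)
        if opts is not None and not scan(repo_dir, opts):
            print(f"rejected {ref}: possible secret detected", file=sys.stderr)
            return 1
    return 0
```

A hook entry point would call something like `sys.exit(gate(".", sys.stdin.read()))`. Forwarding clean pushes to the real remote is left out; that step is where the gate's own credential would be used.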

Properties this gives:

- **`--no-verify` does not help the attacker.** The check runs on the
  *receiving* side, not in a client hook.
- **Bypass attempts fail closed.** Removing the remote, pushing to a
  different host, or shelling out around git all hit the same egress
  allowlist, and a force-push still has to pass the gate's scan.
- **The push credential never enters the bottle.** Compromising the
  agent doesn't compromise the credential, only the ability to ask the
  gate to push on its behalf, and the gate refuses dirty pushes.
- **Pattern is reusable.** The gate is shape-compatible with the
  auth-injecting proxy and with pipelock's TLS interception: a
  sidecar that holds the credential, enforces a policy on traffic,
  and is unreachable by the unprivileged in-bottle UID.
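
The "fail closed" property amounts to a default-deny decision on the push destination. The real enforcement belongs in the network egress guard, not in string matching on URLs, so the Python sketch below only illustrates the policy. `gate.bottle.local` is the example host from the earlier list; `remote_host` and `push_allowed` are hypothetical names.

```python
import re
from urllib.parse import urlsplit

# scp-like git remotes: [user@]host:path (no scheme, no '//' after the colon)
SCP_LIKE = re.compile(r"^(?:[^@/\s]+@)?([^:/\s]+):(?!//)")


def remote_host(url):
    """Extract the host from an scp-like or URL-style git remote, else None."""
    match = SCP_LIKE.match(url)
    if match:
        return match.group(1).lower()
    host = urlsplit(url).hostname
    return host.lower() if host else None


def push_allowed(url, allowed_hosts=frozenset({"gate.bottle.local"})):
    """Default deny: only remotes whose host is on the allowlist pass."""
    host = remote_host(url)
    return host is not None and host in allowed_hosts
```

Anything that does not parse to an allowlisted host, including local paths and other code hosts, is refused, which mirrors how the egress layer treats a reconfigured remote.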

Open questions deferred to design work, not blockers for this note:

- SSH vs HTTPS for the bottle→gate hop. SSH lets the agent use a
  per-bottle key the gate authenticates; HTTPS over loopback is
  simpler and pairs naturally with pipelock.
- Whether the gate should also enforce repo / branch allowlists per
  bottle, or stay narrowly focused on secret scanning. Probably
  narrow first, expand later.
- Performance on large pushes. gitleaks on a few-MB diff is sub-second;
  on a large monorepo first push it may take seconds. Acceptable.

## Tool landscape (2026)

| Tool | Install | Full history | Commit msgs | Detection | Status | License |
|---|---|---|---|---|---|---|
| **gitleaks** | `brew install gitleaks` | yes (`--log-opts="--all"`) | yes | regex + entropy, ~150 rules | very active (v8.x) | MIT |
| **TruffleHog v3** | `brew`, Docker, GH Action | yes | yes | regex + **live API verification** (700+ types) | very active | AGPL-3.0 |
| **detect-secrets** (Yelp) | `pip install` | working tree + baseline | no | regex + entropy | active, slower cadence | Apache 2.0 |
| **git-secrets** (AWS Labs) | `brew` | yes (`--scan-history`) | yes | regex only, AWS-focused | maintenance mode since ~2021 | Apache 2.0 |
| **ggshield** (GitGuardian) | `pip install` | yes | yes | proprietary ML + regex, optional verification | very active, commercial | MIT CLI, SaaS engine |

Other names that surfaced but do not fit: Semgrep (general SAST, not
git-history-native), `whispers` (Python, low adoption).

Notes on the contenders:

- **gitleaks** is the de-facto open-source standard. The `protect
  --staged` mode is purpose-built for pre-commit; the `detect
  --log-opts="--all"` mode walks every commit including message text.
  No network dependency. Releases from v8.19 onward appear to deprecate
  `detect` and `protect` in favor of a `gitleaks git` subcommand, so
  the exact invocation is worth checking against the installed version.
- **TruffleHog v3** is the strongest one-shot auditor because it can
  make live API calls to confirm whether a found credential is still
  valid. AGPL is fine for internal use but a consideration if the
  detection step gets embedded in a redistributable artifact. Best as
  a "run once on history" tool, not as the everyday hook.
- **detect-secrets** is well-loved for its baseline-file workflow (you
  declare existing secrets as known and only fail on new ones), but
  commit-message scanning is not first-class.
- **git-secrets** is effectively superseded by gitleaks. No reason to
  start a new project on it.
- **ggshield** is good if you want a hosted dashboard and incident
  management, but it ships repo content to GitGuardian's API by
  default, which makes it the wrong fit for a project whose entire
  premise is sandbox isolation.
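
To make "regex + entropy" concrete, here is a minimal Python sketch of the general technique. It is not how gitleaks or detect-secrets are implemented: the two patterns, the 3.5-bit threshold, and the function names are assumptions chosen for illustration, and a real ruleset is far larger.

```python
import math
import re
from collections import Counter

# Illustrative rules only; real scanners ship many more.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
GENERIC_ASSIGNMENT = re.compile(
    r"(?i)\b(?:secret|token|api[_-]?key|password)\b\s*[:=]\s*['\"]?"
    r"([A-Za-z0-9+/_=-]{20,})"
)


def shannon_entropy(value):
    """Bits per character; random-looking strings score higher."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(text, min_entropy=3.5):
    """Return (line number, rule name) pairs for lines that look like secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Structured credentials: the regex alone is specific enough.
        if AWS_ACCESS_KEY_ID.search(line):
            findings.append((lineno, "aws-access-key-id"))
            continue
        # Generic assignments: the regex finds the value, and the entropy
        # check filters out placeholders such as repeated characters.
        match = GENERIC_ASSIGNMENT.search(line)
        if match and shannon_entropy(match.group(1)) >= min_entropy:
            findings.append((lineno, "generic-high-entropy"))
    return findings
```

The entropy gate is what keeps a line like `password = "aaaaaaaaaaaaaaaaaaaaaaaa"` from being reported while a random-looking 24-character value on the same kind of line is.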

## Commit message scanning

Several real-world leaks have come from secrets pasted into commit
messages (often by automation that includes captured CLI output in
auto-generated messages). gitleaks, TruffleHog, git-secrets, and
ggshield all scan message text in history mode. detect-secrets does
not; it is content-focused. Anyone using detect-secrets should pair
it with a separate message-scanning step.
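
One possible shape for such a separate message-scanning step is a `commit-msg` hook that checks the message file git passes as its first argument. The Python sketch below is illustrative only: the two patterns are examples rather than a complete ruleset, and `scan_message` and `commit_msg_hook` are hypothetical names.

```python
import re
import sys
from pathlib import Path

# Example patterns only; extend or replace with a maintained ruleset.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github-pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def scan_message(message):
    """Return the names of rules that match any non-comment line."""
    hits = []
    for line in message.splitlines():
        if line.startswith("#"):
            continue  # skip the comment lines found in commit templates
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append(name)
    return hits


def commit_msg_hook(message_path):
    """Return 1 (abort the commit) if the message matches any rule."""
    text = Path(message_path).read_text(encoding="utf-8", errors="replace")
    hits = scan_message(text)
    for name in hits:
        print(f"commit-msg: possible secret ({name})", file=sys.stderr)
    return 1 if hits else 0
```

A hook script would end with something like `sys.exit(commit_msg_hook(sys.argv[1]))`.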

## Recommended path forward

In priority order, for the host claude-bottle repo:

1. **One-time retro scan** with gitleaks:
   `gitleaks detect --source . --log-opts="--all" --redact`.
   Catches anything currently in history including commit messages.
   `--redact` keeps the secret values themselves out of the report
   output. Treat any hit as a credential to rotate, not as a "fix the
   commit" problem (see the attack-vector section above for why).
2. **Add a pre-commit hook in `.githooks/`** that runs
   `gitleaks protect --staged`. Fits the project's existing pattern
   (the conventional-commits `commit-msg` hook lives there), needs no
   new runtime, fully offline.
3. **Optional one-time live-verification pass** with TruffleHog:
   `trufflehog git file://. --only-verified`. Tells you whether any
   historically leaked credential is still active, which is actionable
   in a way pattern-matching alone is not.
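
The logic of the pre-commit hook in step 2 can be sketched as below. In practice the hook would more likely be a few lines of shell, so that no new runtime is needed; Python is used here only for illustration. The sketch assumes a gitleaks v8 release that still accepts `protect --staged`, and `hook_command` and `run_hook` are hypothetical names.

```python
import shutil
import subprocess
import sys


def hook_command():
    """Build the gitleaks invocation that scans staged changes only."""
    return ["gitleaks", "protect", "--staged", "--redact"]


def run_hook(which=shutil.which, run=subprocess.run):
    """Return the exit status the pre-commit hook should use.

    `which` and `run` are injectable so the logic can be exercised
    without gitleaks installed.
    """
    if which("gitleaks") is None:
        # Fail closed: a missing scanner blocks the commit instead of
        # silently skipping the check.
        print("pre-commit: gitleaks not found on PATH", file=sys.stderr)
        return 1
    # gitleaks exits non-zero when it reports findings, aborting the commit.
    return run(hook_command()).returncode
```

A `.githooks/pre-commit` script would end with `sys.exit(run_hook())`. Whether a missing binary should block the commit or only warn is a policy choice; failing closed is the stricter of the two.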

For the bottle surface, the follow-up is the git-gate design sketched
above: a sidecar git endpoint that bottles push to, that runs
gitleaks on incoming refs before forwarding to the real remote, with
egress locked to the gate so the agent cannot push directly. Tracked
as a future PRD; ordering it after the network egress guard is
correct because the egress allowlist is the control that prevents
bypass and is therefore the prerequisite.

## Sources

- [gitleaks: GitHub repository](https://github.com/gitleaks/gitleaks)
- [TruffleHog: GitHub repository](https://github.com/trufflesecurity/trufflehog)
- [detect-secrets: GitHub repository](https://github.com/Yelp/detect-secrets)
- [git-secrets: AWS Labs GitHub repository](https://github.com/awslabs/git-secrets)
- [GitGuardian ggshield](https://github.com/GitGuardian/ggshield)
- [GH Archive: public GitHub event firehose mirror](https://www.gharchive.org/)
- [GitHub camo image proxy documentation](https://github.com/atmos/camo)
- [Truffle Security: "How fast are leaked secrets abused?"](https://trufflesecurity.com/blog/)