# Git secret scanning as further hardening

Research into whether bot-bottle should add a secret-scanning step to
its git workflow, both on the host repo and (potentially) inside
bottles, and what tools exist for it. Motivated by the threat model
below: a secret accidentally `git push`ed to a public remote is
*already compromised* the moment the push hits the wire, well before
anyone clicks "merge."

## Summary

A pre-commit / pre-push secret scanner is a cheap, high-leverage layer
of defense-in-depth. It doesn't replace any existing control
(`.gitignore`, environment-variable hygiene, network egress guards), but
it catches the one case where everything else fails: a credential ending
up in a tracked file or commit message and being pushed to a public
remote. For bot-bottle specifically, `gitleaks` is the clearest fit:
it is a Go binary under the MIT license, scans full history including
commit messages, runs fully offline, and integrates with the existing
`.githooks/` directory without adding a new runtime.

## Attack vector: a secret pushed to a public GitHub repo

The naive mental model is "if I notice the leak before the PR is
merged, I can force-push and the secret is gone." That mental model is
wrong in two distinct ways, and both apply *the instant* `git push`
completes, not at merge time.

### 1. The push itself publishes the secret

The moment a commit lands on a public GitHub repo (on any branch, in
any open PR, on any fork) it is publicly fetchable by URL and
broadcast by GitHub's event firehose:

- The commit is reachable at `github.com/<owner>/<repo>/commit/<sha>`
  and via the raw API. No login or merge required.
- GitHub's public events API streams every push as it happens. The
  GH Archive project mirrors that stream and republishes it; multiple
  Common-Crawl-style datasets ingest from it on a continuous basis.
- Independent scrapers (and there are *many*) sit on the events API
  watching specifically for commits whose diffs match high-value
  credential patterns: AWS keys, GitHub PATs, Stripe keys, OAuth
  tokens, OpenAI keys, etc. Empirically, observed time-to-abuse for a
  leaked AWS key on a public push is on the order of *seconds to
  minutes*. The window in which "I'll just force-push" still works
  effectively does not exist.
- Even after a force-push, the orphaned commit remains reachable by
  SHA until GitHub's garbage collection runs (and the SHA itself
  leaks via the events API). Treat any secret that has touched a
  public remote as burned: rotate it.

This is the dominant risk, and on its own it is enough to justify
pre-push secret scanning.
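To make the scraper behaviour concrete, here is a minimal sketch of pattern matching over the added lines of a unified diff. It is illustrative only: the two regexes are commonly cited shapes for AWS access key IDs and classic GitHub PATs, the names `PATTERNS`, `scan_added_lines`, and `SAMPLE_DIFF` are hypothetical, and real scanners such as gitleaks ship far larger rule sets. The sample key is the placeholder value AWS uses in its own documentation.

```python
import re

# Illustrative rules only (assumed shapes, not an exhaustive or official list).
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat-classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def scan_added_lines(diff_text):
    """Return (rule, line_number) pairs for lines a unified diff adds.

    line_number counts lines within diff_text itself, starting at 1.
    """
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only added lines matter; skip the "+++ b/file" header line.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((rule, number))
    return findings


# Hypothetical diff used for illustration.
SAMPLE_DIFF = """\
--- a/config.env
+++ b/config.env
@@ -1 +1,2 @@
 REGION=us-east-1
+AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
"""
```

Calling `scan_added_lines(SAMPLE_DIFF)` flags the added line. The same check is cheap for an attacker to run against every public push, which is why the scan has to happen before the push rather than after.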

### 2. Render-time callbacks fire even on un-merged PRs

A separate, sneakier vector: a public PR doesn't have to be *merged*
to cause an outbound request to a server the pusher controls. Opening
the PR is enough. Rendering surfaces along the review path will fetch
remote resources embedded in the diff or in PR metadata:

- **Markdown files in the diff** are rendered with image and link
  preview support. An image reference pointing at an attacker-controlled
  URL in a new README pings the attacker the moment a reviewer (or
  GitHub's own renderer warming a cache) views the file. GitHub
  proxies most image fetches through `camo.githubusercontent.com`,
  which mitigates direct IP capture but *does not stop the fetch
  itself*: the attacker still gets the request, and the URL path
  can carry exfiltrated bytes.
- **PR body, commit message, and issue body** all render markdown
  with the same image/link semantics. A poisoned commit message
  with an embedded image URL pings home when the PR is opened in the
  web UI, regardless of merge state.
- **Link unfurls** in PR descriptions, Slack notifications wired to
  the repo, and various CI bots all dereference URLs at preview time.
  Each is an independent "ping home" opportunity.
- **CI on PR open** runs on the fork's HEAD by default for many
  workflow types. Any `curl` in a build step is an exfil channel
  before a human ever looks at the PR.

The composite point: by the time a reviewer thinks "hmm, this looks
suspicious, let me close without merging," the bytes that mattered are
already on the attacker's box. Detection has to be at *commit* time
(or *push* time at the latest), not at review time.
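As an illustration of what commit-time detection of this vector could look like, the sketch below lists the remote hosts that inline Markdown images in a piece of text would cause a renderer to contact. It is a hedged example, not part of any tool named in this note: `IMAGE_RE` and `remote_image_hosts` are hypothetical names, the regex covers only the `![alt](url)` form, and HTML `<img>` tags or reference-style images would need separate handling.

```python
import re
from urllib.parse import urlparse

# Inline Markdown image: ![alt](url ...). Captures the URL up to the first
# whitespace or closing parenthesis. Assumed pattern, deliberately simple.
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def remote_image_hosts(markdown_text):
    """Return hostnames of http(s) images embedded in markdown_text."""
    hosts = []
    for url in IMAGE_RE.findall(markdown_text):
        parsed = urlparse(url)
        # Relative paths (images inside the repo) have no scheme; skip them.
        if parsed.scheme in ("http", "https") and parsed.hostname:
            hosts.append(parsed.hostname)
    return hosts
```

A hook could run this over staged Markdown files and the commit message and warn on any host outside a short allowlist. That is a design choice to evaluate separately from secret scanning.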

### Why this matters for bot-bottle

Two surfaces are exposed:

1. **The bot-bottle repo itself.** Development happens on a host
   with `BOT_BOTTLE_CLAUDE_OAUTH_TOKEN`, Gitea tokens, and other
   credentials in the environment. A fixture, test snapshot, log
   capture, or pasted-in debug output could carry one of them into a
   tracked file. The repo's Gitea remote is private, but mirrors or
   GitHub forks may not be.
2. **Bottles that push to external remotes.** An agent inside a bottle
   that has been granted git push credentials to a public repo is one
   prompt-injection away from being induced to commit a captured token
   and push it. The egress guard catches *most* exfiltration channels
   (`statsig`, `npm`, arbitrary HTTPS), but `git push` to an
   allowlisted code host is by design an outbound channel that *works*.
   That is the whole point of letting the agent push code. A secret
   scanner on the push path is the only layer that can intercept
   content before it leaves on an otherwise-legitimate channel.

The first surface is the primary one for this note. The second
*looks* harder: anything inside the bottle is by assumption
adversary-controlled, so an in-bottle hook can be bypassed with
`--no-verify`, removed outright, or simply not configured. But it
yields to the same trick already used for the OAuth token and HTTPS
inspection: move the check across a kernel-enforced process boundary
that the agent cannot reach.

### Solution for the bottle surface: a git gate

Route bottle-side `git push` through an intermediary git endpoint,
a "git gate", that runs gitleaks on the incoming objects before
forwarding to the real remote:

- The bottle's git remote is rewritten to point at the gate (e.g.
  `git@gate.bottle.local:owner/repo.git`, or an HTTPS endpoint on a
  loopback / sidecar address). The agent inside the bottle sees only
  the gate URL.
- The gate runs as a separate process / container that the bottle's
  unprivileged agent cannot read or modify. It holds the real push
  credential (deploy key, PAT, ssh agent socket); the bottle never
  sees it, same as the auth-injecting proxy for `ANTHROPIC_BASE_URL`
  in `agent-credential-proxy-landscape.md`.
- On receive, the gate runs `gitleaks detect` against the incoming
  refs (and their message text) in a temporary working tree. Clean
  pushes are forwarded to the real remote. Findings cause the push to
  be rejected; the agent sees a normal-looking `remote: rejected`
  message.
- Network egress out of the bottle is locked to the gate only.
  Direct pushes to `github.com`, `gitlab.com`, etc., fail at the
  egress layer, so the agent cannot bypass the gate by reconfiguring
  the remote URL. The egress guard already in `network-egress-guard.md`
  is the load-bearing control here.

Properties this gives:

- **`--no-verify` does not help the attacker.** The check runs on the
  *receiving* side, not in a client hook.
- **Bypass attempts fail closed.** Removing the remote, force-pushing,
  pushing to a different host, or shelling out around git all hit the
  same egress allowlist.
- **The push credential never enters the bottle.** Compromising the
  agent doesn't compromise the credential, only the ability to ask the
  gate to push on its behalf, and the gate refuses dirty pushes.
- **Pattern is reusable.** The gate is shape-compatible with the
  auth-injecting proxy and with pipelock's TLS interception: a
  sidecar that holds the credential, enforces a policy on traffic,
  and is unreachable by the unprivileged in-bottle UID.

Open questions deferred to design work, not blockers for this note:

- SSH vs HTTPS for the bottle→gate hop. SSH lets the agent use a
  per-bottle key the gate authenticates; HTTPS over loopback is
  simpler and pairs naturally with pipelock.
- Whether the gate should also enforce repo / branch allowlists per
  bottle, or stay narrowly focused on secret scanning. Probably
  narrow first, expand later.
- Performance on large pushes. gitleaks on a few-MB diff is sub-second;
  on a large monorepo first push it may take seconds. Acceptable.
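The receive-side check can be sketched as the decision logic of a git `pre-receive` hook on the gate. This is one possible shape under stated assumptions, not the gate's design: git feeds the hook one `<old> <new> <ref>` line per updated ref on stdin, the helper names (`revision_ranges`, `gitleaks_command`, `gate_decision`) are hypothetical, the all-zero object ID assumes a SHA-1 repository, and how gitleaks tokenizes a multi-word `--log-opts` value should be checked against the installed version.

```python
import subprocess

ZERO_SHA = "0" * 40  # git's "no object" id in SHA-1 repositories


def revision_ranges(pre_receive_input):
    """Map pre-receive stdin lines ("<old> <new> <ref>") to git-log ranges."""
    ranges = []
    for line in pre_receive_input.splitlines():
        if not line.strip():
            continue
        old, new, _ref = line.split(" ", 2)
        if new == ZERO_SHA:
            continue  # ref deletion: no new content to scan
        if old == ZERO_SHA:
            # New ref: scan only commits not reachable from existing refs.
            ranges.append(f"{new} --not --all")
        else:
            ranges.append(f"{old}..{new}")
    return ranges


def gitleaks_command(repo_path, log_range):
    """Build the scan command for one range (flags as used in this note)."""
    return ["gitleaks", "detect", "--source", repo_path,
            f"--log-opts={log_range}", "--redact"]


def gate_decision(pre_receive_input, repo_path=".", run=subprocess.run):
    """Return 0 to accept the push, 1 to reject it.

    Any non-zero scanner exit (findings or errors) rejects: fail closed.
    """
    for log_range in revision_ranges(pre_receive_input):
        if run(gitleaks_command(repo_path, log_range)).returncode != 0:
            return 1
    return 0
```

A hook entry point would call `sys.exit(gate_decision(sys.stdin.read()))`; git rejects the whole push when `pre-receive` exits non-zero. Treating scanner errors the same as findings keeps the gate fail-closed, matching the properties listed above.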

## Tool landscape (2026)

| Tool | Install | Full history | Commit msgs | Detection | Status | License |
|---|---|---|---|---|---|---|
| **gitleaks** | `brew install gitleaks` | yes (`--log-opts="--all"`) | yes | regex + entropy, ~150 rules | very active (v8.x) | MIT |
| **TruffleHog v3** | `brew`, Docker, GH Action | yes | yes | regex + **live API verification** (700+ types) | very active | AGPL-3.0 |
| **detect-secrets** (Yelp) | `pip install` | working tree + baseline | no | regex + entropy | active, slower cadence | Apache 2.0 |
| **git-secrets** (AWS Labs) | `brew` | yes (`--scan-history`) | yes | regex only, AWS-focused | maintenance mode since ~2021 | Apache 2.0 |
| **ggshield** (GitGuardian) | `pip install` | yes | yes | proprietary ML + regex, optional verified | very active, commercial | MIT CLI, SaaS engine |

Other names that surfaced but do not fit: Semgrep (general SAST, not
git-history-native), `whispers` (Python, low adoption).

Notes on the contenders:

- **gitleaks** is the de-facto open-source standard. The `protect
  --staged` mode is purpose-built for pre-commit; the `detect
  --log-opts="--all"` mode walks every commit including message text.
  No network dependency. Recent 8.x releases fold both modes into a
  `gitleaks git` command and keep `detect` and `protect` as deprecated
  aliases, so check the installed version before pinning hook commands.
- **TruffleHog v3** is the strongest one-shot auditor because it can
  make live API calls to confirm whether a found credential is still
  valid. AGPL is fine for internal use but a consideration if the
  detection step gets embedded in a redistributable artifact. Best as
  a "run once on history" tool, not as the everyday hook.
- **detect-secrets** is well-loved for its baseline-file workflow (you
  declare existing secrets as known and only fail on new ones), but
  commit-message scanning is not first-class.
- **git-secrets** is effectively superseded by gitleaks. No reason to
  start a new project on it.
- **ggshield** is good if you want a hosted dashboard and incident
  management, but it ships repo content to GitGuardian's API by
  default, which is the wrong fit for a project whose entire premise is
  sandbox isolation.
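Several rows in the table list "regex + entropy" as the detection method. As a rough sketch of the entropy half, the snippet below computes Shannon entropy per candidate token and keeps the tokens that look random. The token pattern, the 20-character minimum, and the 4.0-bit threshold are assumptions chosen for illustration, not the values any of these tools use.

```python
import math
import re
from collections import Counter

# Candidate tokens: long runs of characters typical of keys and tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/_-]{20,}")


def shannon_entropy(text):
    """Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_tokens(text, threshold=4.0):
    """Return long tokens whose per-character entropy meets the threshold."""
    return [t for t in TOKEN_RE.findall(text) if shannon_entropy(t) >= threshold]
```

Entropy alone is noisy (hashes, UUIDs, and minified assets also score high), which is why the tools above pair it with rule-specific regexes rather than using it by itself.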

## Commit message scanning

Several real-world leaks have come from secrets pasted into commit
messages (often by automation that includes captured CLI output in
auto-generated messages). gitleaks, TruffleHog, git-secrets, and
ggshield all scan message text in history mode. detect-secrets does
not; it is content-focused. Anyone using detect-secrets should pair
it with a separate message-scanning step.
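A separate message-scanning step could be as small as a `commit-msg` hook, which git invokes with the path of the message file as its first argument. The following is a minimal sketch under that assumption: `RULES`, `scan_message`, and `main` are hypothetical names, the two patterns are illustrative, and Python is used here only for readability (a few lines of shell would avoid adding a runtime).

```python
import re
import sys

# Illustrative rules only; a real setup would reuse the scanner's rule set.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_message(raw_message):
    """Return sorted names of rules matching the commit message text."""
    # Skip "#" comment lines, which git's default cleanup drops from the message.
    body = "\n".join(
        line for line in raw_message.splitlines() if not line.startswith("#")
    )
    return sorted(name for name, rx in RULES.items() if rx.search(body))


def main(argv):
    """Hook entry point: argv[1] is the commit message file path."""
    with open(argv[1], encoding="utf-8") as handle:
        hits = scan_message(handle.read())
    for name in hits:
        print(f"commit message matches rule: {name}", file=sys.stderr)
    return 1 if hits else 0  # non-zero aborts the commit
```

Installed as `.githooks/commit-msg`, the script would end with `sys.exit(main(sys.argv))`. Like any client-side hook, it can be skipped with `--no-verify`, so it complements a receive-side check rather than replacing one.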

## Recommended path forward

In priority order, for the host bot-bottle repo:

1. **One-time retro scan** with gitleaks:
   `gitleaks detect --source . --log-opts="--all" --redact`.
   Catches anything currently in history including commit messages.
   `--redact` keeps the secret values themselves out of the printed
   findings. Treat any hit as a credential to rotate, not as a "fix
   the commit" problem (see the attack-vector section above for why).
2. **Add a pre-commit hook in `.githooks/`** that runs
   `gitleaks protect --staged`. Fits the project's existing pattern
   (the conventional-commits `commit-msg` hook lives there), needs no
   new runtime, fully offline.
3. **Optional one-time live-verification pass** with TruffleHog:
   `trufflehog git file://. --only-verified`. Tells you whether any
   historically leaked credential is still active, which is actionable
   in a way pattern-matching alone is not.

For the bottle surface, the follow-up is the git-gate design sketched
above: a sidecar git endpoint that bottles push to, that runs
gitleaks on incoming refs before forwarding to the real remote, with
egress locked to the gate so the agent cannot push directly. Tracked
as a future PRD; ordering it after the network egress guard is
correct because the egress allowlist is the control that prevents
bypass and is therefore the prerequisite.
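For the host-repo steps, one detail worth deciding up front is what the hook does when the scanner is not installed. The sketch below wraps the two gitleaks commands from the list and fails closed in that case. It is an illustration under assumptions: `STAGED_SCAN`, `HISTORY_SCAN`, and `run_scan` are hypothetical names, the exit code 2 for a missing binary is an arbitrary choice, and Python stands in for what would likely be a short shell script in `.githooks/`.

```python
import shutil
import subprocess
import sys

# Commands taken from the steps above; --redact is added to the staged scan too.
STAGED_SCAN = ["gitleaks", "protect", "--staged", "--redact"]
HISTORY_SCAN = ["gitleaks", "detect", "--source", ".", "--log-opts=--all", "--redact"]


def run_scan(command, which=shutil.which, run=subprocess.run):
    """Return 0 to allow the commit, non-zero to block it."""
    if which(command[0]) is None:
        # Fail closed: a missing scanner must not silently skip the check.
        print(f"{command[0]} not found; refusing to skip the scan", file=sys.stderr)
        return 2
    # Propagate the scanner's exit status (non-zero on findings or errors).
    return run(command).returncode
```

A pre-commit hook would call `sys.exit(run_scan(STAGED_SCAN))`, and the one-time retro scan would use `HISTORY_SCAN` the same way. The `which` and `run` parameters exist only so the decision logic can be exercised without gitleaks installed.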

## Sources

- [gitleaks — GitHub repository](https://github.com/gitleaks/gitleaks)
- [TruffleHog — GitHub repository](https://github.com/trufflesecurity/trufflehog)
- [detect-secrets — GitHub repository](https://github.com/Yelp/detect-secrets)
- [git-secrets — AWS Labs GitHub repository](https://github.com/awslabs/git-secrets)
- [GitGuardian ggshield](https://github.com/GitGuardian/ggshield)
- [GH Archive — public GitHub event firehose mirror](https://www.gharchive.org/)
- [GitHub camo image proxy documentation](https://github.com/atmos/camo)
- [Truffle Security — "How fast are leaked secrets abused?"](https://trufflesecurity.com/blog/)