Assisted-by: Codex
Git secret scanning as further hardening
Research into whether bot-bottle should add a secret-scanning step to
its git workflow — both on the host repo and (potentially) inside
bottles — and what tools exist for it. Motivated by the threat model
below: a secret accidentally git pushed to a public remote is
already compromised the moment the push hits the wire, well before
anyone clicks "merge."
Summary
A pre-commit / pre-push secret scanner is a cheap, high-leverage layer
of defense-in-depth that doesn't replace any existing control
(.gitignore, environment-variable hygiene, network egress guards) but
catches the one case where everything else fails: a credential ending
up in a tracked file or commit message and being pushed to a public
remote. For bot-bottle specifically, gitleaks is the clearest fit
— Go binary, MIT, scans full history including commit messages, runs
fully offline, and integrates with the existing .githooks/ directory
without adding a new runtime.
Attack vector: a secret pushed to a public GitHub repo
The naive mental model is "if I notice the leak before the PR is
merged, I can force-push and the secret is gone." That mental model is
wrong in two distinct ways, and both apply the instant git push
completes — not at merge time.
1. The push itself publishes the secret
The moment a commit lands on a public GitHub repo — on any branch, in any open PR, on any fork — it is publicly fetchable by URL and broadcast by GitHub's event firehose:
- The commit blob is reachable at `github.com/<owner>/<repo>/commit/<sha>` and via the raw API. No login or merge required.
- GitHub's public events API streams every push as it happens. The GH Archive project mirrors that stream and republishes it; multiple Common-Crawl-style datasets ingest from it on a continuous basis.
- Independent scrapers (and there are many) sit on the events API watching specifically for commits whose diffs match high-value credential patterns — AWS keys, GitHub PATs, Stripe keys, OAuth tokens, OpenAI keys, etc. Empirically, observed time-to-abuse for a leaked AWS key on a public push is on the order of seconds to minutes. The window in which "I'll just force-push" still works effectively does not exist.
- Even after a force-push, the orphaned commit remains reachable by SHA until GitHub's garbage collection runs (and the SHA itself leaks via the events API). Treat any secret that has touched a public remote as burned: rotate it.
This is the dominant risk, and on its own it is enough to justify pre-push secret scanning.
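As a rough illustration of why the window is so short, the matching a scraper needs to do on each pushed diff is only a few lines of code. The sketch below is hypothetical: the function name and rule names are made up for this note, and the two patterns (AWS access key IDs and classic GitHub personal access tokens) are simplified versions of publicly documented formats, not rules copied from any real tool.

```python
import re

# Simplified, illustrative patterns; real scanners ship far more rules.
CREDENTIAL_PATTERNS = {
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github-pat-classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_candidate_secrets(diff_text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_text) pairs found on added lines of a unified diff."""
    hits = []
    for line in diff_text.splitlines():
        # Only added lines matter; skip the "+++ b/<file>" header line.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule, pattern in CREDENTIAL_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((rule, match.group(0)))
    return hits
```

The same matching that makes abuse fast is what a pre-push scanner runs locally, before the bytes leave the machine.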
2. Render-time callbacks fire even on un-merged PRs
A separate, sneakier vector: a public PR doesn't have to be merged to cause an outbound request to a server the pusher controls. Opening the PR is enough. Rendering surfaces along the review path will fetch remote resources embedded in the diff or in PR metadata:
- Markdown files in the diff are rendered with image and link preview support. An embedded remote image in a new README pings the attacker the moment a reviewer (or GitHub's own renderer warming a cache) views the file. GitHub proxies most image fetches through `camo.githubusercontent.com`, which mitigates direct IP capture but does not stop the fetch itself: the attacker still gets the request, and the URL path can carry exfiltrated bytes.
- PR body, commit message, and issue body all render markdown with the same image/link semantics. A poisoned commit message with an embedded image URL pings home when the PR is opened in the web UI, regardless of merge state.
- Link unfurls in PR descriptions, Slack notifications wired to the repo, and various CI bots all dereference URLs at preview time. Each is an independent "ping home" opportunity.
- CI on PR open runs on the fork's HEAD by default for many workflow types. Any `curl` in a build step is an exfil channel before a human ever looks at the PR.
The composite point: by the time a reviewer thinks "hmm, this looks suspicious, let me close without merging," the bytes that mattered are already on the attacker's box. Detection has to be at commit time (or push time at the latest), not at review time.
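One way to catch the render-time vector at commit time is to list every remote image reference in the markdown that is about to be committed (files, commit message, PR body text). The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the two regular expressions cover only inline markdown images and plain HTML `img` tags, not reference-style images or other embed forms.

```python
import re
from urllib.parse import urlparse

# Inline markdown image: ![alt](url "optional title")
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*(<[^>]+>|[^)\s]+)")
# Raw HTML image tag, which markdown renderers commonly pass through.
HTML_IMAGE = re.compile(r"<img\b[^>]*\bsrc\s*=\s*[\"']([^\"']+)[\"']", re.IGNORECASE)


def remote_image_urls(markdown_text: str) -> list[str]:
    """Return http(s) image URLs; relative paths inside the repo are ignored."""
    urls = []
    for pattern in (MD_IMAGE, HTML_IMAGE):
        for match in pattern.finditer(markdown_text):
            url = match.group(1).strip("<>")
            if urlparse(url).scheme in ("http", "https"):
                urls.append(url)
    return urls
```

A hook could print these URLs for review or reject hosts outside an approved list; which of the two is appropriate is a policy choice this sketch does not make.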
Why this matters for bot-bottle
Two surfaces are exposed:
- The bot-bottle repo itself. Development happens on a host with `BOT_BOTTLE_OAUTH_TOKEN`, Gitea tokens, and other credentials in the environment. A fixture, test snapshot, log capture, or pasted-in debug output could carry one of them into a tracked file. The repo's Gitea remote is private, but mirrors or GitHub forks may not be.
- Bottles that push to external remotes. An agent inside a bottle that has been granted git push credentials to a public repo is one prompt-injection away from being induced to commit a captured token and push it. The egress guard catches most exfiltration channels (`statsig`, `npm`, arbitrary HTTPS), but `git push` to an allowlisted code host is by design an outbound channel that works. That is the whole point of letting the agent push code. A secret scanner inside the bottle's git hooks is the only layer that can intercept content before it leaves on an otherwise-legitimate channel.
The first surface is the primary one for this note. The second
looks harder — anything inside the bottle is by assumption
adversary-controlled, so an in-bottle hook can be bypassed with
--no-verify, removed outright, or simply not configured — but it
yields to the same trick already used for the OAuth token and HTTPS
inspection: move the check across a kernel-enforced process boundary
that the agent cannot reach.
Solution for the bottle surface: a git gate
Route bottle-side git push through an intermediary git endpoint —
a "git gate" — that runs gitleaks on the incoming objects before
forwarding to the real remote:
- The bottle's git remote is rewritten to point at the gate (e.g. `git@gate.bottle.local:owner/repo.git`, or an HTTPS endpoint on a loopback / sidecar address). The agent inside the bottle sees only the gate URL.
- The gate runs as a separate process / container that the bottle's unprivileged agent cannot read or modify. It holds the real push credential (deploy key, PAT, ssh agent socket); the bottle never sees it, same as the auth-injecting proxy for `ANTHROPIC_BASE_URL` in `agent-credential-proxy-landscape.md`.
- On receive, the gate runs `gitleaks detect` against the incoming refs (and their message text) in a temporary working tree. Clean pushes are forwarded to the real remote. Findings cause the push to be rejected; the agent sees a normal-looking `remote: rejected` message.
- Network egress out of the bottle is locked to the gate only. Direct pushes to `github.com`, `gitlab.com`, etc., fail at the egress layer, so the agent cannot bypass the gate by reconfiguring the remote URL. The egress guard already in `network-egress-guard.md` is the load-bearing control here.
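The receive step can be sketched as a git `pre-receive` hook on the gate. This is an illustration, not the gate's design: it assumes a SHA-1 repository (40-character object ids), assumes gitleaks is on the gate's `PATH`, and uses hypothetical function names. Git feeds the hook one `<old> <new> <ref>` line per updated ref on stdin; the sketch turns each line into a `git log` range and hands it to gitleaks through `--log-opts`.

```python
import subprocess
import sys
from typing import Optional

ZERO_OID = "0" * 40  # all-zero id marks ref creation or deletion (SHA-1 repos)


def log_opts_for_update(old: str, new: str) -> Optional[str]:
    """Return the git-log range to scan for one ref update, or None for a deletion."""
    if new == ZERO_OID:
        return None  # ref deleted: no new objects to scan
    if old == ZERO_OID:
        # New ref: scan commits reachable from it that no existing ref already has.
        return f"{new} --not --all"
    return f"{old}..{new}"


def gitleaks_command(log_opts: str) -> list[str]:
    return ["gitleaks", "detect", "--source", ".", "--redact", f"--log-opts={log_opts}"]


def main() -> int:
    for line in sys.stdin:
        if not line.strip():
            continue
        old, new, ref = line.split()
        log_opts = log_opts_for_update(old, new)
        if log_opts is None:
            continue
        # Any non-zero exit (findings or a scanner error) rejects the push: fail closed.
        if subprocess.run(gitleaks_command(log_opts)).returncode != 0:
            print(f"rejected {ref}: secret scan did not pass", file=sys.stderr)
            return 1
    return 0

# A real hook script would end with: sys.exit(main())
```

Rejecting on any non-zero exit code, rather than only on the "findings" code, means a crashed or missing scanner blocks the push instead of silently allowing it.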
Properties this gives:
- `--no-verify` does not help the attacker: the check runs on the receiving side, not in a client hook.
- Bypass attempts fail closed. Removing the remote, force-pushing, pushing to a different host, or shelling out around git all hit the same egress allowlist.
- The push credential never enters the bottle. Compromising the agent doesn't compromise the credential, only the ability to ask the gate to push on its behalf — and the gate refuses dirty pushes.
- Pattern is reusable. The gate is shape-compatible with the auth-injecting proxy and with pipelock's TLS interception: a sidecar that holds the credential, enforces a policy on traffic, and is unreachable by the unprivileged in-bottle UID.
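The fail-closed property comes down to a default-deny decision rule. The real control lives at the network layer in the egress guard, not in URL parsing; the sketch below only illustrates the rule itself. The host `gate.bottle.local` is the example name used earlier in this note, and the function names are hypothetical.

```python
from urllib.parse import urlparse

GATE_HOSTS = frozenset({"gate.bottle.local"})  # example gate hostname, an assumption


def _host_of(remote_url: str):
    """Extract the host from a URL-style or scp-style git remote, else None."""
    if "://" in remote_url:
        return urlparse(remote_url).hostname
    # scp-like syntax: [user@]host:path
    head, sep, _path = remote_url.partition(":")
    if not sep or "/" in head:
        return None  # local path or unrecognised form
    return head.rsplit("@", 1)[-1] or None


def push_destination_allowed(remote_url: str) -> bool:
    """Default deny: allow only remotes whose host is exactly a known gate host."""
    host = _host_of(remote_url)
    return host is not None and host.lower() in GATE_HOSTS
```

Anything the parser does not recognise is denied, and the match is on the exact hostname, so a lookalike such as `gate.bottle.local.evil.example` is rejected.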
Open questions deferred to design work, not blockers for this note:
- SSH vs HTTPS for the bottle→gate hop. SSH lets the agent use a per-bottle key the gate authenticates; HTTPS over loopback is simpler and pairs naturally with pipelock.
- Whether the gate should also enforce repo / branch allowlists per bottle, or stay narrowly focused on secret scanning. Probably narrow first, expand later.
- Performance on large pushes. gitleaks on a few-MB diff is sub-second; on a large monorepo first push it may take seconds. Acceptable.
Tool landscape (2026)
| Tool | Install | Full history | Commit msgs | Detection | Status | License |
|---|---|---|---|---|---|---|
| gitleaks | brew install gitleaks | yes (--log-opts="--all") | yes | regex + entropy, ~150 rules | very active (v8.x) | MIT |
| TruffleHog v3 | brew, Docker, GH Action | yes | yes | regex + live API verification (700+ types) | very active | AGPL-3.0 |
| detect-secrets (Yelp) | pip install | working tree + baseline | no | regex + entropy | active, slower cadence | Apache 2.0 |
| git-secrets (AWS Labs) | brew | yes (--scan-history) | yes | regex only, AWS-focused | maintenance mode since ~2021 | Apache 2.0 |
| ggshield (GitGuardian) | pip install | yes | yes | proprietary ML + regex, optional verified | very active, commercial | MIT CLI, SaaS engine |
Other names that surfaced but do not fit: Semgrep (general SAST, not
git-history-native), whispers (Python, low adoption).
Notes on the contenders:
- gitleaks is the de-facto open-source standard. The `protect --staged` mode is purpose-built for pre-commit; the `detect --log-opts="--all"` mode walks every commit including message text. No network dependency.
- TruffleHog v3 is the strongest one-shot auditor because it can make live API calls to confirm whether a found credential is still valid. AGPL is fine for internal use but a consideration if the detection step gets embedded in a redistributable artifact. Best as a "run once on history" tool, not as the everyday hook.
- detect-secrets is well-loved for its baseline-file workflow (you declare existing secrets as known and only fail on new ones), but commit-message scanning is not first-class.
- git-secrets is effectively superseded by gitleaks. No reason to start a new project on it.
- ggshield is good if you want a hosted dashboard and incident management, but it ships repo content to GitGuardian's API by default — wrong fit for a project whose entire premise is sandbox isolation.
Commit message scanning
Several real-world leaks have come from secrets pasted into commit messages (often by automation that includes captured CLI output in auto-generated messages). gitleaks, TruffleHog, git-secrets, and ggshield all scan message text in history mode. detect-secrets does not — it is content-focused. Anyone using detect-secrets should pair it with a separate message-scanning step.
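For a setup that needs a separate message-scanning step, an entropy check over the commit message is one simple option. The sketch below is a minimal illustration of the "entropy" half of "regex + entropy" detection, not any tool's actual algorithm: the token pattern, the 20-character minimum, and the 4.0 bits-per-character threshold are assumptions that would need tuning against real messages.

```python
import math
import re
from collections import Counter

# Long unbroken runs of key-like characters are worth a closer look.
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def suspicious_tokens(message: str, threshold: float = 4.0) -> list[str]:
    """Return long tokens whose character entropy meets the threshold."""
    return [t for t in TOKEN.findall(message) if shannon_entropy(t) >= threshold]
```

A `commit-msg` hook receives the path of the message file as its first argument, so it could read that file, call `suspicious_tokens`, and exit non-zero when the list is not empty.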
Recommended path forward
In priority order, for the host bot-bottle repo:
- One-time retro scan with gitleaks: `gitleaks detect --source . --log-opts="--all" --redact`. Catches anything currently in history including commit messages. `--redact` keeps the matched secret values out of the rendered output. Treat any hit as a credential to rotate, not as a "fix the commit" problem (see attack-vector section above for why).
- Add a pre-commit hook in `.githooks/` that runs `gitleaks protect --staged`. Fits the project's existing pattern (the conventional-commits `commit-msg` hook lives there), needs no new runtime, fully offline.
- Optional one-time live-verification pass with TruffleHog: `trufflehog git file://. --only-verified`. Tells you whether any historically leaked credential is still active, which is actionable in a way pattern-matching alone is not.
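The pre-commit step above can be sketched as a small hook script. It is written in Python only to match the other sketches in this note; a one-line shell script calling the same command would serve equally well and avoids depending on a Python interpreter. The function names are hypothetical, and the sketch assumes the repo already points git at the hook directory (for example via `core.hooksPath`).

```python
import shutil
import subprocess
import sys


def precommit_command() -> list[str]:
    # Same invocation as recommended above; --redact keeps secret values
    # out of the terminal output.
    return ["gitleaks", "protect", "--staged", "--redact"]


def run_hook() -> int:
    """Return the exit code git should see: non-zero blocks the commit."""
    if shutil.which("gitleaks") is None:
        # Fail closed: a missing scanner should not silently allow commits.
        print("gitleaks not found on PATH; refusing to commit unscanned changes",
              file=sys.stderr)
        return 1
    return subprocess.run(precommit_command()).returncode

# Saved as an executable .githooks/pre-commit, the script would end with:
# sys.exit(run_hook())
```

Recent gitleaks 8.x releases appear to deprecate `detect` and `protect` in favour of a `gitleaks git` subcommand, so the exact invocation is worth checking against the installed version before wiring the hook in.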
For the bottle surface, the follow-up is the git-gate design sketched above: a sidecar git endpoint that bottles push to, that runs gitleaks on incoming refs before forwarding to the real remote, with egress locked to the gate so the agent cannot push directly. Tracked as a future PRD; ordering it after the network egress guard is correct because the egress allowlist is the control that prevents bypass and is therefore the prerequisite.
Sources
- gitleaks — GitHub repository
- TruffleHog — GitHub repository
- detect-secrets — GitHub repository
- git-secrets — AWS Labs GitHub repository
- GitGuardian ggshield
- GH Archive — public GitHub event firehose mirror
- GitHub camo image proxy documentation
- Truffle Security — "How fast are leaked secrets abused?"