Strengthen outbound exfil detection: canaries, broadened known-value set, fragmentation-resistant matching #259
Context
Outbound exfiltration detection in bot_bottle/dlp_detectors.py currently has two tiers:

- scan_token_patterns: 15 vendor token regexes (TOKEN_PATTERNS). This is the weak tier: format-matching is inherently evadable by an adaptive adversary, and adding patterns is whack-a-mole.
- scan_known_secrets: matches the actual provisioned secret value plus 8 encoded variants (base64 ±pad, urlsafe b64, percent-encode, hex upper/lower, base32, gzip+b64). This is the strong tier and the real edge: we control the env, so we know the exact secret bytes.

This issue tracks strengthening the strong tier and adding complementary signals. The goal is not a better regex.
The honest ceiling
Against a fully adaptive adversary holding its own encryption key, content-based detection of a secret the agent legitimately possesses is unwinnable (AES-encrypted ciphertext is indistinguishable from random). So we build for what we can win:
Improvements, ranked by leverage
1. Canary / honeytokens (highest leverage; zero false positives)
Inject a unique fake secret with no legitimate use into every bottle's env (a marked string / fake AKIA…). Any outbound appearance is a high-confidence exfil signal with no FP risk, and it reuses the existing exact-match machinery. Flips detection from "guess if this blob is a secret" to "we planted a tracer; did it move." Build first.

2. Broaden the known-value set beyond EGRESS_TOKEN_*

Concrete gap: scan_known_secrets only iterates key.startswith("EGRESS_TOKEN_") (dlp_detectors.py:161-162). Any other sensitive value reaching the agent env (forwarded host credential, provider auth_token, MCP server key) is not in the known-value set and is only maybe caught by the weak regex tier. Feed every sensitive env value (with a skip-allowlist of non-secret keys) into the matcher.

3. Harden value-matching against fragmentation / separator injection

Three evasions defeat the current text.find(variant) / regex search:

- Separator injection: ghp_aaa -> ghp_aa a (space/hyphen/zero-width). NFKD strips combining marks but not literal ASCII separators. Fix: also match against an alnum-only projection of both secret and text.

4. Entropy / anomaly scoring (complement for unknown + encrypted blobs)
Shannon entropy over high-entropy substrings; flag base64/hex-ish blobs above ~4.5 bits/char that don't match a known-good shape. Only thing that flags ciphertext (encrypted exfil) and unknown secrets. FP-prone (UUIDs, minified JS, image data) -> make it a warn/score signal combined with destination + body size, never a standalone hard block.
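A minimal, stdlib-only sketch of the entropy signal described above. The names (ENTROPY_THRESHOLD, MIN_RUN, BLOB_RE, shannon_entropy, high_entropy_blobs) and the 32-character minimum run length are assumptions for illustration, not taken from dlp_detectors.py; the 4.5 threshold is the figure quoted above. It returns scores only, leaving the warn/block decision to the caller.

```python
import math
import re
from collections import Counter

# Threshold from the ~4.5 bits/char figure above; tune against real traffic.
ENTROPY_THRESHOLD = 4.5
# Hypothetical minimum run length: short strings give unreliable estimates.
MIN_RUN = 32
# Runs of characters from the base64 / urlsafe-base64 / hex alphabets.
BLOB_RE = re.compile(r"[A-Za-z0-9+/=_-]{%d,}" % MIN_RUN)


def shannon_entropy(s: str) -> float:
    """Empirical Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def high_entropy_blobs(text: str) -> list[tuple[str, float]]:
    """Return (blob, entropy) for each base64/hex-ish run above the threshold."""
    hits = []
    for match in BLOB_RE.finditer(text):
        blob = match.group()
        score = shannon_entropy(blob)
        if score > ENTROPY_THRESHOLD:
            hits.append((blob, score))
    return hits
```

One consequence worth noting: a pure hex blob uses 16 symbols, so its per-character entropy can never exceed log2(16) = 4 bits. A single 4.5 threshold therefore only fires on base64-like alphabets; hex blobs would need a separate, lower threshold or a length-based rule.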
5. Close the binary / multipart blind spot
scan_outbound does body.decode("utf-8", errors="replace") (egress_addon_core.py:709). A file upload (secret in EXIF, a zip, protobuf) gets lossily mangled and exact-matching silently breaks. Detect binary/multipart content-types and either scan raw bytes for the secret/canary, or block-by-default large binary uploads to hosts that don't need them.

What not to do
Suggested first cut
Canary injection + the Rabin-Karp partial-match detector, both behind the existing ScanResult interface and within the sidecar's stdlib-only stance. Then broaden the known-value set (#2), then entropy scoring (#4) as warn-only.

Filed from a design discussion on improving token-exfiltration detection. Frames the regex tier as the weak backstop and the known-value/canary tier as the winnable game; see the egress DLP modules (dlp_detectors.py, egress_addon_core.py) and PRD 0052/0053/0056.

Empirical grounding for the ranking
A Jan 2026 large-scale study (Malicious Agent Skills in the Wild, arXiv:2602.06547) measured the real-world exfil distribution across a 98,380-skill snapshot. Confirmed malicious: 157 skills, 632 vulnerabilities.
What this locks in
The measured distribution confirms the proposed ordering — do these first:
EGRESS_TOKEN_*: exact-value matching defeats the ~50% doing plaintext transmission.

Defer (future-proofing, not day-one):
Net: the data says naive exfil dominates, so the FP-free exact-match tier (canaries + known values) plus the existing allowlist is the high-ROI work; the encryption/fragmentation arms-race tier stays deferred.
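As a rough illustration of the exact-match tier named above (canaries plus a broadened known-value set), a stdlib-only sketch might look like the following. Every name here (CANARY_KEY, NON_SECRET_KEYS, make_canary, known_values, variants, scan_body) is hypothetical and not taken from dlp_detectors.py, and only three encodings are shown rather than the full variant set.

```python
import base64
import secrets

# Hypothetical env key under which the planted canary is stored.
CANARY_KEY = "CANARY_TOKEN"
# Hypothetical skip-allowlist of env keys that never hold secrets.
NON_SECRET_KEYS = {"PATH", "HOME", "LANG", "TERM"}


def make_canary() -> str:
    """Unique marked fake secret with no legitimate use."""
    return "canary_" + secrets.token_hex(16)


def known_values(env: dict[str, str], min_len: int = 8) -> dict[str, str]:
    """All env values except skip-allowlisted keys and very short values."""
    return {
        key: value
        for key, value in env.items()
        if key not in NON_SECRET_KEYS and len(value) >= min_len
    }


def variants(value: str) -> set[str]:
    """Plaintext plus two encodings; a real matcher would cover more."""
    raw = value.encode()
    return {value, base64.b64encode(raw).decode(), raw.hex()}


def scan_body(body: str, env: dict[str, str]) -> list[str]:
    """Return env keys whose value, or an encoded variant, appears in body."""
    return [
        key
        for key, value in known_values(env).items()
        if any(variant in body for variant in variants(value))
    ]
```

Because the canary has no legitimate use, a hit on CANARY_KEY can be treated as high-confidence, while hits on other keys still benefit from knowing the exact bytes rather than guessing at a format.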