Strengthen outbound exfil detection: canaries, broadened known-value set, fragmentation-resistant matching #259

Closed
opened 2026-06-24 09:21:39 -04:00 by didericis-claude · 1 comment
Collaborator

Context

Outbound exfiltration detection in bot_bottle/dlp_detectors.py currently has two tiers:

  • scan_token_patterns — 15 vendor token regexes (TOKEN_PATTERNS). This is the weak tier: format-matching is inherently evadable by an adaptive adversary, and adding patterns is whack-a-mole.
  • scan_known_secrets — matches the actual provisioned secret value plus 8 encoded variants (base64 ±pad, urlsafe b64, percent-encode, hex upper/lower, base32, gzip+b64). This is the strong tier and the real edge: we control the env, so we know the exact secret bytes.
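
As a rough illustration of the strong tier, the sketch below derives the eight encoded variants listed above and checks outbound text for any of them. It is not the code in dlp_detectors.py: the function names and variant labels are hypothetical, and the gzip variant assumes the sender used the same gzip header fields (mtime=0 here).

```python
import base64
import gzip
import urllib.parse
from typing import Optional


def encoded_variants(secret: str) -> dict:
    """Return the raw secret plus eight encoded forms (labels are illustrative)."""
    raw = secret.encode("utf-8")
    b64 = base64.b64encode(raw).decode("ascii")
    return {
        "raw": secret,
        "b64": b64,
        "b64_nopad": b64.rstrip("="),
        "b64_urlsafe": base64.urlsafe_b64encode(raw).decode("ascii"),
        "percent": urllib.parse.quote(secret, safe=""),
        "hex_lower": raw.hex(),
        "hex_upper": raw.hex().upper(),
        "base32": base64.b32encode(raw).decode("ascii"),
        # mtime=0 keeps the gzip header stable across calls on one machine.
        "gzip_b64": base64.b64encode(gzip.compress(raw, mtime=0)).decode("ascii"),
    }


def find_known_secret(text: str, secret: str) -> Optional[str]:
    """Return the label of the first variant found in text, or None."""
    for label, variant in encoded_variants(secret).items():
        if variant and variant in text:
            return label
    return None
```

Because the match is on the exact provisioned bytes, a hit does not depend on the secret following any vendor token format.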

This issue tracks strengthening the strong tier and adding complementary signals. The goal is not a better regex.

The honest ceiling

Against a fully adaptive adversary holding its own encryption key, content-based detection of a secret the agent legitimately possesses is unwinnable (AES-encrypted ciphertext is indistinguishable from random). So we build for what we can win:

  1. Prevention — keep secrets out of the agent env in the first place (architecture already does this; known-secret scan is the backstop).
  2. The destination allowlist is the primary control — even perfectly-encrypted exfil must leave to an allowlisted host. Content scanning is defense-in-depth on top of the allowlist, not the main event.
  3. Catch the naive 90% — prompt-injected agents and lazy leaks, which is what real incidents look like.

Improvements, ranked by leverage

1. Canary / honeytokens (highest leverage; zero false positives)

Inject a unique fake secret with no legitimate use into every bottle's env (a marked string / fake AKIA…). Any outbound appearance is a high-confidence exfil signal with no FP risk, and it reuses the existing exact-match machinery. Flips detection from "guess if this blob is a secret" to "we planted a tracer; did it move." Build first.
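
A minimal sketch of the canary idea, under stated assumptions: the env key name, the registry dict, and all function names are hypothetical, and the AKIA prefix only mimics the shape of an AWS access key ID. A real integration would presumably hand the canary to the existing exact-match machinery rather than use a bare substring test.

```python
import secrets
from typing import Optional

# Hypothetical env key; any plausible-looking name with no legitimate use works.
CANARY_ENV_KEY = "AWS_ACCESS_KEY_ID_BACKUP"
_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def make_canary() -> str:
    """Build a unique fake credential: 'AKIA' plus 16 random characters."""
    return "AKIA" + "".join(secrets.choice(_ALPHABET) for _ in range(16))


def inject_canary(env: dict, registry: dict, bottle_id: str) -> str:
    """Plant a canary in one bottle's env and remember which bottle owns it."""
    canary = make_canary()
    env[CANARY_ENV_KEY] = canary
    registry[canary] = bottle_id
    return canary


def scan_for_canary(text: str, registry: dict) -> Optional[str]:
    """Return the owning bottle id if any planted canary appears in text."""
    for canary, bottle_id in registry.items():
        if canary in text:
            return bottle_id
    return None
```

Since the canary has no legitimate consumer, any hit can be treated as a hard signal, and a per-bottle value also identifies which bottle leaked.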

2. Broaden the known-value set beyond EGRESS_TOKEN_*

Concrete gap: scan_known_secrets only iterates keys matching key.startswith("EGRESS_TOKEN_") (dlp_detectors.py:161-162). Any other sensitive value reaching the agent env (forwarded host credential, provider auth_token, MCP server key) is not in the known-value set, so it is caught only if it happens to match a pattern in the weak regex tier. Feed every sensitive env value (with a skip-allowlist of non-secret keys) into the matcher.
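
One way the broadened set could be collected, as a sketch only: the skip-allowlist contents, the minimum-length cutoff, and the function name are assumptions, not values from the codebase. The length floor is there so short values such as "1" do not match nearly every request.

```python
# Hypothetical skip-allowlist of env keys that never hold secrets.
NON_SECRET_KEYS = frozenset({"PATH", "HOME", "LANG", "TERM", "PWD", "SHELL", "USER"})
# Assumed floor: very short values would match almost any outbound body.
MIN_SECRET_LEN = 8


def collect_known_values(env: dict) -> dict:
    """Return every env value worth exact-matching, keyed by env name."""
    known = {}
    for key, value in env.items():
        if key in NON_SECRET_KEYS:
            continue
        if len(value) < MIN_SECRET_LEN:
            continue
        known[key] = value
    return known
```

Defaulting to "treat as secret unless allowlisted" means a newly forwarded credential is covered without anyone remembering to register it.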

3. Harden value-matching against fragmentation / separator injection

Three evasions defeat the current text.find(variant) / regex search:

  • Separator injection: ghp_aaa -> ghp_aa a (space/hyphen/zero-width). NFKD strips combining marks but not literal ASCII separators. Fix: also match against an alnum-only projection of both secret and text.
  • Chunking within a request: add partial-substring matching (flag any contiguous run of >= ~20 chars of the secret). Implement with Rabin-Karp over the secret's k-grams: precompute the set of k-gram hashes per secret, slide a window over outbound text, and flag on a hash hit after confirming the k-gram itself to rule out collisions. Expected O(n) in the outbound text length, and hashing the projection also covers the separator case.
  • Split across requests: keep a small bounded rolling buffer of recent outbound bodies per host and scan the join.
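
The three fixes above can be sketched together as follows. This is an illustration under assumptions, not the planned implementation: K, BASE, MOD, the HostBuffer class, and all function names are hypothetical, and the buffer is bounded by body count rather than bytes for brevity.

```python
from collections import deque

K = 20                    # assumed k-gram length (the ">= ~20 chars" above)
BASE = 257
MOD = (1 << 61) - 1       # large prime modulus for the polynomial hash


def alnum_projection(text: str) -> str:
    """Drop every non-alphanumeric character (separators, zero-width, etc.)."""
    return "".join(ch for ch in text if ch.isalnum())


def rolling_hashes(s: str, k: int):
    """Yield (start, hash) for each k-gram of s using a rolling polynomial hash."""
    if len(s) < k:
        return
    high = pow(BASE, k - 1, MOD)
    h = 0
    for ch in s[:k]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(k, len(s)):
        # Remove the outgoing character, shift, add the incoming one.
        h = ((h - ord(s[i - k]) * high) * BASE + ord(s[i])) % MOD
        yield i - k + 1, h


def build_index(secret: str, k: int = K) -> dict:
    """Map k-gram hash -> set of k-grams from the secret's projection."""
    proj = alnum_projection(secret)
    index = {}
    for start, h in rolling_hashes(proj, k):
        index.setdefault(h, set()).add(proj[start:start + k])
    return index


def contains_fragment(text: str, index: dict, k: int = K) -> bool:
    """True if text contains any k-gram of the secret, ignoring separators."""
    proj = alnum_projection(text)
    for start, h in rolling_hashes(proj, k):
        candidates = index.get(h)
        # Confirm the actual k-gram so a hash collision cannot cause a hit.
        if candidates and proj[start:start + k] in candidates:
            return True
    return False


class HostBuffer:
    """Keep the last few outbound bodies per host and return their join."""

    def __init__(self, max_bodies: int = 8):
        self._bodies = {}
        self._max = max_bodies

    def push_and_join(self, host: str, body: str) -> str:
        buf = self._bodies.setdefault(host, deque(maxlen=self._max))
        buf.append(body)
        return "".join(buf)
```

Secrets whose projection is shorter than K produce an empty index, so they remain covered only by the exact-match tier.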

4. Entropy / anomaly scoring (complement for unknown + encrypted blobs)

Compute Shannon entropy over candidate base64/hex-like substrings and flag blobs above ~4.5 bits/char that don't match a known-good shape. Hex tops out at 4 bits/char (16 symbols), so hex blobs need a lower, alphabet-specific threshold. This is the only signal here that flags ciphertext (encrypted exfil) and unknown secrets. It is FP-prone (UUIDs, minified JS, image data), so make it a warn/score signal combined with destination + body size, never a standalone hard block.
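
A small sketch of the scoring step, assuming a 32-character minimum blob length and a UUID check as the only "known-good shape"; both choices, the regexes, and the function names are illustrative. The result is a list of scored findings meant to feed a warn signal, not a block decision.

```python
import math
import re
from collections import Counter

# Assumed candidate shape: runs of 32+ base64/hex-like characters.
BLOB_RE = re.compile(r"[A-Za-z0-9+/=_-]{32,}")
# One example of a known-good shape to skip.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of s in bits per character."""
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_findings(text: str, threshold: float = 4.5) -> list:
    """Return (blob, entropy) pairs for candidate blobs at or above threshold."""
    findings = []
    for match in BLOB_RE.finditer(text):
        blob = match.group(0)
        if UUID_RE.fullmatch(blob):
            continue
        score = shannon_entropy(blob)
        if score >= threshold:
            findings.append((blob, score))
    return findings
```

A caller would combine each score with destination and body size before deciding whether to warn.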

5. Close the binary / multipart blind spot

scan_outbound does body.decode("utf-8", errors="replace") (egress_addon_core.py:709). A file upload (secret in EXIF, a zip, protobuf) gets lossily mangled and exact-matching silently breaks. Detect binary/multipart content-types and either scan raw bytes for the secret/canary, or block-by-default large binary uploads to hosts that don't need them.
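
The raw-bytes option could look roughly like the sketch below. The textual content-type set, the list of encodings tried, and the function names are assumptions, and this does not look inside compressed containers such as zip files.

```python
# Assumed set of non-text/* content types that are still safe to decode as text.
TEXTUAL_TYPES = frozenset({
    "application/json",
    "application/xml",
    "application/x-www-form-urlencoded",
})
# Byte encodings of the secret to search for in the raw body.
NEEDLE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def looks_binary(content_type: str) -> bool:
    """True for content types that should be scanned as raw bytes."""
    main = content_type.split(";", 1)[0].strip().lower()
    return not (main.startswith("text/") or main in TEXTUAL_TYPES)


def scan_raw_bytes(body: bytes, known_values: dict) -> list:
    """Return the names of known values (secrets or canaries) found in body."""
    hits = []
    for name, value in known_values.items():
        for encoding in NEEDLE_ENCODINGS:
            needle = value.encode(encoding)
            if needle and needle in body:
                hits.append(name)
                break
    return hits
```

Searching bytes directly avoids depending on how a lossy UTF-8 decode happens to render the surrounding binary data.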

What not to do

  • No more vendor regexes — whack-a-mole on the weak tier.
  • No ML/embedding secret detection — dependency-heavy, FP-prone, weaker than canaries + known-value matching for this threat model. (gitleaks already covers the git-gate push path; for egress, exact-value + canary beats generic scanning.)

Suggested first cut

Canary injection + the Rabin-Karp partial-match detector, both behind the existing ScanResult interface and within the sidecar's stdlib-only stance. Then broaden the known-value set (item 2), then entropy scoring (item 4) as warn-only.


Filed from a design discussion on improving token-exfiltration detection. Frames the regex tier as the weak backstop and the known-value/canary tier as the winnable game; see the egress DLP modules (dlp_detectors.py, egress_addon_core.py) and PRD 0052/0053/0056.

didericis-claude added the Kind/Security, Priority/Medium 3 labels 2026-06-24 09:21:39 -04:00
Author
Collaborator

Empirical grounding for the ranking

A Jan 2026 large-scale study (Malicious Agent Skills in the Wild, arXiv:2602.06547, https://arxiv.org/html/2602.06547v1) measured the real-world exfil distribution across a 98,380-skill snapshot. Confirmed malicious: 157 skills, 632 vulnerabilities.

  • 112 / 157 (71.3%) were credential harvesters — secret theft is the dominant malicious behavior.
  • External transmission appears in 79 (50.3%), most commonly via hardcoded endpoints over plain HTTP. "Undocumented network endpoints" (47.2%) is the largest shadow-feature category — attackers hide the endpoint, they don't encrypt the data.
  • Only 15 (9.5%) used any code obfuscation at all (base64+exec 61%, marshal/pickle 22%, hex 17%) — and that obfuscates the code, not the exfiltrated data. True encrypted/fragmented data exfil is rarer still.
  • Key finding: "most concealment operates at the documentation level, not the code level."

What this locks in

The measured distribution confirms the proposed ordering — do these first:

  1. Canaries — catch the env-dump / plaintext-transmission majority at zero FP.
  2. Broaden the known-value set beyond EGRESS_TOKEN_* — exact-value matching defeats the ~50% doing plaintext transmission.
  3. The destination allowlist (already shipped) is the strongest control here: hardcoded attacker endpoint → not allowlisted → blocked regardless of payload encoding. Most observed attacks die at this gate.

Defer (future-proofing, not day-one):

  • Items 3 (fragmentation-resistant matching / Rabin-Karp) and 4 (entropy scoring) defend against <10% of observed attackers. Build them when detection deployment pushes attackers upmarket — not before. Correctly ranked below canaries + known-value + allowlist.

Net: the data says naive exfil dominates, so the FP-free exact-match tier (canaries + known values) plus the existing allowlist is the high-ROI work; the encryption/fragmentation arms-race tier stays deferred.

Reference: didericis/bot-bottle#259