# Secret exfiltration tripwires: detecting a known string in many encodings

Research into tooling that, given a known secret string, generates its
representation in many common encodings so that simple string or regex matchers
can detect the secret in outbound traffic or logged output, regardless of
which naive encoding a misbehaving agent uses.

## Summary

- **No off-the-shelf tool does this exactly.** A search of GitHub, PyPI, npm,
  and security tool indexes (awesome-lists, awesome-honeypots, awesome-yara,
  awesome-canaries) found no project whose stated purpose is "given a secret,
  emit N encoded forms suitable for use as grep or regex patterns against
  outbound traffic." The capability exists in fragments across several adjacent
  categories, none of which compose into the complete picture.
- **The closest existing answer is YARA's `base64` / `base64wide` string
  modifiers** (added in YARA 4.0). A single string declaration with both
  modifiers generates all three base64 offset permutations automatically, in
  plain and UTF-16LE form. Combined with separate `ascii` / `wide` variants,
  this covers five of the most common encoding forms with no Python needed.
- **Secret scanners** (gitleaks, trufflehog, detect-secrets, ggshield) solve
  the inverse problem: detecting *unknown* secrets matching known patterns.
  They are not designed to scan for a specific known literal in N encodings.
  Gitleaks (v8.20+) and TruffleHog do perform multi-encoding decodes before
  running their detectors, but only as a preprocessing step, not as a way to
  produce N encoded forms for downstream matchers.
- **Canary token services** (Thinkst Canarytokens, OpenCanary) are callback
  canaries: detection fires when the token itself is accessed and phones home.
  They do not scan outbound streams for encoded representations of the canary
  value. They address a different threat model.
- **Enterprise DLP** products (Microsoft Purview, Symantec DLP/Broadcom,
  Nightfall, Cyberhaven) perform encoding-aware matching internally as a
  black-box feature. The capability is real but not exposed as an API and
  not available for use in a self-hosted container sidecar. Broadcom
  documents that Symantec DLP does not decode base64 or ROT13 in every
  inspection path, citing processing overhead.
- Writing this in ~100 lines of Python is feasible and is probably the right
  path for bot-bottle v1. The limiting factor is not the encoding logic,
  which is straightforward, but the false-positive rate from common
  base64 alphabet collisions and the zero coverage against any re-encoding
  that involves a key (encryption) or destroys byte boundaries (packet
  splitting).

---
## Encoding catalog

The following encodings are the realistic candidates for a naive agent
performing unintentional or low-sophistication exfiltration. For each,
the "FP risk" column notes the false-positive risk in a pattern matcher
scanning general HTTP request bodies or log lines.

| # | Encoding | Notes | FP risk |
|---|----------|-------|---------|
| 1 | Raw UTF-8 bytes (literal) | Baseline. Exact substring match. | Lowest |
| 2 | Base64 standard (RFC 4648 §4) with `=` padding | Up to 3 offset variants depending on byte alignment at encode boundary; YARA covers all three with the `base64` modifier. | Low–medium; base64 collisions are rare for 16+ char secrets |
| 3 | Base64 without padding | Same byte content, trailing `=` stripped. Matches appear inside longer base64 blobs; needs anchor-free search. | Medium |
| 4 | URL-safe base64 (RFC 4648 §5) | `+` → `-`, `/` → `_`. Required by many JWT and OAuth token formats. A separate encoding class from standard base64. | Low |
| 5 | Base64 of UTF-16LE encoding | Re-encode the secret as UTF-16LE, then base64. Produces different output than base64 of the UTF-8 bytes. The YARA `base64wide` modifier covers this. | Low |
| 6 | Base64 of UTF-16BE encoding | Same as above but big-endian byte order. Rare in practice but trivial to generate; include for completeness. | Very low |
| 7 | Double base64 | Base64 applied twice. Some log-shipping pipelines or config-encoding layers do this naively. | Very low |
| 8 | Hex lowercase (`0-9a-f`) | Every byte becomes two hex digits; trivially detectable with a fixed-length hex string search. | Low |
| 9 | Hex uppercase (`0-9A-F`) | Same encoded value, different case. Separate regex needed if your matcher is case-sensitive. | Low |
| 10 | Percent/URL encoding of all bytes | `%xx` for every byte. Browsers and curl typically only encode special chars; a misbehaving agent might encode all bytes. | Low in raw bodies; medium in URL query strings |
| 11 | Percent encoding of reserved and non-ASCII bytes only | The common case. Only high bytes and special chars are encoded; unreserved printable ASCII runs verbatim. | Medium; overlaps with normal URL encoding |
| 12 | Double percent encoding | `%` → `%25`, so `%41` → `%2541`. Common WAF bypass; appears in logged URLs after server-side decode. | Very low |
| 13 | JSON string escaping | `"` → `\"`, `\` → `\\`, non-ASCII → `\uXXXX`. Relevant when a secret is serialized into a JSON payload body. | Medium; very common in request bodies |
| 14 | HTML/XML entity encoding | `&amp;`, `&#xXX;`, `&#DDD;`. Relevant for form POST bodies and SOAP/XML egress. | Medium in HTML context; low in JSON |
| 15 | UTF-16LE raw bytes | Interleaved NUL bytes; `ABC` becomes `41 00 42 00 43 00`. Visible in PCAP or raw log hex dumps. The YARA `wide` modifier covers this. | Low in text content; medium in binary/multipart streams |
| 16 | UTF-32LE raw bytes | Four bytes per character. Unusual in web payloads but trivial for an agent to produce via Python's `encode('utf-32-le')`. | Very low |
| 17 | ROT13 | Caesar cipher with shift 13 over ASCII letters only; digits and special chars are unchanged. Weak obfuscation, cheap to detect. | Medium; common English words collide |
| 18 | ROT47 | Extends ROT13 over printable ASCII range 33–126. Transforms digits and symbols too. Less collision-prone than ROT13. | Low |
| 19 | gzip + base64 | `gzip(plaintext)` then base64-encode the binary output. Output is recognizable by the `H4sI` base64 prefix for the RFC 1952 gzip magic bytes. The gzip header carries a timestamp and the body depends on compressor settings, so match the prefix and decompress rather than pre-computing the full string. | Low; `H4sI` prefix is a cheap anchor |
| 20 | zlib/deflate + base64 | `zlib.compress()` (DEFLATE with zlib header) then base64. Output starts with `eJ` (from the `78 9c` zlib header at the default compression level), typically seen as `eJw`–`eJz`. | Low; magic prefix detectable |
| 21 | Leetspeak / character substitution | `e` → `3`, `a` → `@`, `o` → `0`, `i` → `1`, etc. No fixed mapping; generates many variants. High FP against non-secret content. | High; impractical to enumerate exhaustively |
| 22 | Reversed bytes | Secret reversed character-by-character. Trivial, occasionally used as a confusion layer. | Low |
| 23 | Space-separated characters | `SECRET` → `S E C R E T`. Defeats substring search; requires regex `S\s+E\s+C\s+R\s+E\s+T`. | Very low |
| 24 | Null-separated characters (wide variant) | Like space-separated but with literal `\x00` bytes. Same as UTF-16LE for ASCII-only secrets. | Very low |
| 25 | Base32 (RFC 4648 §6) | Used in TOTP seeds, some DNS exfil channels. Alphabet is `A-Z2-7`. Longer output than base64. | Low for secrets ≥ 10 chars |

**Diminishing returns note.** Encodings 1–13 cover the vast majority of
realistic naive-exfil vectors for an agent using standard Python or shell
tools. Encodings 14–25 are worth including in a comprehensive scanner but
individually contribute little marginal risk reduction.

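To make the catalog concrete, here is a minimal sketch of how a subset of the entries could be generated with only the Python standard library. The function name `encode_variants` and the dictionary keys are hypothetical, chosen for illustration; the base64 entries cover only the aligned (offset 0) form, and the gzip/zlib, leetspeak, and space-separated entries are left out because they are not fixed literals.

```python
import base64
import codecs
import json
from urllib.parse import quote


def encode_variants(secret: str) -> dict[str, bytes]:
    """Return a subset of the catalog encodings of `secret`, keyed by name."""
    raw = secret.encode("utf-8")
    b64 = base64.b64encode(raw)
    # Percent-encode every byte, including unreserved ASCII (catalog entry 10).
    pct_all = "".join(f"%{b:02X}" for b in raw)
    return {
        "raw": raw,                                              # entry 1
        "base64": b64,                                           # entry 2, offset 0 only
        "base64_nopad": b64.rstrip(b"="),                        # entry 3
        "base64_urlsafe": base64.urlsafe_b64encode(raw),         # entry 4
        "base64_utf16le": base64.b64encode(secret.encode("utf-16-le")),  # entry 5
        "double_base64": base64.b64encode(b64),                  # entry 7
        "hex_lower": raw.hex().encode(),                         # entry 8
        "hex_upper": raw.hex().upper().encode(),                 # entry 9
        "percent_all": pct_all.encode(),                         # entry 10
        "percent_reserved": quote(secret, safe="").encode(),     # entry 11
        "double_percent": pct_all.replace("%", "%25").encode(),  # entry 12
        # json.dumps adds surrounding quotes; strip them to get the escaped body.
        "json_escaped": json.dumps(secret)[1:-1].encode(),       # entry 13
        "utf16le": secret.encode("utf-16-le"),                   # entry 15
        "rot13": codecs.encode(secret, "rot_13").encode(),       # entry 17
        "reversed": secret[::-1].encode("utf-8"),                # entry 22
    }
```

Each value is a byte string that can be handed to a substring search as-is. Forms that happen to equal the raw secret (for example `percent_reserved` when the secret contains only unreserved characters) are duplicates and can be dropped before matching.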

---

## Adjacent category survey

### Secret scanners

The major open-source secret scanners (gitleaks, TruffleHog, detect-secrets,
and ggshield) all solve the *inverse* problem: given a body of text, find
strings that look like secrets using entropy analysis or regular expressions
written for known credential formats. They are not designed to answer "does
this payload contain *this specific known value* in any encoding?"

The distinction matters. These tools look for patterns that match the
structural form of, e.g., an AWS access key (`AKIA...`). They are not
designed to take a user-provided literal and report all byte-equivalent
encodings of it.

That said, two of these tools have encoding-aware *preprocessing* steps
that are directly relevant:

- **Gitleaks** ([github.com/gitleaks/gitleaks](https://github.com/gitleaks/gitleaks))
  added a `--max-decode-depth` flag in v8.20.0. When set to a non-zero
  value, it recursively decodes segments of the input before running
  regex detectors, supporting three encodings: hex, percent-encoding
  (URL encoding), and base64. The purpose is to find secrets that have
  been naively encoded before committing. This is functionally what
  a content tripwire needs to do, but it is tied to Gitleaks' own
  detector ruleset rather than user-supplied literals. The flag is off
  by default.

- **TruffleHog** ([github.com/trufflesecurity/trufflehog](https://github.com/trufflesecurity/trufflehog))
  decodes four encoding types before running its detectors: UTF-8, UTF-16,
  Base64, and Escaped Unicode (`\uXXXX` form). It also detects secrets
  hidden in archived and compressed files. A Truffle Security blog post
  from October 2024 documents this in detail
  ([trufflesecurity.com/blog/secret-scanning-encoded-and-archived-data](https://trufflesecurity.com/blog/secret-scanning-encoded-and-archived-data)).
  Same caveat as gitleaks: the decode step feeds the existing pattern
  detectors, not a user-supplied literal search.

- **detect-secrets** ([github.com/Yelp/detect-secrets](https://github.com/Yelp/detect-secrets))
  and **ggshield** ([github.com/GitGuardian/ggshield](https://github.com/GitGuardian/ggshield))
  do not appear to have multi-encoding decode steps; they operate on
  the input text as-is.

None of these tools expose a "match this literal in N encodings" API.
The closest workflow would be to feed gitleaks a custom rule that matches
pre-computed encoded variants, but that requires generating those variants
externally (i.e., the exact gap this research note addresses).

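The decode-before-match idea these scanners use can also be pointed at a user-supplied literal. The sketch below is an illustration of that idea only, not the gitleaks or TruffleHog implementation: the function names, the minimum run lengths in the regexes, and the default depth are all assumptions made for the example.

```python
import base64
import binascii
import re
from urllib.parse import unquote_to_bytes

# Runs that could plausibly be base64 or hex. The minimum lengths are arbitrary
# choices for this sketch, not values taken from any scanner.
B64_RUN = re.compile(rb"[A-Za-z0-9+/_-]{8,}={0,2}")
HEX_RUN = re.compile(rb"(?:[0-9A-Fa-f]{2}){4,}")


def _decoded_candidates(data: bytes):
    """Yield byte strings obtained by decoding encoded-looking segments of data."""
    if b"%" in data:
        yield unquote_to_bytes(data)
    for match in HEX_RUN.finditer(data):
        yield bytes.fromhex(match.group().decode("ascii"))
    for match in B64_RUN.finditer(data):
        # Normalize URL-safe characters and padding before decoding.
        chunk = match.group().rstrip(b"=").replace(b"-", b"+").replace(b"_", b"/")
        chunk += b"=" * (-len(chunk) % 4)
        try:
            yield base64.b64decode(chunk)
        except (binascii.Error, ValueError):
            continue  # not valid base64 after all


def contains_literal(data: bytes, literal: bytes, max_depth: int = 2) -> bool:
    """Return True if literal appears in data directly or after up to
    max_depth rounds of hex, percent, or base64 decoding."""
    if literal in data:
        return True
    if max_depth <= 0:
        return False
    return any(
        contains_literal(decoded, literal, max_depth - 1)
        for decoded in _decoded_candidates(data)
    )
```

Compared with pre-computing encoded forms, this direction handles arbitrary base64 alignment for free when the whole encoded token is visible, at the cost of doing decode work on every payload.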
### Canary token services

Canary token services operate on a fundamentally different detection model
and should not be confused with matcher canaries.

**Callback canaries** work by embedding a unique URL or resource reference
in a document, credential file, or environment variable. When an agent
(or attacker) reads and uses the credential, the canary service receives
an HTTP callback. The detection signal is the *access of the resource*,
not the presence of an encoded form in an outbound byte stream.

- **Thinkst Canarytokens** ([canarytokens.org](https://canarytokens.org) /
  [github.com/thinkst/canarytokens](https://github.com/thinkst/canarytokens))
  offers AWS key canaries, Azure login canaries, PDF canaries, and
  many others. All rely on callback detection. A Canarytokens bypass
  issue ([github.com/thinkst/canarytokens/issues/36](https://github.com/thinkst/canarytokens/issues/36))
  specifically documents that an attacker who extracts the canary value
  and uses it without triggering the callback URL (e.g., by sending the
  raw credential string to an external API over a non-canary channel) can
  bypass the detection entirely. This is the exact gap that encoding-aware
  content inspection would close.

- **OpenCanary** ([github.com/thinkst/opencanary](https://github.com/thinkst/opencanary))
  is Thinkst's self-hosted daemon that mimics network services (SSH,
  FTP, Telnet, HTTP, SMB, etc.) and alerts when they are probed. It is
  a network-layer honeypot, not an outbound content scanner. Detection
  is interaction-based, not encoding-aware content matching.

- **IndicatorOfCanary** by HackingLZ
  ([github.com/HackingLZ/IndicatorOfCanary](https://github.com/HackingLZ/IndicatorOfCanary))
  is conceptually the nearest to what is needed: it is a red-team tool
  for *detecting the presence of canary tokens inside files* before using
  those files, to avoid triggering callback alerts. It searches for known
  canary IoCs (callback domain patterns) in file metadata and content.
  It is the adversary-side mirror image (a red team detecting canaries
  before they can be tripped), but it shows the art of the possible for
  encoding-aware document inspection.

**The gap**: no canary service offers a "here is your secret; here are
12 encoded forms of it; ingest these into your egress scanner" API.

### Enterprise DLP

Enterprise DLP products do perform encoding-aware content matching, but
as an internal, closed-source capability:

- **Symantec DLP (Broadcom)**
  ([broadcom.com](https://www.broadcom.com/products/cybersecurity/information-protection/data-loss-prevention)).
  An official Broadcom knowledge base article explicitly states that
  Symantec DLP is not able to inspect and alert on base64 and ROT13
  encoded files in all inspection paths, citing processing overhead as
  the reason
  ([knowledge.broadcom.com/external/article/184415](https://knowledge.broadcom.com/external/article/184415/is-symantec-data-loss-prevention-dlp-abl.html)).
  This is a documented limitation, not marketing copy.

- **Microsoft Purview DLP**
  ([learn.microsoft.com/en-us/purview/dlp-policy-reference](https://learn.microsoft.com/en-us/purview/dlp-policy-reference))
  supports custom sensitive information types and trainable classifiers,
  but the encoding-awareness of its content matching engine is not
  publicly documented at the rule-authoring level. No public API exists
  for generating encoded variant patterns.

- **Nightfall AI** ([nightfall.ai](https://www.nightfall.ai/))
  uses deep-learning classifiers rather than regex, with 100+ AI-based
  detectors. It offers a REST API that accepts arbitrary strings and files
  and returns findings. Its encoding-awareness is model-dependent and
  not configurable by the caller. No "user-supplied literal + encoding
  sweep" mode is documented.

- **Cyberhaven** ([cyberhaven.com](https://www.cyberhaven.com/))
  is notable for its data-lineage approach: it tracks data transformations
  (copy, compress, rename, convert) and ties exfiltration events to
  original sensitive files even after transformation. This is a more
  powerful model than pure byte matching but requires a full endpoint
  agent and cloud backend. Not suitable for a container sidecar.

The enterprise DLP space confirms that encoding-aware detection is a solved
problem at enterprise scale, but the implementations are either closed-source
SaaS products, require endpoint agents, or are not configurable with
user-supplied literals.

### Pentest / red-team encoding generators

Several red-team tools generate many encodings of a payload, treating
encoding as a *generation* problem rather than a detection problem. They
are directly useful for producing the encoding catalog needed to build
tripwire patterns.

- **hURL** ([github.com/fnord0/hURL](https://github.com/fnord0/hURL))
  is a command-line encoder/decoder supporting URL encoding, double URL
  encoding, base64, HTML entities, ASCII-to-hex, integer-to-hex, ROT13,
  and SHA family hashes. It is packaged in Kali Linux (`apt install hurl`).
  It does not produce an "all encodings of this string" output in one
  command (each encoding is a separate invocation flag), but the
  encodings it covers align well with the practical catalog above.

- **CyberChef** ([gchq.github.io/CyberChef](https://gchq.github.io/CyberChef) /
  [github.com/gchq/CyberChef](https://github.com/gchq/CyberChef))
  is the GCHQ "cyber Swiss Army Knife," a browser-based tool with 400+
  encoding/decoding/transformation operations. It can be scripted via
  the `cyberchef-node` npm package
  ([github.com/nicowillis/cyberchef-node](https://github.com/nicowillis/cyberchef-node))
  to generate many encodings programmatically. The community recipe list
  ([github.com/mattnotmax/cyberchef-recipes](https://github.com/mattnotmax/cyberchef-recipes))
  is a good reference for the encoding chains real attackers use. CyberChef
  is the best single reference for what an exhaustive encoding catalog
  looks like in practice.

- **Burp Suite Intruder** (PortSwigger, commercial with community edition)
  has a payload processing rule chain in its Intruder module that can apply
  sequences of encoding transformations (URL, HTML, base64, ASCII hex,
  built-in strings) to a wordlist. Not scriptable outside Burp; primarily
  useful for interactive enumeration during a pentest.

- **wfuzz** ([github.com/xmendez/wfuzz](https://github.com/xmendez/wfuzz))
  supports encoder plugins (base64, urlencode, md5, sha1, double-urlencode,
  html, etc.) that can be chained with the `@` syntax in payload specs.
  It is a brute-force fuzzer, not a pattern generator, but its encoder
  catalog is a useful reference list.

None of these tools emit "N regex patterns for detecting this secret in
any of its encoded forms in an outbound stream." They are all generation
tools for attacks, not detection tools for defense.

---
## YARA string modifiers: the closest existing answer

YARA ([virustotal.github.io/yara-x](https://virustotal.github.io/yara-x) /
[yara.readthedocs.io](https://yara.readthedocs.io/en/stable/writingrules.html))
has the most complete existing treatment of "match this string in multiple
encoding forms" via its text-string modifier system. This was designed for
malware detection in binary files and network captures, but the same logic
applies to outbound traffic inspection.

### Available modifiers

Four modifiers apply directly to the encoding problem:

- **`ascii`**: match the string as raw ASCII/UTF-8 bytes. This is the default
  when no modifiers are specified.
- **`wide`**: match the string in UTF-16LE form (each ASCII byte interleaved
  with a NUL byte). Designed for detecting strings in Windows PE binaries.
- **`base64`**: generate all three base64 offset permutations of the string
  and search for any of them. The three permutations arise because base64
  encodes 3 bytes at a time; depending on whether the string starts at
  offset 0, 1, or 2 within a 3-byte group, it produces a different base64
  substring. YARA computes all three at compile time and emits patterns for
  each, so the rule author does not need to pre-compute them.
- **`base64wide`**: same as `base64`, but applied to the UTF-16LE form of
  the string. Covers the case where the secret was stored as a wide string
  (UTF-16LE) before being base64-encoded.

Modifiers can be combined on a single string declaration. A rule that
covers all four of these forms simultaneously looks like:

```yara
rule secret_tripwire {
    strings:
        $s = "my-secret-value" ascii wide base64 base64wide
    condition:
        $s
}
```

YARA will generate and search for (at minimum) eight patterns from
this single declaration: raw UTF-8, raw UTF-16LE, three base64
variants of the UTF-8 form, and three `base64wide` variants of the
UTF-16LE form.

A fifth modifier, **`xor`** (added in YARA 3.8), searches for the string
XORed with every single-byte key; since YARA 3.11 a range such as
`xor(0x01-0xff)` restricts the keys. The `xor` modifier cannot be combined
with `base64` or `base64wide` in a single string declaration (the compiler
rejects it). To cover both XOR and base64, two separate string declarations
are required.

**Custom base64 alphabets:** The `base64` and `base64wide` modifiers accept
an optional 64-character custom alphabet string. This covers URL-safe
base64 (`-_` substituted for `+/`) and any custom alphabets.

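The offset permutation math behind the `base64` modifier can be reproduced outside YARA, which is useful when the patterns need to go into a plain grep or regex matcher. The following is a minimal sketch of the general technique, not YARA's own code; the function name `base64_offset_variants` is hypothetical. For each of the three alignments it encodes the secret behind 0, 1, or 2 filler bytes, then trims the leading and trailing base64 characters whose bits depend on neighbouring data.

```python
import base64


def base64_offset_variants(secret: bytes) -> list[bytes]:
    """Return the three base64 substrings that can represent `secret`
    when it sits at byte offset 0, 1, or 2 within a 3-byte group."""
    variants = []
    # lead = number of leading base64 chars that contain bits of the filler
    # bytes and therefore vary with whatever precedes the secret.
    for offset, lead in ((0, 0), (1, 2), (2, 3)):
        encoded = base64.b64encode(b"\x00" * offset + secret).rstrip(b"=")
        # If the padded input does not end on a 3-byte boundary, the final
        # base64 char also contains bits of whatever follows the secret.
        if (offset + len(secret)) % 3:
            encoded = encoded[:-1]
        variants.append(encoded[lead:])
    return variants
```

Whatever bytes surround the secret inside a larger base64 blob, exactly one alignment applies, so at least one of the three returned substrings appears in the encoded output. Passing the UTF-16LE bytes of the secret gives the `base64wide` equivalents. Very short secrets yield very short (or empty) variants, which is the practical reason for a minimum length.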
### Limitations of YARA for this use case

- YARA does not natively cover hex encoding, percent encoding, JSON string
  escaping, gzip+base64, ROT13, or the other entries in the encoding catalog
  above. Those would require pre-computing the encoded forms externally and
  writing them as explicit hex-pattern strings in the rule.
- YARA operates on files or byte buffers passed by the calling application;
  it does not natively hook network streams. Integration with a proxy or
  a log-scanning pipeline requires an application layer to call
  `libyara` or the `yara-python` bindings on each captured request body.
- YARA's `base64` modifier has a documented minimum-length constraint: strings
  shorter than three characters cannot be base64-matched reliably due to
  the offset permutation math. This is unlikely to matter for real secrets
  but worth noting.

### DissectMalware/base64_substring

The tool `base64_substring`
([github.com/DissectMalware/base64_substring](https://github.com/DissectMalware/base64_substring))
generates a YARA rule to find base64-encoded files containing a specific
keyword, by enumerating all three offset permutations and emitting them as
a YARA rule. This predates YARA's built-in `base64` modifier and is largely
superseded by it, but the repository is useful as a reference for the
permutation math.

---
## DIY sketch

There is no off-the-shelf tool that takes a known secret and emits N patterns
for outbound stream matching. The remaining question is how much work it would
be to write one.

A minimal `tripwire-encode` script in Python (~80–120 lines) would:

1. Accept a secret string on stdin or as a CLI argument.
2. Emit one encoded form per line (or a JSON object mapping encoding names
   to encoded values) for encodings 1–20 from the catalog above.
3. Implement each form in 1–4 lines of Python using only the standard
   library (`base64`, `codecs`, `urllib.parse`, `gzip`, `io`);
   no third-party dependencies are required.
4. In a YARA output mode, emit a `.yar` rule with one string declaration
   per encoding (or use `ascii wide base64 base64wide` for the first four,
   then add explicit hex-string patterns for the remaining forms).

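Step 4 is mostly string formatting. A rough sketch of the YARA output mode is shown below, assuming the secret is printable ASCII without newlines and that the extra forms have already been computed as byte strings; `build_yara_rule` and `yara_hex` are hypothetical names, not part of any existing tool.

```python
def yara_hex(data: bytes) -> str:
    """Format bytes as a YARA hex string, e.g. b"ab" -> "{ 61 62 }"."""
    return "{ " + " ".join(f"{b:02X}" for b in data) + " }"


def build_yara_rule(name: str, secret: str, extra_forms: dict[str, bytes]) -> str:
    """Build rule text: one text string carrying the built-in modifiers,
    plus one hex string per pre-computed form."""
    # Escape the two characters that would break a YARA text string literal.
    escaped = secret.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"rule {name} {{",
        "    strings:",
        f'        $text = "{escaped}" ascii wide base64 base64wide',
    ]
    for label, data in extra_forms.items():
        if not label.isidentifier():
            raise ValueError(f"not usable as a YARA string identifier: {label!r}")
        lines.append(f"        ${label} = {yara_hex(data)}")
    lines += ["    condition:", "        any of them", "}"]
    return "\n".join(lines) + "\n"
```

For example, `build_yara_rule("secret_tripwire", secret, {"hex_lower": secret.encode().hex().encode()})` yields a rule whose `$hex_lower` string matches the lowercase hex form, alongside the four forms YARA derives itself.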

A companion `tripwire-grep` script (~30–50 lines) would:

1. Accept the secret (or the pre-computed encoding list) and a stream on stdin.
2. Compile the encodings into a single `re.search` call or `bytes.find` loop.
3. Exit non-zero and print the matching line/offset if any form is found.

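The matching side can be sketched in a few lines as well. This is one possible shape under stated assumptions: the encoded forms arrive as a name-to-bytes mapping whose names are valid Python identifiers, and the stream is iterated line by line. The names `compile_tripwire` and `scan` are hypothetical.

```python
import re


def compile_tripwire(forms: dict[str, bytes]) -> re.Pattern:
    """Combine literal encoded forms into one alternation, with a named
    group per form so a hit can report which encoding matched."""
    parts = [
        f"(?P<{name}>".encode() + re.escape(data) + b")"
        for name, data in forms.items()
    ]
    return re.compile(b"|".join(parts))


def scan(stream, pattern: re.Pattern) -> list[tuple[int, str]]:
    """Return (line number, form name) for every line of a binary stream
    that contains at least one encoded form."""
    hits = []
    for lineno, line in enumerate(stream, start=1):
        match = pattern.search(line)
        if match:
            hits.append((lineno, match.lastgroup))
    return hits
```

A command-line wrapper would call `scan(sys.stdin.buffer, pattern)`, print the hits, and exit non-zero if the list is non-empty. Forms that are not fixed literals, such as the space-separated variant, would need their own regex fragment rather than `re.escape`.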

For a proxy integration, the same encoding set can be compiled once at
container startup and injected into a mitmproxy addon or a small filter
script that wraps the proxy log output.

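As an illustration of the addon route, the sketch below follows mitmproxy's addon convention of a plain class with a `request(flow)` hook, reading `flow.request.raw_content` and calling `flow.kill()`. It has not been run against a live proxy; the class name `SecretTripwire` and the choice of three forms are assumptions for the example, and only the aligned base64 form is covered.

```python
import base64
import logging

log = logging.getLogger("tripwire")


class SecretTripwire:
    """mitmproxy-style addon: kill any request whose body contains the secret
    in raw, aligned-base64, or lowercase-hex form."""

    def __init__(self, secret: bytes):
        b64 = base64.b64encode(secret).rstrip(b"=")
        if len(secret) % 3:
            # The last base64 char depends on the byte following the secret.
            b64 = b64[:-1]
        # Computed once at startup, as the text above suggests.
        self.forms = {"raw": secret, "base64": b64, "hex": secret.hex().encode()}

    def find(self, body: bytes):
        """Return the name of the first form found in body, or None."""
        for name, form in self.forms.items():
            if form in body:
                return name
        return None

    def request(self, flow):
        # raw_content can be None when the request has no body.
        body = flow.request.raw_content or b""
        hit = self.find(body)
        if hit is not None:
            log.warning("tripwire fired: %s form of secret in request body", hit)
            flow.kill()
```

A script loaded with `mitmdump -s` would expose an instance through a module-level `addons = [SecretTripwire(...)]` list, with the secret read from wherever the container keeps it. Killing the flow is one policy choice; logging without blocking is the other obvious one.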

**Rough size estimate:** the full encoding catalog (25 entries) in Python
requires roughly 80 lines of encoding logic, 30 lines of CLI scaffolding,
and 40 lines of grep/match logic. Total under 200 lines.

**Important caveat on ROT13 and leetspeak.** These encodings have
non-trivial false-positive rates for short secrets and for secrets
containing common English substrings. If the secret contains only
alphanumerics, ROT13 and ROT47 patterns should be included but flagged
at warning level rather than error level in the output.

**The honest limitation of this entire approach** is that it is effective
only against naive, deterministic encodings where a single fixed plaintext
always produces the same fixed output. Any exfiltration method that
involves:

- A secret key (AES, ChaCha, XOR with a non-trivial key),
- Compression with a novel dictionary,
- Splitting the secret across multiple messages or packets,
- Steganography (hiding data in image pixels, audio samples, whitespace),

...will produce output with no detectable relationship to the input plaintext.
For bot-bottle's threat model (an untrusted but not adversarially
sophisticated agent that might accidentally or naively encode secrets before
logging or posting them), naive-encoding detection covers the realistic concern.
A deliberately adversarial agent with knowledge of its own monitoring is a
different threat model entirely.

---
## Integration point in bot-bottle

The natural choke point for encoding-aware content inspection in bot-bottle
is the body of outbound HTTP/HTTPS requests, visible at the egress proxy layer.
The egress-guard note documents smokescreen (Stripe,
[github.com/stripe/smokescreen](https://github.com/stripe/smokescreen))
and squid as sidecar proxy options. Both can be configured to pass request
bodies through a content-inspection hook. Smokescreen has an experimental
MITM mode with detailed HTTP logging; squid supports `icap_service` ICAP
integration for content inspection. Either could be extended with a small
filter script that runs the encoded-form search against each captured body.

**SSH is opaque to a proxy.** SSH connections carry encrypted payloads;
the proxy cannot inspect their contents. Egress SSH (e.g., git push to a
permitted remote) is outside the reach of this defense unless the container
is configured to log git command arguments before they execute, which would
catch a literal secret passed on the command line but not a secret embedded
in a commit object or passed via an environment variable to a subprocess.
For v1, this defense should be scoped explicitly to HTTP/HTTPS request bodies
and to logged shell command output, and SSH egress should be called out as
a gap.

---
## Limitations and bypass surface

This defense catches secrets in transit only when the encoding is deterministic
and pre-computable. The following techniques defeat it entirely:

- **Symmetric encryption with a key** (AES-CBC, ChaCha20, XOR with a 16+
  byte key). The ciphertext has no statistical relationship to the plaintext.
  Detection requires key recovery, which is out of scope for a content scanner.

- **Splitting across messages or packets.** If a secret is split as
  `SECRET[0:4]` in one request and `SECRET[4:]` in another, no single-message
  content scanner will fire. Stateful stream reassembly would be
  required, which is a much harder problem.

- **Compression with a novel or pre-shared dictionary** (LZ4, Brotli,
  zstd). Standard gzip/zlib produce recognizable magic-byte prefixes in
  base64, but a custom compression configuration can produce output that
  is entropy-high and format-unrecognizable.

- **Steganography.** Encoding data into image LSBs, audio samples,
  whitespace-as-binary in a text file, or Unicode homoglyphs. The output
  looks like a legitimate artifact; no regex will find it.

- **Hash-chained exfiltration.** Sending the SHA-256 of secret fragments
  in sequence; a receiver who knows the secret can verify, but no string
  scanner can reconstruct the original.

- **Protocol-level encoding not visible at the HTTP body layer.** DNS
  exfiltration (encoding data in query subdomains), ICMP covert channels,
  or TCP sequence number steganography all operate below HTTP and are not
  intercepted by an HTTP proxy at all. The egress-guard note covers DNS
  separately via a controlled resolver.

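Two of these bypasses are easy to demonstrate against a per-message literal matcher. The snippet below is a toy illustration with a made-up secret and a made-up three-byte XOR key; `fires` is a hypothetical stand-in for whatever matcher is deployed.

```python
import base64
from itertools import cycle


def fires(message: bytes, forms: list[bytes]) -> bool:
    """Per-message tripwire: True if any pre-computed form is a substring."""
    return any(form in message for form in forms)


secret = b"example-secret-value"
forms = [secret, secret.hex().encode()]

# A naive encoding is caught.
caught = fires(b"token=" + secret.hex().encode(), forms)

# Splitting: each message carries only part of the secret, so neither fires.
first, second = b"a=" + secret[:4], b"b=" + secret[4:]
split_caught = fires(first, forms) or fires(second, forms)

# Keyed XOR followed by base64: the output shares no fixed form with the secret.
key = b"k3y"  # hypothetical key known only to the sender and receiver
xored = bytes(b ^ k for b, k in zip(secret, cycle(key)))
xor_caught = fires(base64.b64encode(xored), forms)
```

Here `caught` is true while `split_caught` and `xor_caught` are both false, which is the behaviour the list above describes.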

In the bot-bottle context, the primary realistic concern is an agent
that naively embeds a secret in a log line, a curl argument, a JSON body,
or a shell heredoc without specifically intending to obfuscate. All of the
above bypass techniques require deliberate, adversarially-motivated engineering
from a process that is assumed to be untrusted but not intentionally malicious.
The encoding-aware tripwire is an appropriate control for this threat model,
provided its scope is communicated clearly.