didericis-claude ea6bc5a170 docs: draft PRD prd-new for strengthen-outbound-exfil-detection

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-24 23:09:11 -04:00

PRD prd-new: Strengthen outbound exfiltration detection

Status: Draft
Author: claude
Created: 2026-06-25
Issue: #259

Summary

The current DLP system has two detection tiers: weak (vendor token regexes) and strong (exact matching of provisioned EGRESS_TOKEN_* values in multiple encodings). This PRD adds four complementary hardening layers in ranked order of impact:

Canary tokens — inject a per-session fake secret into the agent's environment; any outbound appearance is a zero-false-positive exfil signal.
Broadened known-value scanning — extend the strong tier beyond EGRESS_TOKEN_* via a BOT_BOTTLE_SENSITIVE_PREFIXES sidecar env var.
Fragmentation-resistant matching — catch separator injection and long-substring leaks via alnum projection and sliding-window substring search.
Entropy scoring — warn-only flag on high-entropy blobs as a complementary signal for encrypted exfil.

Problem

After PRD 0056 the outbound scan covers all HTTP request surfaces (host, path, query, headers, body). But the content-based strong tier only matches EGRESS_TOKEN_* values and their encodings. Three gaps remain:

| Gap | How an agent exploits it |
| --- | --- |
| No canary | There is no "can't be false-positive" signal |
| Narrow prefix filter | Sensitive values provisioned under other env prefixes (MCP keys, API keys injected via the agent provider) are invisible to `scan_known_secrets` |
| Fragmentation blindness | Inserting `-`, space, or `\n` between secret characters turns an exact-match into a miss |
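
To make the fragmentation gap concrete, here is a small illustration of how an exact substring check misses a separated secret. The secret value is made up, and the check is a plain `in` test standing in for the existing exact-match tier, not the real detector code.

```python
secret = "tok_9f8e7d6c5b4a3210"  # made-up value for illustration

# Evasion 1: a separator between every character.
dashed = "-".join(secret)
# Evasion 2: a single newline dropped into the middle.
split = secret[:10] + "\n" + secret[10:]

# A plain substring test (stand-in for exact matching) misses both.
exact_hits = [secret in body for body in (dashed, split)]

# The receiver recovers the secret with one trivial transformation.
recovered = dashed.replace("-", "")
```

Both entries of `exact_hits` are `False`, while `recovered` equals the original secret.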

Goals / Success Criteria

Each launched bottle has a unique canary token in the agent's environment (BOT_BOTTLE_CANARY) and the egress sidecar's environment (EGRESS_TOKEN_CANARY). Any outbound appearance of the canary blocks the request with reason "canary token".
scan_known_secrets accepts a sensitive_prefixes parameter (default: ("EGRESS_TOKEN_",)). scan_outbound reads BOT_BOTTLE_SENSITIVE_PREFIXES from environ and merges those prefixes in, so operators can mark additional env vars as scanned values without changing the manifest schema.
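For every secret not found by the exact-match pass, a secondary alnum-projection pass strips all non-alphanumeric characters from both the secret and the text, then checks whether the projected secret appears in the projected text. This catches separator-injection evasion (secret MY-SECRET, body contains MY SECRET; both project to MYSECRET).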
A sliding-window partial-match pass checks for long-enough contiguous substrings of the secret's alnum projection in the text's alnum projection. Any match ≥ PARTIAL_MATCH_MIN_LEN (12 chars) blocks with reason "partial match".
A new scan_entropy detector flags outbound text windows with Shannon entropy ≥ ENTROPY_BLOCK_THRESHOLD (5.5 bits/char) at warn severity only. It is registered under the new detector name "entropy" in OUTBOUND_DETECTOR_NAMES and disabled by default (routes must opt in).
Binary request bodies (those that are not valid UTF-8) are decoded via latin-1 instead of utf-8 with errors="replace", preserving every byte value and allowing ASCII-range secrets to be found within binary payloads.
All new behaviour is unit-tested; existing tests pass unchanged.

Non-goals

Rolling per-host buffer for split-across-requests detection (state in the stateless addon is complex; deferred).
Additional vendor regexes.
ML / embedding-based detection.
Entropy-based hard blocks (warn only per the issue).

Design

Canary token flow

Egress.prepare()
  canary = secrets.token_urlsafe(32)
  EgressPlan(canary=canary, ...)

Docker compose render:
  sidecar env: EGRESS_TOKEN_CANARY=<canary>   ← scanned by existing known-secrets detector
  agent env:   BOT_BOTTLE_CANARY=<canary>      ← visible to agent as a "secret"

macos-container launch: same literals added to sidecar + agent env entries

EGRESS_TOKEN_CANARY matches the EGRESS_TOKEN_ prefix already scanned by scan_known_secrets, so no detector code changes are required for canary detection — only the injection path.
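
The injection path could look roughly like the sketch below. `CanaryPlan`, `prepare_canary`, and `render_env` are hypothetical names used only for illustration; the real change lives on `EgressPlan` and in the backend renderers. Only `secrets.token_urlsafe(32)` and the two env var names come from the flow above.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryPlan:
    """Hypothetical stand-in for the canary field on EgressPlan."""
    canary: str


def prepare_canary() -> CanaryPlan:
    # 32 random bytes, URL-safe base64 encoded without padding (43 characters).
    return CanaryPlan(canary=secrets.token_urlsafe(32))


def render_env(plan: CanaryPlan) -> tuple[dict[str, str], dict[str, str]]:
    """Return (sidecar_env, agent_env) carrying the same canary value."""
    sidecar_env = {"EGRESS_TOKEN_CANARY": plan.canary}  # picked up by the strong tier
    agent_env = {"BOT_BOTTLE_CANARY": plan.canary}      # bait visible to the agent
    return sidecar_env, agent_env


plan = prepare_canary()
sidecar_env, agent_env = render_env(plan)
```

Because the value is generated once per `prepare_canary()` call, each launched bottle gets its own canary, and the sidecar and agent always see the same string.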

Broadened known-value scanning

scan_known_secrets gains a sensitive_prefixes parameter:

def scan_known_secrets(
    text: str,
    *,
    location: str = "body",
    env: Mapping[str, str] | None = None,
    sensitive_prefixes: tuple[str, ...] = ("EGRESS_TOKEN_",),
) -> ScanResult | None:

scan_outbound reads BOT_BOTTLE_SENSITIVE_PREFIXES (comma-separated list of additional prefixes) from environ and appends them:

extra = tuple(
    p for p in environ.get("BOT_BOTTLE_SENSITIVE_PREFIXES", "").split(",") if p
)
sensitive_prefixes = ("EGRESS_TOKEN_",) + extra

redact_tokens receives the same treatment for consistent redaction.
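
A minimal sketch of the prefix merge and the resulting value selection, under a few assumptions: `resolve_prefixes` and `collect_secrets` are hypothetical helper names, whitespace around each comma-separated entry is trimmed (the snippet above does not do this), and empty values are skipped. The env var names and values in the example are made up.

```python
from collections.abc import Mapping

DEFAULT_PREFIXES = ("EGRESS_TOKEN_",)


def resolve_prefixes(environ: Mapping[str, str]) -> tuple[str, ...]:
    """Merge operator-supplied prefixes into the default set."""
    raw = environ.get("BOT_BOTTLE_SENSITIVE_PREFIXES", "")
    # Assumption: trim whitespace and drop empty entries such as a trailing comma.
    extra = tuple(p.strip() for p in raw.split(",") if p.strip())
    return DEFAULT_PREFIXES + extra


def collect_secrets(
    environ: Mapping[str, str], prefixes: tuple[str, ...]
) -> dict[str, str]:
    """Return env vars whose names start with any sensitive prefix."""
    return {
        name: value
        for name, value in environ.items()
        if value and name.startswith(prefixes)  # str.startswith accepts a tuple
    }


env = {
    "BOT_BOTTLE_SENSITIVE_PREFIXES": "MCP_KEY_, API_KEY_",
    "EGRESS_TOKEN_GITHUB": "ghp_example",
    "MCP_KEY_SEARCH": "mcp-example",
    "PATH": "/usr/bin",
}
prefixes = resolve_prefixes(env)
secrets_found = collect_secrets(env, prefixes)
```

Dropping empty entries matters here: an empty prefix would match every variable name and turn the whole environment into scanned values.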

Fragmentation-resistant matching

A new helper _alnum_projection(text) strips all non-alphanumeric characters. scan_known_secrets runs three passes per secret:

Exact pass: existing encoded-variant loop (unchanged).
Alnum-projection pass: if the secret's alnum projection has ≥ 8 chars, check whether it appears in the text's alnum projection. Match → block with "fragmented match (separator injection)" reason.
Partial-substring pass: if the secret's alnum projection has ≥ PARTIAL_MATCH_MIN_LEN chars (12), slide a window of PARTIAL_MATCH_MIN_LEN chars across the secret's projection and look for each window in the text's alnum projection. First match → block with "partial match" reason.

All three passes run only for the "known_secrets" detector; the token-pattern and entropy detectors are unchanged.

Entropy scoring

New public function:

def scan_entropy(
    text: str,
    *,
    location: str = "body",
    window: int = ENTROPY_WINDOW,           # 64
    threshold: float = ENTROPY_BLOCK_THRESHOLD,  # 5.5
) -> ScanResult | None:

scan_entropy slides a window of window characters across text in steps of window // 2. If any window's Shannon entropy is ≥ threshold, it returns a warn-severity ScanResult. It never blocks.

OUTBOUND_DETECTOR_NAMES gains "entropy". Routes opt in via their dlp block; entropy scanning is off by default to avoid false-positive noise on legitimate binary payloads.
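
A rough sketch of the scoring itself, assuming entropy is computed over the per-window character distribution. `shannon_entropy` and `max_window_entropy` are hypothetical helper names; the real `scan_entropy` wraps the result in a `ScanResult`, which is omitted here. Trailing windows shorter than `window` are scored as-is, which is an assumption of this sketch.

```python
import math
import string
from collections import Counter

ENTROPY_WINDOW = 64
ENTROPY_BLOCK_THRESHOLD = 5.5


def shannon_entropy(chunk: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())


def max_window_entropy(text: str, window: int = ENTROPY_WINDOW) -> float:
    """Highest entropy over windows taken at steps of window // 2."""
    step = max(1, window // 2)
    # Trailing windows may be shorter than `window`; they are scored as-is.
    starts = range(0, max(len(text), 1), step)
    return max(shannon_entropy(text[i:i + window]) for i in starts)


# 64 distinct characters (the URL-safe base64 alphabet): the densest possible window.
dense = string.ascii_letters + string.digits + "-_"
prose = "the quick brown fox jumps over the lazy dog " * 3

dense_flagged = max_window_entropy(dense) >= ENTROPY_BLOCK_THRESHOLD
prose_flagged = max_window_entropy(prose) >= ENTROPY_BLOCK_THRESHOLD
```

A 64-character window can reach at most log2(64) = 6 bits/char, so the 5.5 threshold sits close to the ceiling. Lowercase English prose draws on roughly 27 symbols and cannot exceed log2(27) ≈ 4.75 bits/char, so it stays below the threshold.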

Binary body handling

In scan_outbound, the bytes → str decoding changes from:

body.decode("utf-8", errors="replace")

to:

try:
    text = body.decode("utf-8")
except UnicodeDecodeError:
    text = body.decode("latin-1")

latin-1 is a bijective byte↔code point mapping: every byte value is preserved as its corresponding Latin-1 code point, so ASCII-range secret strings remain intact and str.find / regex still locate them correctly. Strict UTF-8 is tried first, with latin-1 as the fallback, so valid UTF-8 bodies are decoded faithfully.
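
The decoding order can be sketched as a small helper. `decode_body` is a hypothetical name and the payload is made up; the point is that an ASCII secret embedded in bytes that are not valid UTF-8 survives the decode and the original bytes can be recovered exactly.

```python
def decode_body(body: bytes) -> str:
    """Decode strictly as UTF-8, falling back to latin-1 for binary data."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps each byte 0x00-0xFF to the code point with the same value.
        return body.decode("latin-1")


# 0xFF can never appear in valid UTF-8, so this payload takes the fallback path.
payload = b"\xff\xfe\x00binary" + b"EGRESS-SECRET-123" + b"\x80\x81"
text = decode_body(payload)

secret_visible = "EGRESS-SECRET-123" in text
round_trips = text.encode("latin-1") == payload
```

With `errors="replace"` the invalid bytes would collapse into U+FFFD and the byte values would be lost; with the latin-1 fallback the decode is lossless, which is what `round_trips` checks.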

Implementation

Delivered in three commits on the same branch:

DLP detector changes — _alnum_projection, fragmentation passes, scan_entropy, broadened scan_known_secrets, updated scan_outbound and redact_tokens; all accompanying unit tests.
Canary injection — EgressPlan.canary, Egress.prepare(), Docker compose + macos-container backend injection.
PRD flip — Status: Draft → Active.
