From 49f77f2d1e52d73121992a44cf090ab9ea009c02 Mon Sep 17 00:00:00 2001
From: didericis
Date: Thu, 4 Jun 2026 13:54:46 -0400
Subject: [PATCH] docs: accommodate PR feedback on detector architecture

Per feedback from PR 192:
- Restructure around outbound_detectors (requests to upstream) and inbound_detectors (responses from upstream)
- Rename to 'secret exfiltration' detection for Phase 1
- Add 'known_secrets' detector for provisioned credentials
- Make scanning enabled by default per detector type
- Clarify that multiple encodings of secrets should be checked

Phase 1 now focuses on preventing outbound credential leaks.
Phase 2 handles inbound prompt injection attacks.

Co-Authored-By: Claude Haiku 4.5
---
 docs/research/dlp-alternatives-to-pipelock.md | 180 ++++++++++++++++--
 1 file changed, 163 insertions(+), 17 deletions(-)

diff --git a/docs/research/dlp-alternatives-to-pipelock.md b/docs/research/dlp-alternatives-to-pipelock.md
index 62f81af..71ad833 100644
--- a/docs/research/dlp-alternatives-to-pipelock.md
+++ b/docs/research/dlp-alternatives-to-pipelock.md
@@ -41,6 +41,20 @@ All of pipelock's scanning rules are global. If route A wants to skip `.whl` sca
 
 Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.
 
+### Issue 4: No prompt injection detection
+
+**Important clarification:** Pipelock does NOT detect prompt injections. It detects:
+- Token patterns (regex)
+- Entropy (random-looking strings)
+- BIP-39 seed phrases (12+ word checksums)
+
+But it cannot detect semantic attacks like:
+- Attempts to exfiltrate system prompts
+- Jailbreak attempts ("ignore previous instructions")
+- Model output that reveals internal system details
+
+This is a novel threat specific to LLM agents that pipelock wasn't designed for.
+
 ## Replacement design: mitmproxy-based DLP addon
 
 Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:
@@ -61,25 +75,32 @@ Egress (mitmproxy)
 
 ### Per-route configuration in manifest
 
+Routes separately configure **outbound** (request to upstream) and **inbound** (response from upstream) scanning:
+
 ```yaml
 egress:
   routes:
     - host: api.anthropic.com
       dlp:
-        enabled: true
-        detectors: [tokens, entropy]
+        outbound_detectors: [token_patterns, known_secrets]  # default
+        inbound_detectors: [naive_injection_detection]  # default
 
     - host: files.pythonhosted.org
       dlp:
-        enabled: true
-        request_only: true  # Scan outbound, skip response
-        skip_extensions: [".whl", ".tar.gz"]
+        outbound_detectors: [token_patterns, known_secrets]
+        inbound_detectors: false  # Skip response scanning (binary downloads)
 
     - host: internal-service.corp
       dlp:
-        enabled: false  # Trusted internal, no scanning
+        outbound_detectors: false
+        inbound_detectors: false  # Trusted internal, no scanning
 ```
 
+**Detectors:**
+- `token_patterns`: API keys, GitHub tokens, AWS credentials, etc.
+- `known_secrets`: Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
+- `naive_injection_detection`: Semantic attacks on system prompt (see section below)
+
 ### Detector design
 
 Three core detectors, each with tunable sensitivity:
@@ -154,22 +175,147 @@ response_body_scanning:
 
 **Fallback:** If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.
 
+## Naive prompt injection detector design
+
+Since pipelock doesn't detect prompt injections, we need a custom detector. Here's a permissive design that favors missing attacks over false positives:
+
+### What to detect
+
+**High confidence (block immediately):**
+1. Response contains known credential pattern + "system prompt" phrase together
+2. Response contains both "instructions" and a token pattern
+
+**Medium confidence (warn):**
+1. Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
+2. Multiple jailbreak keywords in single response
+
+**Ignore (too noisy):**
+- Single jailbreak keywords without additional context
+- "system prompt" in documentation contexts
+- Common phrases like "instructions provided"
+
+### Naive detector pseudocode
+
+```python
+import re
+
+class PromptInjectionDetector:
+    # Phrases that suggest prompt exfiltration
+    DISCLOSURE_PHRASES = [
+        r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?)',
+        r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
+    ]
+
+    # Phrases suggesting jailbreak attempts
+    JAILBREAK_PHRASES = [
+        r'(?i)(ignore\s+previous|forget\s+everything|disregard)',
+        r'(?i)(from\s+now\s+on|pretend|act\s+as)',
+        r'(?i)(bypass|circumvent|override)',
+    ]
+
+    TOKEN_PATTERNS = [
+        r'AKIA[0-9A-Z]{16}',  # AWS
+        r'ghp_[A-Za-z0-9_]{36}',  # GitHub
+        r'sk_live_[A-Za-z0-9]{24}',  # Stripe
+        r'Bearer\s+[A-Za-z0-9._-]{50,}',  # JWT-like tokens
+    ]
+
+    def scan_response(self, response_body):
+        """Returns (severity, reason) or (None, None) if clean."""
+        # Rule 1: Disclosure + token = HIGH confidence block
+        disclosure_found = any(
+            re.search(phrase, response_body)
+            for phrase in self.DISCLOSURE_PHRASES
+        )
+        token_found = any(
+            re.search(pattern, response_body)
+            for pattern in self.TOKEN_PATTERNS
+        )
+
+        if disclosure_found and token_found:
+            return ("BLOCK", "Prompt disclosure with embedded credential")
+
+        # Rule 2: Multiple jailbreak keywords = WARN
+        jailbreak_count = sum(
+            1 for phrase in self.JAILBREAK_PHRASES
+            if re.search(phrase, response_body)
+        )
+
+        if jailbreak_count >= 2:
+            return ("WARN", f"{jailbreak_count} jailbreak phrase patterns matched")
+
+        # Rule 3: Disclosure alone without tokens = WARN only if very explicit
+        if disclosure_found and "system prompt:" in response_body.lower():
+            return ("WARN", "Explicit system prompt disclosure")
+        # Otherwise: clean
+        return (None, None)
+```
+
+### Why this is permissive
+
+1. **Single keywords ignored**: "ignore previous instructions" in a legitimate conversation doesn't trigger
+2. **Context required**: disclosure phrases alone don't trigger; they need an accompanying token or an explicit "system prompt:" marker
+3. **Documentation exemption**: "instructions provided" in a help section won't block
+4. **Warn vs. block**: Only block on high-confidence signals; warn on medium
+5. **No entropy-based guessing**: We don't try to be clever about detecting obfuscated prompts
+
+### False negatives this misses
+
+This detector intentionally lets through:
+- Prompt injections using novel phrasing we haven't seen
+- Obfuscated jailbreak attempts ("behave differently", "role-play")
+- Exfiltration via indirect methods ("describe the system", "what are your constraints")
+- Sophisticated attacks that split the prompt across multiple exchanges
+
+**Rationale:** Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.
+
+### Per-route configuration
+
+Routes can enable/disable prompt injection scanning:
+
+```yaml
+egress:
+  routes:
+    - host: api.anthropic.com
+      dlp:
+        outbound_detectors: [token_patterns]
+        inbound_detectors: [naive_injection_detection]
+
+    - host: internal-docs.corp
+      dlp:
+        outbound_detectors: [token_patterns]
+        inbound_detectors: false  # Skip prompt injection (trusted internal)
+```
+
 ## Implementation phases
 
-### Phase 1: MVP (2-3 weeks)
-- Token detector (regex for API key patterns)
-- Entropy detector (reuse pipelock thresholds)
-- Per-route `dlp: {enabled, request_only, skip_extensions}` config
-- Block on token match, warn on entropy hit
+### Phase 1: Secret exfiltration detection (2-3 weeks)
+**Goal:** Prevent credentials from leaking to upstream services
+
+- **Token patterns detector**: API keys, GitHub tokens, AWS credentials (regex-based)
+- **Known secrets detector**: Check if provisioned credentials appear in outbound traffic
+  - Secrets passed to cred-proxy or agent environment
+  - Multiple encodings (base64, hex, URL-encoded variants)
+- **Outbound scanning by default**: enabled for all routes unless explicitly disabled
+- **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
+- **Action:** Block immediately on token match; warn on entropy hits (sensitivity tuned low to avoid false positives)
 
-### Phase 2: Prompt injection (1-2 weeks)
-- Pattern detector for system prompt exfiltration
-- Integrates with phase 1 config
+### Phase 2: Prompt injection detection (1-2 weeks)
+**Goal:** Prevent agents from exfiltrating system prompts or being jailbroken
 
-### Phase 3: Hardening (optional)
-- Custom entropy heuristics for LLM payloads
-- Sampling/audit mode for high-entropy responses
+- **Naive injection detector**: as sketched above
+- **Inbound scanning by default**: enabled for all routes unless explicitly disabled
+- **Per-route config:** `inbound_detectors: [naive_injection_detection]`
+- **Actions:**
+  - BLOCK: Credential + prompt disclosure detected
+  - WARN: Multiple jailbreak keywords or explicit prompt disclosure
+  - ALLOW: Single keywords or documentation phrases
+
+### Phase 3: Hardening & tuning (2-3 weeks, optional)
+- Real-world false positive analysis from Phases 1 and 2
 - Rate limiting on DLP blocks
+- Audit/sampling mode for flagged responses
+- Additional encodings for known_secrets (GZIP, base32, etc.)
 
 ## Open questions
 
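
The Phase 1 known-secrets check (matching provisioned credentials in several encodings) is described only in prose in this patch. A minimal sketch of how it might work is shown here, assuming secrets are available as a name-to-value mapping and the outbound body has already been decoded to text. The helper names `encoded_variants` and `find_known_secret` are hypothetical and do not come from the patch or from mitmproxy.

```python
import base64
import urllib.parse
from typing import Optional


def encoded_variants(secret: str) -> set:
    """Return the secret plus a few common re-encodings of it (hypothetical helper)."""
    raw = secret.encode("utf-8")
    return {
        secret,
        # Padding is stripped so padded and unpadded base64 both match as substrings.
        base64.b64encode(raw).decode("ascii").rstrip("="),
        base64.urlsafe_b64encode(raw).decode("ascii").rstrip("="),
        raw.hex(),
        raw.hex().upper(),
        # Percent-encode every reserved character, including "/".
        urllib.parse.quote(secret, safe=""),
    }


def find_known_secret(body: str, secrets: dict) -> Optional[str]:
    """Return the name of the first provisioned secret found in body, else None."""
    for name, value in secrets.items():
        if any(variant in body for variant in encoded_variants(value)):
            return name
    return None
```

This only catches a secret that was encoded on its own. A secret buried inside a larger base64 blob at an arbitrary byte offset would not match, and very short secrets would produce false positives, so a real detector would likely need a minimum-length guard.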