From 49f77f2d1e52d73121992a44cf090ab9ea009c02 Mon Sep 17 00:00:00 2001
From: didericis <eric@dideric.is>
Date: Thu, 4 Jun 2026 13:54:46 -0400
Subject: [PATCH] docs: accommodate PR feedback on detector architecture

Per feedback from PR 192:

- Restructure around outbound_detectors (requests to upstream) and
  inbound_detectors (responses from upstream)
- Rename to 'secret exfiltration' detection for Phase 1
- Add 'known_secrets' detector for provisioned credentials
- Make scanning enabled by default per detector type
- Clarify that multiple encodings of secrets should be checked

Phase 1 now focuses on preventing outbound credential leaks.
Phase 2 handles inbound prompt injection attacks.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
---
 docs/research/dlp-alternatives-to-pipelock.md | 180 ++++++++++++++++--
 1 file changed, 163 insertions(+), 17 deletions(-)

diff --git a/docs/research/dlp-alternatives-to-pipelock.md b/docs/research/dlp-alternatives-to-pipelock.md
index 62f81af..71ad833 100644
--- a/docs/research/dlp-alternatives-to-pipelock.md
+++ b/docs/research/dlp-alternatives-to-pipelock.md
@@ -41,6 +41,20 @@ All of pipelock's scanning rules are global. If route A wants to skip `.whl` sca
 
 Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.
 
+### Issue 4: No prompt injection detection
+
+**Important clarification:** Pipelock does NOT detect prompt injections. It detects:
+- Token patterns (regex)
+- Entropy (random-looking strings)
+- BIP-39 seed phrases (12+ word checksums)
+
+But it cannot detect semantic attacks like:
+- Attempts to exfiltrate system prompts
+- Jailbreak attempts ("ignore previous instructions")
+- Model output that reveals internal system details
+
+These are novel threats specific to LLM agents, which pipelock wasn't designed to handle.
+
 ## Replacement design: mitmproxy-based DLP addon
 
 Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:
@@ -61,25 +75,32 @@ Egress (mitmproxy)
 
 ### Per-route configuration in manifest
 
+Routes separately configure **outbound** (request to upstream) and **inbound** (response from upstream) scanning:
+
 ```yaml
 egress:
   routes:
     - host: api.anthropic.com
       dlp:
-        enabled: true
-        detectors: [tokens, entropy]
+        outbound_detectors: [token_patterns, known_secrets]  # default
+        inbound_detectors: [naive_injection_detection]  # default
     
     - host: files.pythonhosted.org
       dlp:
-        enabled: true
-        request_only: true  # Scan outbound, skip response
-        skip_extensions: [".whl", ".tar.gz"]
+        outbound_detectors: [token_patterns, known_secrets]
+        inbound_detectors: false  # Skip response scanning (binary downloads)
     
     - host: internal-service.corp
       dlp:
-        enabled: false  # Trusted internal, no scanning
+        outbound_detectors: false
+        inbound_detectors: false  # Trusted internal, no scanning
 ```
 
+**Detectors:**
+- `token_patterns` — API keys, GitHub tokens, AWS credentials, etc.
+- `known_secrets` — Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
+- `naive_injection_detection` — Semantic attacks on system prompt (see section below)
+
 ### Detector design
 
 Three core detectors, each with tunable sensitivity:
@@ -154,22 +175,147 @@ response_body_scanning:
 
 **Fallback:** If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.
 
+## Naive prompt injection detector design
+
+Because pipelock doesn't detect prompt injections, we need a custom detector. The design below is deliberately permissive: it accepts missed attacks (false negatives) in exchange for fewer false positives.
+
+### What to detect
+
+**High confidence (block immediately):**
+1. Response contains a known credential pattern together with a prompt-disclosure phrase (e.g. "system prompt")
+2. Response contains an instructions-disclosure phrase (e.g. "original instructions") together with a token pattern
+
+**Medium confidence (warn):**
+1. Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
+2. Multiple jailbreak keywords in single response
+
+**Ignore (too noisy):**
+- Single jailbreak keywords without additional context
+- "system prompt" in documentation contexts
+- Common phrases like "instructions provided"
+
+### Naive detector pseudocode
+
+```python
+import re
+
+class PromptInjectionDetector:
+    # Phrases that suggest prompt exfiltration (\b keeps "you are a" from matching "you are able")
+    DISCLOSURE_PHRASES = [
+        r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?\b)',
+        r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
+    ]
+
+    # Phrases suggesting jailbreak attempts; each entry is one phrase group
+    JAILBREAK_PHRASES = [
+        r'(?i)\b(ignore\s+previous|forget\s+everything|disregard)',
+        r'(?i)\b(from\s+now\s+on|pretend|act\s+as)\b',
+        r'(?i)\b(bypass|circumvent|override)',
+    ]
+
+    TOKEN_PATTERNS = [
+        r'AKIA[0-9A-Z]{16}',  # AWS
+        r'ghp_[A-Za-z0-9_]{36}',  # GitHub
+        r'sk_live_[A-Za-z0-9]{24}',  # Stripe
+        r'Bearer\s+[A-Za-z0-9._-]{50,}',  # JWT-like tokens
+    ]
+
+    def scan_response(self, response_body):
+        """Returns (severity, reason) or (None, None) if clean."""
+        # Rule 1: Disclosure + token = HIGH confidence block
+        disclosure_found = any(
+            re.search(phrase, response_body)
+            for phrase in self.DISCLOSURE_PHRASES
+        )
+        token_found = any(
+            re.search(pattern, response_body)
+            for pattern in self.TOKEN_PATTERNS
+        )
+
+        if disclosure_found and token_found:
+            return ("BLOCK", "Prompt disclosure with embedded credential")
+
+        # Rule 2: Two or more jailbreak phrase groups matched = WARN
+        jailbreak_count = sum(
+            1 for phrase in self.JAILBREAK_PHRASES
+            if re.search(phrase, response_body)
+        )
+
+        if jailbreak_count >= 2:
+            return ("WARN", f"{jailbreak_count} jailbreak phrase groups matched")
+
+        # Rule 3: Disclosure alone without tokens = WARN only if very explicit
+        if disclosure_found and "system prompt:" in response_body.lower():
+            return ("WARN", "Explicit system prompt disclosure")
+        # Otherwise: clean
+        return (None, None)
+```
+
+### Why this is permissive
+
+1. **Single keywords ignored**: a lone "ignore previous instructions" in a legitimate conversation matches only one jailbreak phrase group, so it doesn't trigger
+2. **Context required**: disclosure phrases block only alongside a token pattern; on their own they warn only when the response contains an explicit "system prompt:" marker
+3. **Documentation exemption**: "instructions provided" in a help section won't block
+4. **Warn vs. block**: only block on high-confidence signals; warn on medium
+5. **No entropy-based guessing**: we don't try to be clever about detecting obfuscated prompts
+
+### False negatives this misses
+
+This detector intentionally lets through:
+- Prompt injections using novel phrasing we haven't seen
+- Paraphrased or obfuscated jailbreak attempts ("behave differently", "role-play")
+- Exfiltration via indirect methods ("describe the system", "what are your constraints")
+- Sophisticated attacks that split the prompt across multiple exchanges
+
+**Rationale:** Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.
+
+### Per-route configuration
+
+Routes can enable or disable prompt injection scanning through `inbound_detectors`:
+
+```yaml
+egress:
+  routes:
+    - host: api.anthropic.com
+      dlp:
+        outbound_detectors: [token_patterns, known_secrets]
+        inbound_detectors: [naive_injection_detection]
+
+    - host: internal-docs.corp
+      dlp:
+        outbound_detectors: [token_patterns, known_secrets]
+        inbound_detectors: false  # Skip prompt injection (trusted internal)
+```
+
 ## Implementation phases
 
-### Phase 1: MVP (2-3 weeks)
-- Token detector (regex for API key patterns)
-- Entropy detector (reuse pipelock thresholds)
-- Per-route `dlp: {enabled, request_only, skip_extensions}` config
-- Block on token match, warn on entropy hit
+### Phase 1: Secret exfiltration detection (2-3 weeks)
+**Goal:** Prevent credentials from leaking to upstream services
 
-### Phase 2: Prompt injection (1-2 weeks)
-- Pattern detector for system prompt exfiltration
-- Integrates with phase 1 config
+- **Token patterns detector** — API keys, GitHub tokens, AWS credentials (regex-based)
+- **Known secrets detector** — Check if provisioned credentials appear in outbound traffic
+  - Secrets passed to cred-proxy or agent environment
+  - Multiple encodings (base64, hex, URL-encoded variants)
+- **Outbound scanning by default** — enabled for all routes unless explicitly disabled
+- **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
+- **Action:** Block immediately on a token match; warn (without blocking) when the entropy threshold is exceeded, with sensitivity tuned low to avoid false positives
 
-### Phase 3: Hardening (optional)
-- Custom entropy heuristics for LLM payloads
-- Sampling/audit mode for high-entropy responses
+### Phase 2: Prompt injection detection (1-2 weeks)
+**Goal:** Prevent agents from exfiltrating system prompts or being jailbroken
+
+- **Naive injection detector** — as sketched above
+- **Inbound scanning by default** — enabled for all routes unless explicitly disabled
+- **Per-route config:** `inbound_detectors: [naive_injection_detection]`
+- **Actions:**
+  - BLOCK: Credential + prompt disclosure detected
+  - WARN: Multiple jailbreak keywords or explicit prompt disclosure
+  - ALLOW: Single keywords or documentation phrases
+
+### Phase 3: Hardening & tuning (2-3 weeks, optional)
+- Real-world false positive analysis from Phase 1 & 2
 - Rate limiting on DLP blocks
+- Audit/sampling mode for flagged responses
+- Additional encodings and compression formats for `known_secrets` (gzip, base32, etc.)
 
 ## Open questions