2026-06-04 14:27:32 -04:00
1 changed files with 505 additions and 0 deletions
@@ -0,0 +1,505 @@
+# DLP alternatives to pipelock: per-route configuration and response handling
+
+## Question
+
+Pipelock lacks support for per-route or per-host response scanning rules, making it impossible to skip DLP scanning for large binary downloads (e.g., `.whl` files) while keeping scanning enabled for other traffic on the same host. Should we replace pipelock with a purpose-built DLP/token-scanning proxy that supports granular per-route configuration?
+
+## Summary
+
+Yes. Pipelock's flat, global configuration is fundamentally at odds with the per-route model bot-bottle is built on. A custom or configurable DLP proxy built atop mitmproxy (which we already use for egress) would let us:
+
+1. **Skip DLP scanning selectively** — e.g., scan responses from PyPI for credentials but skip scanning `.whl` file contents
+2. **Configure scanning per-route** — different rules for different hosts/paths without global toggles
+3. **Reduce operational surface** — one proxy (egress) instead of two (egress + pipelock)
+4. **Target AI-specific threats** — focus on credential exfiltration and prompt injection instead of generic DLP
+
+**Tradeoff:** We'd need to maintain our own scanning logic. Pipelock provides out-of-the-box BIP-39 seed-phrase detection, entropy checks, and pluggable DLP rules. Building custom logic means we need to be explicit about what we're protecting against and keep that code auditable.
+
+## Current pipelock limitations
+
+### Issue 1: No per-route response scanning rules
+
+Pipelock's response scanning is part of TLS interception — a global feature with no per-host knobs:
+
+```yaml
+tls_interception:
+  enabled: true
+  passthrough_domains: [...]  # Can skip MITM, but not just response scanning
+```
+
+**Status:** Tested with pipelock v2.3.0. Confirmed that:
+- `response_body_scanning` config field doesn't exist
+- No way to set per-host response size limits
+- No way to skip scanning for specific file extensions
+- `tls_passthrough: true` disables both request AND response scanning (we want request scanning to stay on)
+
+### Issue 2: Global configuration only
+
+All of pipelock's scanning rules are global. If route A wants to skip `.whl` scanning and route B wants to skip `.tar.gz`, there's nowhere to express that distinction — the config is flat.
+
+### Issue 3: LLM prompt-specific false positives
+
+Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.
+
+### Issue 4: No prompt injection detection
+
+**Important clarification:** Pipelock does NOT detect prompt injections. It detects:
+- Token patterns (regex)
+- Entropy (random-looking strings)
+- BIP-39 seed phrases (12+ word checksums)
+
+But it cannot detect semantic attacks like:
+- Attempts to exfiltrate system prompts
+- Jailbreak attempts ("ignore previous instructions")
+- Model output that reveals internal system details
+
+This is a novel threat specific to LLM agents that pipelock wasn't designed for.
+
+## Replacement design: mitmproxy-based DLP addon
+
+Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:
+
+### Architecture
+
+```
+Agent
+  ↓ (HTTP_PROXY=http://egress:8080)
+Egress (mitmproxy)
+  ├─ Addon 1: Path allowlisting (current)
+  ├─ Addon 2: Credential injection (current)
+  └─ Addon 3: DLP scanning (NEW)
+       ├─ Config: per-route scanning rules from manifest
+       ├─ Detectors: token patterns, prompt injection, entropy
+       └─ Action: block/warn based on route config
+```
+
+### Per-route configuration in manifest
+
+Routes separately configure **outbound** (request to upstream) and **inbound** (response from upstream) scanning:
+
+```yaml
+egress:
+  routes:
+    - host: api.anthropic.com
+      dlp:
+        outbound_detectors: [token_patterns, known_secrets]  # default
+        inbound_detectors: [naive_injection_detection]  # default
+    
+    - host: files.pythonhosted.org
+      dlp:
+        outbound_detectors: [token_patterns, known_secrets]
+        inbound_detectors: false  # Skip response scanning (binary downloads)
+    
+    - host: internal-service.corp
+      dlp:
+        outbound_detectors: false
+        inbound_detectors: false  # Trusted internal, no scanning
+```
+
+**Detectors:**
+- `token_patterns` — API keys, GitHub tokens, AWS credentials, etc.
+- `known_secrets` — Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
+- `naive_injection_detection` — Semantic attacks on system prompt (see section below)
+
+### Detector design
+
+Three core detectors, each with tunable sensitivity:
+
+1. **Token detector**
+   - Regex patterns for API keys (AWS `AKIA`, GitHub `ghp_`, etc.)
+   - Anthropic/OpenAI API keys
+   - OAuth tokens (Bearer patterns)
+   - Action: Block immediately with no false-positive tolerance
+
+2. **Entropy detector**
+   - Shannon entropy threshold (bits/char)
+   - Flags high-entropy secrets (tunable per-route)
+   - Current pipelock default: 4.5 bits/char
+   - Action: Warn or block based on route config
+
+3. **Prompt injection detector** (phase 2)
+   - Detect attempts to exfiltrate system prompts via LLM outputs
+   - Pattern: responses containing "system prompt", "instructions", "directive" + credential
+   - Action: Block or sample for audit
+
+### Advantages over pipelock
+
+| Aspect | Pipelock | Mitmproxy addon |
+|--------|----------|-----------------|
+| Per-route rules | ❌ (global only) | ✅ (manifest-driven) |
+| Response-specific config | ❌ (all-or-nothing) | ✅ (request_only, skip_extensions) |
+| Request scanning overhead | ✅ (lightweight) | ~same |
+| Maintenance burden | Low (third-party) | High (custom code) |
+| Auditability | Closed source | ✅ (in-repo) |
+| AI-specific detection | Limited | ✅ (token patterns, prompt injection) |
+| Code reuse | None | ✅ (egress addon framework) |
+
+### Disadvantages
+
+1. **Maintenance responsibility** — We own the security logic. Any bugs in detector regexes or entropy thresholds are our problem.
+2. **Feature parity gap** — Pipelock's BIP-39 detector is sophisticated. We'd need to decide: replicate it, skip it, or ship a simplified version.
+3. **Performance** — Custom Python detectors will be slower than pipelock's Go implementation. Benchmarking needed.
+4. **Coverage breadth** — Pipelock covers generic DLP (credit cards, SSNs, etc.). We'd focus narrowly on AI/credential exfil.
+
+## Alternative: Configurable pipelock fork
+
+Rather than build from scratch, fork pipelock and add `response_body_scanning` config:
+
+```yaml
+response_body_scanning:
+  enabled: true
+  skip_extensions: [".whl", ".tar.gz"]
+  max_response_bytes: 104857600  # 100MB
+```
+
+**Pros:**
+- Reuses existing detectors and maturity
+- Lower maintenance burden
+- Clear path to upstream (could be PR'd)
+
+**Cons:**
+- Still maintains a fork
+- Pipelock's maintainers may not want global per-host rules
+- Go code is farther from our codebase (harder to audit)
+- Doesn't solve prompt-injection detection
+
+## Recommendation
+
+**Build the mitmproxy addon** (phase 1: tokens + entropy; phase 2: prompt injection).
+
+**Rationale:**
+1. Bot-bottle already owns the mitmproxy egress addon — extending it keeps security logic in-repo and auditable.
+2. Per-route DLP configuration aligns with bot-bottle's design (PRD 0017 is already per-route).
+3. Replacing pipelock reduces sidecar count and operational surface.
+4. AI-specific detectors (tokens, prompt injection) matter more than generic DLP for agent containment.
+
+**Fallback:** If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.
+
+## Naive prompt injection detector design
+
+Since pipelock doesn't detect prompt injections, we need a custom detector. Here's a permissive design that favors missing attacks over false positives:
+
+### What to detect
+
+**High confidence (block immediately):**
+1. Response contains known credential pattern + "system prompt" phrase together
+2. Response contains both "instructions" and a token pattern
+
+**Medium confidence (warn):**
+1. Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
+2. Multiple jailbreak keywords in single response
+
+**Ignore (too noisy):**
+- Single jailbreak keywords without additional context
+- "system prompt" in documentation contexts
+- Common phrases like "instructions provided"
+
+### Naive detector pseudocode
+
+```python
+class PromptInjectionDetector:
+    # Phrases that suggest prompt exfiltration
+    DISCLOSURE_PHRASES = [
+        r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?)',
+        r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
+    ]
+    
+    # Phrases suggesting jailbreak attempts
+    JAILBREAK_PHRASES = [
+        r'(?i)(ignore\s+previous|forget\s+everything|disregard)',
+        r'(?i)(from\s+now\s+on|pretend|act\s+as)',
+        r'(?i)(bypass|circumvent|override)',
+    ]
+    
+    TOKEN_PATTERNS = [
+        r'AKIA[0-9A-Z]{16}',  # AWS
+        r'ghp_[A-Za-z0-9_]{36}',  # GitHub
+        r'sk_live_[A-Za-z0-9]{24}',  # Stripe
+        r'Bearer\s+[A-Za-z0-9._-]{50,}',  # JWT-like tokens
+    ]
+    
+    def scan_response(self, response_body):
+        """Returns (severity, reason) or (None, None) if clean."""
+        
+        # Rule 1: Disclosure + token = HIGH confidence block
+        disclosure_found = any(
+            re.search(phrase, response_body) 
+            for phrase in self.DISCLOSURE_PHRASES
+        )
+        token_found = any(
+            re.search(pattern, response_body)
+            for pattern in self.TOKEN_PATTERNS
+        )
+        
+        if disclosure_found and token_found:
+            return ("BLOCK", "Prompt disclosure with embedded credential")
+        
+        # Rule 2: Multiple jailbreak keywords = WARN
+        jailbreak_count = sum(
+            1 for phrase in self.JAILBREAK_PHRASES
+            if re.search(phrase, response_body)
+        )
+        
+        if jailbreak_count >= 2:
+            return ("WARN", f"{jailbreak_count} jailbreak attempts detected")
+        
+        # Rule 3: Disclosure alone without tokens = WARN only if very explicit
+        if disclosure_found and "system prompt:" in response_body.lower():
+            return ("WARN", "Explicit system prompt disclosure")
+        
+        # Otherwise: clean
+        return (None, None)
+```
+
+### Why this is permissive
+
+1. **Single keywords ignored** — "ignore previous instructions" in a legitimate conversation doesn't trigger
+2. **Context required** — disclosure phrases need tokens or multiple jailbreak attempts
+3. **Documentation exemption** — "instructions provided" in a help section won't block
+4. **Warn vs. block** — Only block on high-confidence signals; warn on medium
+5. **No entropy-based guessing** — We don't try to be clever about detecting obfuscated prompts
+
+### False negatives this misses
+
+This detector intentionally lets through:
+- Prompt injections using novel phrasing we haven't seen
+- Obfuscated jailbreak attempts ("behave differently", "role-play")
+- Exfiltration via indirect methods ("describe the system", "what are your constraints")
+- Sophisticated attacks that split the prompt across multiple exchanges
+
+**Rationale:** Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.
+
+### Per-route configuration
+
+Routes can enable/disable prompt injection scanning:
+
+```yaml
+egress:
+  routes:
+    - host: api.anthropic.com
+      dlp:
+        enabled: true
+        detectors: [tokens, prompt_injection]
+    
+    - host: internal-docs.corp
+      dlp:
+        enabled: true
+        detectors: [tokens]  # Skip prompt injection (trusted internal)
+```
+
+## Implementation phases
+
+### Phase 1: Secret exfiltration detection
+**Goal:** Prevent credentials from leaking to upstream services
+
+- **Token patterns detector** — API keys, GitHub tokens, AWS credentials (regex-based)
+- **Known secrets detector** — Check if provisioned credentials appear in outbound traffic
+  - Secrets passed to cred-proxy or agent environment
+  - Multiple encodings (base64, hex, URL-encoded variants)
+- **Outbound scanning by default** — enabled for all routes unless explicitly disabled
+- **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
+- **Action:** Block immediately on token match; warn on entropy threshold (tuned low to avoid false positives)
+
+### Phase 2: Prompt injection detection
+**Goal:** Prevent agents from exfiltrating system prompts or being jailbroken
+
+#### Option A: Naive pattern-based detector
+- **Naive injection detector** — as sketched above
+- **Inbound scanning by default** — enabled for all routes unless explicitly disabled
+- **Per-route config:** `inbound_detectors: [naive_injection_detection]`
+- **Actions:**
+  - BLOCK: Credential + prompt disclosure detected
+  - WARN: Multiple jailbreak keywords or explicit prompt disclosure
+  - ALLOW: Single keywords or documentation phrases
+
+#### Option B: LLM-based semantic detector
+See section below on using a specialized LLM for prompt injection detection.
+
+### Phase 3: Hardening & tuning
+- Real-world false positive analysis from Phase 1 & 2
+- Rate limiting on DLP blocks
+- Audit/sampling mode for flagged responses
+- Additional encodings for known_secrets (GZIP, base32, etc.)
+
+## LLM-based prompt injection detection
+
+### Viability analysis
+
+**Tradeoff:** Using an LLM to detect prompt injections is semantically more powerful than regex, but has latency and resource costs.
+
+**Requirements for bot-bottle:**
+- Sub-100ms latency (add-on to HTTP proxy, can't block traffic significantly)
+- <1GB RAM footprint (runs in sidecar alongside mitmproxy)
+- Simple API (classify: safe/injection/suspicious)
+- Preferably quantized/distilled (not full-size models)
+
+**Feasibility:** Marginal. Regex patterns are faster, but an LLM could catch sophisticated attacks.
+
+### Existing models
+
+**Purpose-built prompt injection detectors:**
+1. **Rebuff.ai's Prompt Injection API** (closed-source, commercial)
+   - Hosted detection service
+   - ~50ms per request
+   - Not viable (external dependency, adds latency)
+
+2. **Microsoft's Presidio** + custom rules
+   - Entity recognition + PII detection
+   - Broader than prompt injection
+   - Would need custom training for jailbreak/disclosure patterns
+
+3. **HuggingFace models:**
+   - `roberta-large-openai-detector` — detects GPT-2 text (not injections)
+   - No off-the-shelf model specifically for prompt injection
+
+**Training a custom model:**
+- **Data:** Dataset of prompt injection attempts vs. legitimate responses (limited public datasets)
+- **Architecture:** Binary classifier (DistilBERT, ALBERT) fine-tuned on injection examples
+- **Size:** DistilBERT ~268MB, quantized ~67MB (acceptable footprint)
+- **Latency:** ~50-150ms per response on CPU (concerning for proxy)
+
+### Recommendation
+
+**Phase 2a: Use naive pattern detector** (regex-based, sketched above)
+- Fast (<5ms per response)
+- Low false positives with permissive rules
+- No external dependencies
+
+**Phase 2b (optional, if needed): Evaluate LLM approach**
+- Collect real-world false negatives from pattern detector
+- If sophisticated attacks slip through, consider DistilBERT-based classifier
+- Quantize + run locally in sidecar
+- Benchmark against 100ms latency budget
+- Fall back to patterns if latency unacceptable
+
+**Why not jump to LLM:**
+1. Latency: 50-150ms adds significant overhead to every response
+2. Complexity: Custom model training needed; no off-the-shelf solution
+3. Overkill: Pattern detector catches obvious attacks; sophisticated attacks are rare
+4. Unknown unknowns: Adversaries can evade LLM-based detectors via adversarial prompts
+
+### If we do build an LLM detector
+
+```python
+# Sketch of LLM-based detection
+class LLMPromptInjectionDetector:
+    def __init__(self):
+        # Quantized DistilBERT, fine-tuned on injection examples
+        self.model = load_model("prompt-injection-classifier-q4")  # ~67MB
+        self.tokenizer = load_tokenizer("distilbert-base-uncased")
+    
+    def scan_response(self, response_body, timeout_ms=100):
+        """
+        Returns: (verdict, confidence)
+        - verdict: "safe", "suspicious", "injection"
+        - confidence: 0.0-1.0
+        """
+        try:
+            # Timeout hard at 100ms to avoid proxy bottleneck
+            tokens = self.tokenizer.encode(response_body[:2000], truncation=True)
+            logits = self.model(tokens, timeout=timeout_ms)
+            
+            injection_score = logits["injection_class"]
+            
+            if injection_score > 0.9:
+                return ("injection", injection_score)
+            elif injection_score > 0.7:
+                return ("suspicious", injection_score)
+            else:
+                return ("safe", injection_score)
+        except TimeoutError:
+            # On timeout, fall back to pattern detector
+            return self.fallback_pattern_detector(response_body)
+```
+
+**Deployment questions:**
+1. Which LLM framework? (transformers, ONNX, TensorRT?)
+2. How to handle out-of-memory on large responses?
+3. How to update model if new jailbreak techniques emerge?
+4. Should we ensemble: LLM + patterns for high-confidence blocks?
+
+## CaMeL: Architectural prompt injection defense
+
+[CaMeL](https://github.com/google-research/camel-prompt-injection) (CApabilities for MachinE Learning) is a Google DeepMind research project that takes a fundamentally different approach to prompt injection. Rather than *detecting* injections at the network layer (like everything above), CaMeL *prevents* them by architecture — it makes prompt injections structurally unable to cause harm even when the underlying LLM is fully compromised.
+
+### How CaMeL works
+
+CaMeL wraps the agent in a protective execution layer with three components:
+
+1. **Privileged LLM (P-LLM):** Sees ONLY the user's trusted query. Generates a plan as restricted Python code — which tools to call, in what order, with what data flow. Never sees untrusted content (tool outputs, external data).
+
+2. **Quarantined LLM (Q-LLM):** Processes untrusted data (parsing emails, extracting names from documents). Has NO tool access. Even if prompt-injected, it can only return parsed data — it cannot change which tools run or how they're called.
+
+3. **CaMeL interpreter:** Executes the P-LLM's plan, maintains a **data flow graph** tracking every value's provenance (where it came from, who can read it), and enforces **capability-based security policies** before each tool call.
+
+Example flow for "Send Bob the document from our meeting notes":
+
+```
+P-LLM generates plan (trusted, no external data):
+    notes = find_meeting_notes()
+    doc_name = query_quarantined_llm(f"extract doc name from {notes}")
+    doc = fetch_document(doc_name)
+    address = query_quarantined_llm(f"extract Bob's email from {notes}")
+    send_email(doc, address)
+
+Interpreter executes:
+    1. find_meeting_notes() → tags result as source:notes, readers:[user]
+    2. Q-LLM extracts doc name → even if injected, Q-LLM can't call tools
+    3. fetch_document() → tags result with source:drive
+    4. Q-LLM extracts email → even if injected to return attacker@evil.com,
+       the capability check on send_email() catches that `address` came from
+       an untrusted source and blocks or prompts the user
+```
+
+The key insight: **control flow is determined by the trusted P-LLM's code, not by data content**. Untrusted data can never alter which tools run. And capabilities track provenance so that even manipulated *data values* get caught at tool-call time.
+
+### Performance
+
+On AgentDojo benchmark: **77% task completion with provable security guarantees** vs. 84% for an undefended agent. The 7% utility cost comes from the interpreter occasionally being too conservative (blocking legitimate operations where data provenance is ambiguous).
+
+### Integration with bot-bottle: assessment
+
+**CaMeL is NOT a replacement for pipelock or a network-layer DLP scanner.** It operates at a completely different layer — it's an agent execution framework, not a proxy. It wouldn't help with the original problem (scanning `.whl` downloads for credentials).
+
+However, CaMeL is deeply relevant to bot-bottle's broader security model:
+
+| Layer | Current bot-bottle | CaMeL equivalent |
+|-------|-------------------|------------------|
+| Network egress | Pipelock (hostname allowlist + DLP) | N/A (doesn't operate here) |
+| Credential injection | Egress addon (per-route auth) | N/A |
+| Tool access control | None (agent has full permissions) | **Capability-based policies** |
+| Data provenance | None | **Data flow graph** |
+| Control flow integrity | None (agent decides everything) | **P-LLM generates plan, interpreter enforces** |
+
+**What CaMeL would add that bot-bottle lacks today:**
+- **Data flow tracking** — bot-bottle controls *which hosts* an agent can reach, but not *what data* flows to those hosts. CaMeL tracks provenance per-value.
+- **Tool-call policies** — bot-bottle doesn't restrict which tools an agent calls or what arguments it passes. CaMeL enforces policies at every tool invocation.
+- **Separation of planning and execution** — bot-bottle gives the agent full autonomy. CaMeL splits planning (trusted) from data processing (untrusted).
+
+**Why CaMeL is NOT viable for bot-bottle today:**
+
+1. **Research artifact, not production software.** The README explicitly warns: "the interpreter implementation likely contains bugs...and might not be fully secure." Apache-2.0 licensed but no maintenance commitment.
+
+2. **Requires restructuring the agent.** CaMeL doesn't wrap an existing agent — it *replaces* the agent's execution model. Claude Code / Codex would need to be fundamentally rearchitected to generate CaMeL-compatible plans instead of directly calling tools. This is not a drop-in.
+
+3. **LLM overhead.** CaMeL requires two LLM calls per step (P-LLM for planning, Q-LLM for data parsing). For a coding agent that makes hundreds of tool calls per session, this doubles API costs and adds significant latency.
+
+4. **Utility cost.** 7% task completion loss on AgentDojo. For a coding agent where correctness matters, even small degradation in capability could be unacceptable.
+
+5. **Scope mismatch.** CaMeL protects against prompt injection via untrusted data sources. Bot-bottle's primary threat model is credential exfiltration and sandbox escape — different attack surface.
+
+### Verdict
+
+**Don't integrate CaMeL now.** It solves a real problem (prompt injection via data flow manipulation) but at a layer bot-bottle doesn't currently operate at, and with maturity/integration costs that are too high.
+
+**Watch it for the future.** If CaMeL matures into a production-ready library, its capability model could complement bot-bottle's network-layer controls — bot-bottle handles "which hosts can the agent reach" while CaMeL handles "what data can flow to those hosts." The combination would be defense-in-depth across both network and application layers.
+
+**For now, our phases stand:** Phase 1 (outbound secret exfiltration via DLP addon) and Phase 2 (inbound prompt injection via naive pattern detector) address bot-bottle's immediate needs at the network layer where we already operate.
+
+## Open questions
+
+1. **Performance:** How much latency does Python string-matching add? Benchmark against pipelock.
+2. **False positives:** Will entropy detector trip on legitimate high-entropy traffic (e.g., binary API responses)? Need real-world testing.
+3. **Coverage:** Are regex patterns sufficient, or do we need more sophisticated token detection (e.g., format validation)?
+4. **Upstream:** If we build this, should we upstream it as an option to pipelock, or keep it bot-bottle-specific?
+5. **CaMeL long-term:** Monitor the project for production readiness. If it stabilizes, evaluate as a complementary application-layer defense alongside our network-layer DLP.