bot-bottle/docs/research/dlp-alternatives-to-pipelock.md

# DLP alternatives to pipelock: per-route configuration and response handling

## Question

Pipelock lacks support for per-route or per-host response scanning rules, making it impossible to skip DLP scanning for large binary downloads (e.g., `.whl` files) while keeping scanning enabled for other traffic on the same host. Should we replace pipelock with a purpose-built DLP/token-scanning proxy that supports granular per-route configuration?

## Summary

Yes. Pipelock's flat, global configuration is fundamentally at odds with the per-route model bot-bottle is built on. A custom or configurable DLP proxy built atop mitmproxy (which we already use for egress) would let us:

1. **Skip DLP scanning selectively** — e.g., scan responses from PyPI for credentials but skip scanning `.whl` file contents
2. **Configure scanning per-route** — different rules for different hosts/paths without global toggles
3. **Reduce operational surface** — one proxy (egress) instead of two (egress + pipelock)
4. **Target AI-specific threats** — focus on credential exfiltration and prompt injection instead of generic DLP

**Tradeoff:** We'd need to maintain our own scanning logic. Pipelock provides out-of-the-box BIP-39 seed-phrase detection, entropy checks, and pluggable DLP rules. Building custom logic means we need to be explicit about what we're protecting against and keep that code auditable.

## Current pipelock limitations

### Issue 1: No per-route response scanning rules

Pipelock's response scanning is part of TLS interception — a global feature with no per-host knobs:

```yaml
tls_interception:
  enabled: true
  passthrough_domains: [...]  # Can skip MITM, but not just response scanning
```

**Status:** Tested with pipelock v2.3.0. Confirmed that:
- `response_body_scanning` config field doesn't exist
- No way to set per-host response size limits
- No way to skip scanning for specific file extensions
- `tls_passthrough: true` disables both request AND response scanning (we want request scanning to stay on)

### Issue 2: Global configuration only

All of pipelock's scanning rules are global. If route A wants to skip `.whl` scanning and route B wants to skip `.tar.gz`, there's nowhere to express that distinction — the config is flat.

### Issue 3: LLM prompt-specific false positives

Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.

### Issue 4: No prompt injection detection

**Important clarification:** Pipelock does NOT detect prompt injections. It detects:
- Token patterns (regex)
- Entropy (random-looking strings)
- BIP-39 seed phrases (12+ word checksums)

But it cannot detect semantic attacks like:
- Attempts to exfiltrate system prompts
- Jailbreak attempts ("ignore previous instructions")
- Model output that reveals internal system details

This is a novel threat specific to LLM agents that pipelock wasn't designed for.

## Replacement design: mitmproxy-based DLP addon

Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:

### Architecture

```
Agent
  ↓ (HTTP_PROXY=http://egress:8080)
Egress (mitmproxy)
  ├─ Addon 1: Path allowlisting (current)
  ├─ Addon 2: Credential injection (current)
  └─ Addon 3: DLP scanning (NEW)
       ├─ Config: per-route scanning rules from manifest
       ├─ Detectors: token patterns, prompt injection, entropy
       └─ Action: block/warn based on route config
```
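A minimal sketch of how Addon 3 could plug into the egress proxy, assuming mitmproxy's standard `request` and `response` addon hooks. The `DLPAddon` name, the `scan` callable, and the `make_block_response` factory are hypothetical; in a real deployment the factory would likely wrap `mitmproxy.http.Response.make(403, ...)`. The sketch deliberately avoids importing mitmproxy so the decision logic can be exercised with a fake flow object.

```python
from typing import Callable, List, Optional, Tuple


class DLPAddon:
    """Hypothetical DLP addon: scans both directions, blocks on a BLOCK verdict."""

    def __init__(
        self,
        scan: Callable[[str, str, str], Optional[str]],
        make_block_response: Callable[[str], object],
    ):
        # Assumed contract: scan(host, direction, body) -> "BLOCK", "WARN", or None.
        self.scan = scan
        self.make_block_response = make_block_response
        self.warnings: List[Tuple[str, str]] = []

    def _handle(self, flow, direction: str, message) -> None:
        host = flow.request.pretty_host
        body = message.get_text(strict=False) or ""
        verdict = self.scan(host, direction, body)
        if verdict == "BLOCK":
            # Replacing flow.response means the original message is not delivered.
            flow.response = self.make_block_response(
                f"DLP blocked {direction} traffic for {host}"
            )
        elif verdict == "WARN":
            self.warnings.append((host, direction))

    def request(self, flow) -> None:
        # Hook invoked before the request is sent upstream.
        self._handle(flow, "outbound", flow.request)

    def response(self, flow) -> None:
        # Hook invoked before the response is returned to the agent.
        self._handle(flow, "inbound", flow.response)
```

Keeping the verdict logic behind an injected `scan` callable means the per-route rules and detectors can be unit-tested without a running proxy.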

### Per-route configuration in manifest

Routes separately configure **outbound** (request to upstream) and **inbound** (response from upstream) scanning:

```yaml
egress:
  routes:
    - host: api.anthropic.com
      dlp:
        outbound_detectors: [token_patterns, known_secrets]  # default
        inbound_detectors: [naive_injection_detection]  # default

    - host: files.pythonhosted.org
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: false  # Skip response scanning (binary downloads)

    - host: internal-service.corp
      dlp:
        outbound_detectors: false
        inbound_detectors: false  # Trusted internal, no scanning
```

**Detectors:**
- `token_patterns` — API keys, GitHub tokens, AWS credentials, etc.
- `known_secrets` — Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
- `naive_injection_detection` — Semantic attacks on system prompt (see section below)
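
As a rough illustration, the addon could resolve which detectors to run for a given host and direction along these lines. The `detectors_for` name, exact-host matching, and the fallback to defaults for hosts without a matching route are assumptions for this sketch, not settled design; the defaults mirror the `# default` comments in the manifest above.

```python
from typing import Any, Dict, List

# Defaults taken from the "# default" comments in the manifest example.
DEFAULT_DETECTORS = {
    "outbound": ["token_patterns", "known_secrets"],
    "inbound": ["naive_injection_detection"],
}


def detectors_for(routes: List[Dict[str, Any]], host: str, direction: str) -> List[str]:
    """Return the detector names to run for `host` in the given direction."""
    if direction not in DEFAULT_DETECTORS:
        raise ValueError(f"unknown direction: {direction!r}")
    for route in routes:
        if route.get("host") != host:
            continue
        dlp = route.get("dlp") or {}
        configured = dlp.get(f"{direction}_detectors", DEFAULT_DETECTORS[direction])
        # YAML `false` parses to Python False and means "skip scanning".
        if configured is False:
            return []
        return list(configured)
    # Assumption: hosts without a matching route get the defaults.
    return list(DEFAULT_DETECTORS[direction])
```

With the manifest above loaded as a list of dicts, `detectors_for(routes, "files.pythonhosted.org", "inbound")` would return an empty list, so `.whl` downloads are never scanned while requests to the same host still are.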

### Detector design

Three core detectors, each with tunable sensitivity:

1. **Token detector**
   - Regex patterns for API keys (AWS `AKIA`, GitHub `ghp_`, etc.)
   - Anthropic/OpenAI API keys
   - OAuth tokens (Bearer patterns)
   - Action: Block immediately; patterns must be specific enough that false positives are negligible

2. **Entropy detector**
   - Shannon entropy threshold (bits/char)
   - Flags high-entropy secrets (tunable per-route)
   - Current pipelock default: 4.5 bits/char
   - Action: Warn or block based on route config

3. **Prompt injection detector** (phase 2)
   - Detect attempts to exfiltrate system prompts via LLM outputs
   - Pattern: responses containing "system prompt", "instructions", "directive" + credential
   - Action: Block or sample for audit
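
For reference, the entropy detector's core measurement can be sketched as follows. The 4.5 bits/char default mirrors the pipelock default cited above; the `min_length` cutoff and both function names are assumptions added for illustration (short strings give unreliable entropy estimates).

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text` in bits per character (0.0 for empty input)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def exceeds_entropy_threshold(
    token: str, threshold: float = 4.5, min_length: int = 20
) -> bool:
    """Flag `token` if it is long enough and looks random enough."""
    return len(token) >= min_length and shannon_entropy(token) > threshold
```

A string of 32 distinct characters has exactly 5 bits/char and would be flagged at the default threshold, while repeated or natural-language text scores well below it.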

### Advantages over pipelock

| Aspect | Pipelock | Mitmproxy addon |
|--------|----------|-----------------|
| Per-route rules | ❌ (global only) | ✅ (manifest-driven) |
| Response-specific config | ❌ (all-or-nothing) | ✅ (per-route `inbound_detectors`, independent of `outbound_detectors`) |
| Request scanning overhead | ✅ (lightweight Go) | Likely higher (Python); needs benchmarking |
| Maintenance burden | Low (third-party) | High (custom code) |
| Auditability | Third-party Go codebase (harder for us to audit) | ✅ (in-repo) |
| AI-specific detection | Limited | ✅ (token patterns, prompt injection) |
| Code reuse | None | ✅ (egress addon framework) |

### Disadvantages

1. **Maintenance responsibility** — We own the security logic. Any bugs in detector regexes or entropy thresholds are our problem.
2. **Feature parity gap** — Pipelock's BIP-39 detector is sophisticated. We'd need to decide: replicate it, skip it, or ship a simplified version.
3. **Performance** — Custom Python detectors will be slower than pipelock's Go implementation. Benchmarking needed.
4. **Coverage breadth** — Pipelock covers generic DLP (credit cards, SSNs, etc.). We'd focus narrowly on AI/credential exfil.

## Alternative: Configurable pipelock fork

Rather than build from scratch, fork pipelock and add `response_body_scanning` config:

```yaml
response_body_scanning:
  enabled: true
  skip_extensions: [".whl", ".tar.gz"]
  max_response_bytes: 104857600  # 100MB
```

**Pros:**
- Reuses existing detectors and maturity
- Lower maintenance burden
- Clear path to upstream (could be PR'd)

**Cons:**
- Still maintains a fork
- Pipelock's maintainers may not want per-host rules in what is currently a global configuration
- Go code is farther from our codebase (harder to audit)
- Doesn't solve prompt-injection detection

## Recommendation

**Build the mitmproxy addon** (phase 1: token patterns, known secrets, and entropy; phase 2: prompt injection).

**Rationale:**
1. Bot-bottle already owns the mitmproxy egress addon — extending it keeps security logic in-repo and auditable.
2. Per-route DLP configuration aligns with bot-bottle's design (PRD 0017 is already per-route).
3. Replacing pipelock reduces sidecar count and operational surface.
4. AI-specific detectors (tokens, prompt injection) matter more than generic DLP for agent containment.

**Fallback:** If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.

## Naive prompt injection detector design

Since pipelock doesn't detect prompt injections, we need a custom detector. Here's a deliberately permissive design that accepts missed attacks in exchange for fewer false positives:

### What to detect

**High confidence (block immediately):**
1. Response contains known credential pattern + "system prompt" phrase together
2. Response contains both "instructions" and a token pattern

**Medium confidence (warn):**
1. Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
2. Multiple jailbreak keywords in single response

**Ignore (too noisy):**
- Single jailbreak keywords without additional context
- "system prompt" in documentation contexts
- Common phrases like "instructions provided"

### Naive detector pseudocode

```python
import re


class PromptInjectionDetector:
    # Phrases that suggest prompt exfiltration
    DISCLOSURE_PHRASES = [
        r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?)',
        r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
    ]

    # Phrases suggesting jailbreak attempts
    JAILBREAK_PHRASES = [
        r'(?i)(ignore\s+previous|forget\s+everything|disregard)',
        r'(?i)(from\s+now\s+on|pretend|act\s+as)',
        r'(?i)(bypass|circumvent|override)',
    ]

    TOKEN_PATTERNS = [
        r'AKIA[0-9A-Z]{16}',  # AWS
        r'ghp_[A-Za-z0-9_]{36}',  # GitHub
        r'sk_live_[A-Za-z0-9]{24}',  # Stripe
        r'Bearer\s+[A-Za-z0-9._-]{50,}',  # JWT-like tokens
    ]

    def scan_response(self, response_body):
        """Returns (severity, reason) or (None, None) if clean."""

        # Rule 1: Disclosure + token = HIGH confidence block
        disclosure_found = any(
            re.search(phrase, response_body)
            for phrase in self.DISCLOSURE_PHRASES
        )
        token_found = any(
            re.search(pattern, response_body)
            for pattern in self.TOKEN_PATTERNS
        )

        if disclosure_found and token_found:
            return ("BLOCK", "Prompt disclosure with embedded credential")

        # Rule 2: Multiple jailbreak keywords = WARN
        jailbreak_count = sum(
            1 for phrase in self.JAILBREAK_PHRASES
            if re.search(phrase, response_body)
        )

        if jailbreak_count >= 2:
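            # The count is the number of JAILBREAK_PHRASES groups that matched,
            # not the number of individual attempts in the response.
            return ("WARN", f"{jailbreak_count} jailbreak phrase groups matched")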

        # Rule 3: Disclosure alone without tokens = WARN only if very explicit
        if disclosure_found and "system prompt:" in response_body.lower():
            return ("WARN", "Explicit system prompt disclosure")

        # Otherwise: clean
        return (None, None)
```

### Why this is permissive

1. **Single keywords ignored** — "ignore previous instructions" in a legitimate conversation doesn't trigger
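2. **Context required:** disclosure phrases only count alongside a token pattern or an explicit `system prompt:` marker, and jailbreak phrases only count when two or more pattern groups match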
3. **Documentation exemption** — "instructions provided" in a help section won't block
4. **Warn vs. block** — Only block on high-confidence signals; warn on medium
5. **No entropy-based guessing** — We don't try to be clever about detecting obfuscated prompts

### False negatives this misses

This detector intentionally lets through:
- Prompt injections using novel phrasing we haven't seen
- Obfuscated jailbreak attempts ("behave differently", "role-play")
- Exfiltration via indirect methods ("describe the system", "what are your constraints")
- Sophisticated attacks that split the prompt across multiple exchanges

**Rationale:** Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.

### Per-route configuration

Routes can enable/disable prompt injection scanning:

```yaml
egress:
  routes:
    - host: api.anthropic.com
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: [naive_injection_detection]

    - host: internal-docs.corp
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: false  # Skip prompt injection (trusted internal)
```

## Implementation phases

### Phase 1: Secret exfiltration detection
**Goal:** Prevent credentials from leaking to upstream services

- **Token patterns detector** — API keys, GitHub tokens, AWS credentials (regex-based)
- **Known secrets detector** — Check if provisioned credentials appear in outbound traffic
  - Secrets passed to cred-proxy or agent environment
  - Multiple encodings (base64, hex, URL-encoded variants)
- **Outbound scanning by default** — enabled for all routes unless explicitly disabled
- **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
- **Action:** Block immediately on a token or known-secret match; warn when the entropy threshold is exceeded (detector sensitivity kept low to avoid false positives)
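
A minimal sketch of the known-secrets check, assuming provisioned secrets are available to the addon as a name-to-value mapping. The function names and the exact set of encodings are illustrative only.

```python
import base64
from typing import Dict, Optional, Set
from urllib.parse import quote


def secret_variants(secret: str) -> Set[str]:
    """Return the literal secret plus a few common encodings of it."""
    raw = secret.encode("utf-8")
    return {
        secret,
        base64.b64encode(raw).decode("ascii"),
        base64.urlsafe_b64encode(raw).decode("ascii"),
        raw.hex(),
        raw.hex().upper(),
        quote(secret, safe=""),
    }


def find_known_secret(body: str, secrets: Dict[str, str]) -> Optional[str]:
    """Return the name of the first provisioned secret found in `body`, else None."""
    for name, value in secrets.items():
        if not value:
            continue  # an empty value would match every body
        if any(variant in body for variant in secret_variants(value)):
            return name
    return None
```

One known limitation of this approach: a secret that is base64-encoded as part of a longer payload will not match, because the encoded form depends on the secret's byte offset within that payload. Catching that case would require decoding candidate base64 runs before matching.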

### Phase 2: Prompt injection detection
**Goal:** Prevent agents from exfiltrating system prompts or being jailbroken

#### Option A: Naive pattern-based detector
- **Naive injection detector** — as sketched above
- **Inbound scanning by default** — enabled for all routes unless explicitly disabled
- **Per-route config:** `inbound_detectors: [naive_injection_detection]`
- **Actions:**
  - BLOCK: Credential + prompt disclosure detected
  - WARN: Multiple jailbreak keywords or explicit prompt disclosure
  - ALLOW: Single keywords or documentation phrases

#### Option B: LLM-based semantic detector
See section below on using a specialized LLM for prompt injection detection.

### Phase 3: Hardening & tuning
- Real-world false positive analysis from Phase 1 & 2
- Rate limiting on DLP blocks
- Audit/sampling mode for flagged responses
- Additional encodings for known_secrets (GZIP, base32, etc.)

## LLM-based prompt injection detection

### Viability analysis

**Tradeoff:** Using an LLM to detect prompt injections is semantically more powerful than regex, but has latency and resource costs.

**Requirements for bot-bottle:**
- Sub-100ms latency (the detector runs inline in the HTTP proxy, so it cannot add significant delay to traffic)
- <1GB RAM footprint (runs in sidecar alongside mitmproxy)
- Simple API (classify: safe/injection/suspicious)
- Preferably quantized/distilled (not full-size models)

**Feasibility:** Marginal. Regex patterns are faster, but an LLM could catch sophisticated attacks.

### Existing models

**Purpose-built prompt injection detectors:**
1. **Rebuff.ai's Prompt Injection API** (closed-source, commercial)
   - Hosted detection service
   - ~50ms per request
   - Not viable (external dependency, adds latency)

2. **Microsoft's Presidio** + custom rules
   - Entity recognition + PII detection
   - Broader than prompt injection
   - Would need custom training for jailbreak/disclosure patterns

3. **HuggingFace models:**
   - `roberta-large-openai-detector` — detects GPT-2 text (not injections)
   - No off-the-shelf model specifically for prompt injection

**Training a custom model:**
- **Data:** Dataset of prompt injection attempts vs. legitimate responses (limited public datasets)
- **Architecture:** Binary classifier (DistilBERT, ALBERT) fine-tuned on injection examples
- **Size:** DistilBERT ~268MB, quantized ~67MB (acceptable footprint)
- **Latency:** ~50-150ms per response on CPU (concerning for proxy)

### Recommendation

**Phase 2a: Use naive pattern detector** (regex-based, sketched above)
- Fast (expected <5ms per response; not yet benchmarked)
- Low false positives with permissive rules
- No external dependencies

**Phase 2b (optional, if needed): Evaluate LLM approach**
- Collect real-world false negatives from pattern detector
- If sophisticated attacks slip through, consider DistilBERT-based classifier
- Quantize + run locally in sidecar
- Benchmark against 100ms latency budget
- Fall back to patterns if latency unacceptable
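
The latency benchmark in Phase 2b could start from something as simple as the harness below. It is a sketch under the assumption that a detector is any callable taking a response body; the function names and the choice of the 95th percentile as the metric are illustrative.

```python
import statistics
import time
from typing import Callable, Sequence


def p95_latency_ms(
    detector: Callable[[str], object], samples: Sequence[str], repeats: int = 20
) -> float:
    """Return the 95th-percentile wall-clock latency of `detector` in milliseconds."""
    timings = []
    for _ in range(repeats):
        for body in samples:
            start = time.perf_counter()
            detector(body)
            timings.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    return statistics.quantiles(timings, n=20)[-1]


def within_budget(
    detector: Callable[[str], object], samples: Sequence[str], budget_ms: float = 100.0
) -> bool:
    """True if the detector's p95 latency fits the proxy latency budget."""
    return p95_latency_ms(detector, samples) <= budget_ms
```

Running the same harness against both the pattern detector and a candidate classifier, on captured response bodies, gives a like-for-like comparison against the 100ms budget.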

**Why not jump to LLM:**
1. Latency: 50-150ms adds significant overhead to every response
2. Complexity: Custom model training needed; no off-the-shelf solution
3. Overkill: Pattern detector catches obvious attacks; sophisticated attacks are rare
4. Unknown unknowns: Adversaries can evade LLM-based detectors via adversarial prompts

### If we do build an LLM detector

```python
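# Sketch of LLM-based detection. No trained model exists yet, so the
# classifier and the pattern-based fallback are injected as callables.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Tuple

Verdict = Tuple[str, float]


class LLMPromptInjectionDetector:
    def __init__(
        self,
        classify: Callable[[str], float],
        fallback: Callable[[str], Verdict],
        max_chars: int = 2000,
    ):
        # classify returns P(injection) in [0.0, 1.0], e.g. from a quantized
        # DistilBERT fine-tuned on injection examples (~67MB).
        self.classify = classify
        self.fallback = fallback
        self.max_chars = max_chars
        self._pool = ThreadPoolExecutor(max_workers=1)

    def scan_response(self, response_body: str, timeout_ms: int = 100) -> Verdict:
        """
        Returns: (verdict, confidence)
        - verdict: "safe", "suspicious", "injection"
        - confidence: 0.0-1.0
        """
        # Only the first max_chars characters are classified.
        future = self._pool.submit(self.classify, response_body[: self.max_chars])
        try:
            # Stop waiting after timeout_ms to avoid a proxy bottleneck.
            injection_score = future.result(timeout=timeout_ms / 1000)
        except FutureTimeout:
            # On timeout, fall back to the pattern detector. The worker thread
            # keeps running and delays the next call; a real implementation
            # needs cancellable inference.
            return self.fallback(response_body)

        if injection_score > 0.9:
            return ("injection", injection_score)
        if injection_score > 0.7:
            return ("suspicious", injection_score)
        return ("safe", injection_score)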
```

**Deployment questions:**
1. Which inference runtime? (Hugging Face transformers, ONNX Runtime, TensorRT?)
2. How to handle out-of-memory on large responses?
3. How to update model if new jailbreak techniques emerge?
4. Should we ensemble: LLM + patterns for high-confidence blocks?

## Open questions

1. **Performance:** How much latency does Python string-matching add? Benchmark against pipelock.
2. **False positives:** Will entropy detector trip on legitimate high-entropy traffic (e.g., binary API responses)? Need real-world testing.
3. **Coverage:** Are regex patterns sufficient, or do we need more sophisticated token detection (e.g., format validation)?
4. **Upstream:** If we build this, should we upstream it as an option to pipelock, or keep it bot-bottle-specific?