didericis/bot-bottle

didericis e6b3cd1824

docs: remove time estimates and add LLM-based detection analysis

- Remove all time estimates (2-3 weeks, 1-2 weeks, etc.)
- Add detailed analysis of using LLM for prompt injection detection
- Survey existing models (none purpose-built for this)
- Sketch DistilBERT fine-tuning approach (~67MB quantized)
- Analyze latency/footprint tradeoffs (50-150ms vs. <5ms for patterns)
- Recommend pattern-based Phase 2, with LLM as optional Phase 2b
- Include code sketch of LLM detector with timeout fallback
- List open questions for LLM deployment

Conclusion: Patterns are faster/simpler for now; LLM only if patterns
miss sophisticated attacks in production.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2026-06-04 14:02:59 -04:00

DLP alternatives to pipelock: per-route configuration and response handling

Question

Pipelock lacks support for per-route or per-host response scanning rules, making it impossible to skip DLP scanning for large binary downloads (e.g., .whl files) while keeping scanning enabled for other traffic on the same host. Should we replace pipelock with a purpose-built DLP/token-scanning proxy that supports granular per-route configuration?

Summary

Yes. Pipelock's flat, global configuration is fundamentally at odds with the per-route model bot-bottle is built on. A custom or configurable DLP proxy built atop mitmproxy (which we already use for egress) would let us:

Skip DLP scanning selectively — e.g., scan responses from PyPI for credentials but skip scanning .whl file contents
Configure scanning per-route — different rules for different hosts/paths without global toggles
Reduce operational surface — one proxy (egress) instead of two (egress + pipelock)
Target AI-specific threats — focus on credential exfiltration and prompt injection instead of generic DLP

Tradeoff: We'd need to maintain our own scanning logic. Pipelock provides out-of-the-box BIP-39 seed-phrase detection, entropy checks, and pluggable DLP rules. Building custom logic means we need to be explicit about what we're protecting against and keep that code auditable.

Current pipelock limitations

Issue 1: No per-route response scanning rules

Pipelock's response scanning is part of TLS interception — a global feature with no per-host knobs:

tls_interception:
  enabled: true
  passthrough_domains: [...]  # Can skip MITM, but not just response scanning

Status: Tested with pipelock v2.3.0. Confirmed that:

response_body_scanning config field doesn't exist
No way to set per-host response size limits
No way to skip scanning for specific file extensions
tls_passthrough: true disables both request AND response scanning (we want request scanning to stay on)

Issue 2: Global configuration only

All of pipelock's scanning rules are global. If route A wants to skip .whl scanning and route B wants to skip .tar.gz, there's nowhere to express that distinction — the config is flat.

Issue 3: LLM prompt-specific false positives

Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.
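A rough back-of-the-envelope estimate shows why this detector is noisy on natural-language traffic. In the BIP-39 scheme, a 12-word phrase encodes 128 bits of entropy plus a 4-bit checksum, so a random run of 12 wordlist words passes the checksum with probability 1/16. The sketch below treats each 12-word window as an independent trial, which is a simplification, and `windows` is a hypothetical count of such runs in the scanned traffic.

```python
# Probability that at least one of `windows` runs of 12 BIP-39 wordlist
# words passes the 4-bit checksum by chance (independence is assumed).
CHECKSUM_BITS = 4  # 12 words * 11 bits = 132 bits = 128 entropy + 4 checksum
P_SINGLE = 1 / 2**CHECKSUM_BITS  # 1/16 per candidate window


def p_false_positive(windows: int) -> float:
    """Chance of one or more accidental checksum matches across all windows."""
    return 1 - (1 - P_SINGLE) ** windows


for n in (1, 10, 50):
    print(n, round(p_false_positive(n), 3))
```

Under these assumptions, a few dozen candidate windows are enough to make an accidental match very likely, which is consistent with the detector being disabled globally.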

Issue 4: No prompt injection detection

Important clarification: Pipelock does NOT detect prompt injections. It detects:

Token patterns (regex)
Entropy (random-looking strings)
BIP-39 seed phrases (12+ word checksums)

But it cannot detect semantic attacks like:

Attempts to exfiltrate system prompts
Jailbreak attempts ("ignore previous instructions")
Model output that reveals internal system details

This is a novel threat specific to LLM agents that pipelock wasn't designed for.

Replacement design: mitmproxy-based DLP addon

Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:

Architecture

Agent
  ↓ (HTTP_PROXY=http://egress:8080)
Egress (mitmproxy)
  ├─ Addon 1: Path allowlisting (current)
  ├─ Addon 2: Credential injection (current)
  └─ Addon 3: DLP scanning (NEW)
       ├─ Config: per-route scanning rules from manifest
       ├─ Detectors: token patterns, prompt injection, entropy
       └─ Action: block/warn based on route config
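To make the DLP stage concrete, the sketch below shows one way the addon could fold detector results into a single action. All names here (`Finding`, `decide_action`, the severity strings, the toy detector) are hypothetical and not part of the existing egress addon. In mitmproxy, a function like this would be called from the addon's `request` and `response` hooks.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    detector: str
    severity: str  # "warn" or "block"
    reason: str


# A detector takes a body and returns a Finding, or None when clean.
Detector = Callable[[str], Optional[Finding]]
SEVERITY_RANK = {"warn": 1, "block": 2}


def decide_action(body: str, detectors: Iterable[Detector]) -> Tuple[str, List[Finding]]:
    """Run every detector and return ("allow" | "warn" | "block", findings)."""
    findings = [f for d in detectors if (f := d(body)) is not None]
    if not findings:
        return "allow", []
    # The most severe finding decides the action for the whole body.
    worst = max(findings, key=lambda f: SEVERITY_RANK[f.severity])
    return worst.severity, findings


def aws_key_detector(body: str) -> Optional[Finding]:
    # Toy detector for illustration only; a real one would use full regexes.
    if "AKIA" in body:
        return Finding("token_patterns", "block", "possible AWS access key")
    return None
```

Keeping the decision logic free of mitmproxy types means it can be unit-tested without a running proxy.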

Per-route configuration in manifest

Routes separately configure outbound (request to upstream) and inbound (response from upstream) scanning:

egress:
  routes:
    - host: api.anthropic.com
      dlp:
        outbound_detectors: [token_patterns, known_secrets]  # default
        inbound_detectors: [naive_injection_detection]  # default
    
    - host: files.pythonhosted.org
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: false  # Skip response scanning (binary downloads)
    
    - host: internal-service.corp
      dlp:
        outbound_detectors: false
        inbound_detectors: false  # Trusted internal, no scanning

Detectors:

token_patterns — API keys, GitHub tokens, AWS credentials, etc.
known_secrets — Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
naive_injection_detection: Pattern-based detection of prompt disclosure and jailbreak phrases (see the naive detector section below)
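As an illustration of how the addon might resolve this manifest at request time, the sketch below looks up a route by host and returns the detector list for one direction. The function name `detectors_for`, the `DEFAULTS` table, and the choice to apply defaults when a route omits a key are assumptions, not settled behavior. Hosts with no matching route return an empty list here; whether such traffic is already rejected by path allowlisting is outside this sketch.

```python
DEFAULTS = {
    "outbound_detectors": ["token_patterns", "known_secrets"],
    "inbound_detectors": ["naive_injection_detection"],
}

# Parsed form of the manifest's egress.routes list (e.g. from a YAML loader).
ROUTES = [
    {"host": "api.anthropic.com", "dlp": {}},
    {"host": "files.pythonhosted.org", "dlp": {"inbound_detectors": False}},
    {
        "host": "internal-service.corp",
        "dlp": {"outbound_detectors": False, "inbound_detectors": False},
    },
]


def detectors_for(routes, host, direction):
    """Return detector names for `direction` ("outbound" or "inbound")."""
    key = f"{direction}_detectors"
    if key not in DEFAULTS:
        raise ValueError(f"unknown direction: {direction!r}")
    for route in routes:
        if route["host"] == host:
            value = route.get("dlp", {}).get(key, DEFAULTS[key])
            # `false` in the manifest disables scanning for that direction.
            return list(value) if value else []
    return []
```

With this shape, `detectors_for(ROUTES, "files.pythonhosted.org", "inbound")` yields an empty list, so `.whl` downloads skip response scanning while outbound requests to the same host are still scanned.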

Detector design

Three core detectors, each with tunable sensitivity:

Token detector
- Regex patterns for API keys (AWS AKIA, GitHub ghp_, etc.)
- Anthropic/OpenAI API keys
- OAuth tokens (Bearer patterns)
- Action: Block immediately; patterns must be specific enough that false positives are negligible

Entropy detector
- Shannon entropy threshold (bits/char)
- Flags high-entropy secrets (tunable per-route)
- Current pipelock default: 4.5 bits/char
- Action: Warn or block based on route config

Prompt injection detector (phase 2)
- Detect attempts to exfiltrate system prompts via LLM outputs
- Pattern: responses containing "system prompt", "instructions", "directive" + credential
- Action: Block or sample for audit
Advantages over pipelock

Aspect	Pipelock	Mitmproxy addon
Per-route rules	❌ (global only)	✅ (manifest-driven)
Response-specific config	❌ (all-or-nothing)	✅ (separate outbound_detectors and inbound_detectors per route)
Request scanning overhead	✅ (lightweight Go)	Unknown (Python, likely slower; needs benchmarking)
Maintenance burden	Low (third-party)	High (custom code)
Auditability	External Go codebase	✅ (in-repo)
AI-specific detection	Limited	✅ (token patterns, prompt injection)
Code reuse	None	✅ (egress addon framework)

Disadvantages

Maintenance responsibility — We own the security logic. Any bugs in detector regexes or entropy thresholds are our problem.
Feature parity gap — Pipelock's BIP-39 detector is sophisticated. We'd need to decide: replicate it, skip it, or ship a simplified version.
Performance — Custom Python detectors will be slower than pipelock's Go implementation. Benchmarking needed.
Coverage breadth — Pipelock covers generic DLP (credit cards, SSNs, etc.). We'd focus narrowly on AI/credential exfil.

Alternative: Configurable pipelock fork

Rather than build from scratch, fork pipelock and add response_body_scanning config:

response_body_scanning:
  enabled: true
  skip_extensions: [".whl", ".tar.gz"]
  max_response_bytes: 104857600  # 100MB

Pros:

Reuses existing detectors and maturity
Lower maintenance burden
Clear path to upstream (could be PR'd)

Cons:

Still maintains a fork
Pipelock's maintainers may not want per-host rules in a config that is global by design
Go code is farther from our codebase (harder to audit)
Doesn't solve prompt-injection detection

Recommendation

Build the mitmproxy addon (phase 1: tokens + entropy; phase 2: prompt injection).

Rationale:

Bot-bottle already owns the mitmproxy egress addon — extending it keeps security logic in-repo and auditable.
Per-route DLP configuration aligns with bot-bottle's design (PRD 0017 is already per-route).
Replacing pipelock reduces sidecar count and operational surface.
AI-specific detectors (tokens, prompt injection) matter more than generic DLP for agent containment.

Fallback: If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.

Naive prompt injection detector design

Since pipelock doesn't detect prompt injections, we need a custom detector. Here's a permissive design that accepts false negatives (missed attacks) in exchange for few false positives:

What to detect

High confidence (block immediately):

Response contains known credential pattern + "system prompt" phrase together
Response contains both "instructions" and a token pattern

Medium confidence (warn):

Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
Multiple jailbreak keywords in single response

Ignore (too noisy):

Single jailbreak keywords without additional context
"system prompt" in documentation contexts
Common phrases like "instructions provided"

Naive detector pseudocode

import re


class PromptInjectionDetector:
    # Phrases that suggest prompt exfiltration
    DISCLOSURE_PHRASES = [
        r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?)',
        r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
    ]
    
    # Phrases suggesting jailbreak attempts
    JAILBREAK_PHRASES = [
        r'(?i)(ignore\s+previous|forget\s+everything|disregard)',
        r'(?i)(from\s+now\s+on|pretend|act\s+as)',
        r'(?i)(bypass|circumvent|override)',
    ]
    
    TOKEN_PATTERNS = [
        r'AKIA[0-9A-Z]{16}',  # AWS
        r'ghp_[A-Za-z0-9_]{36}',  # GitHub
        r'sk_live_[A-Za-z0-9]{24}',  # Stripe
        r'Bearer\s+[A-Za-z0-9._-]{50,}',  # JWT-like tokens
    ]
    
    def scan_response(self, response_body):
        """Returns (severity, reason) or (None, None) if clean."""
        
        # Rule 1: Disclosure + token = HIGH confidence block
        disclosure_found = any(
            re.search(phrase, response_body) 
            for phrase in self.DISCLOSURE_PHRASES
        )
        token_found = any(
            re.search(pattern, response_body)
            for pattern in self.TOKEN_PATTERNS
        )
        
        if disclosure_found and token_found:
            return ("BLOCK", "Prompt disclosure with embedded credential")
        
        # Rule 2: Multiple jailbreak keywords = WARN
        jailbreak_count = sum(
            1 for phrase in self.JAILBREAK_PHRASES
            if re.search(phrase, response_body)
        )
        
        if jailbreak_count >= 2:
            return ("WARN", f"{jailbreak_count} jailbreak phrase groups matched")
        
        # Rule 3: Disclosure alone without tokens = WARN only if very explicit
        if disclosure_found and "system prompt:" in response_body.lower():
            return ("WARN", "Explicit system prompt disclosure")
        
        # Otherwise: clean
        return (None, None)

Why this is permissive

Single keywords ignored — "ignore previous instructions" in a legitimate conversation doesn't trigger
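Context required: disclosure phrases block only when a token pattern is also present, and jailbreak phrases warn only when two or more phrase groups match (the one exception is an explicit "system prompt:" label, which warns on its own)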
Documentation exemption — "instructions provided" in a help section won't block
Warn vs. block — Only block on high-confidence signals; warn on medium
No entropy-based guessing — We don't try to be clever about detecting obfuscated prompts

False negatives this misses

This detector intentionally lets through:

Prompt injections using novel phrasing we haven't seen
Obfuscated jailbreak attempts ("behave differently", "role-play")
Exfiltration via indirect methods ("describe the system", "what are your constraints")
Sophisticated attacks that split the prompt across multiple exchanges

Rationale: Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.

Per-route configuration

Routes can enable/disable prompt injection scanning:

egress:
  routes:
    - host: api.anthropic.com
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: [naive_injection_detection]
    
    - host: internal-docs.corp
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: false  # Skip prompt injection (trusted internal)

Implementation phases

Phase 1: Secret exfiltration detection

Goal: Prevent credentials from leaking to upstream services

Token patterns detector — API keys, GitHub tokens, AWS credentials (regex-based)
Known secrets detector — Check if provisioned credentials appear in outbound traffic
- Secrets passed to cred-proxy or agent environment
- Multiple encodings (base64, hex, URL-encoded variants)
Outbound scanning by default — enabled for all routes unless explicitly disabled
Per-route config: outbound_detectors: [token_patterns, known_secrets]
Action: Block immediately on token match; warn when the entropy threshold is exceeded (sensitivity tuned low, i.e., a high threshold, to avoid false positives)
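The known-secrets check with multiple encodings could be sketched as follows. The function names and the `secrets` mapping (name to provisioned value) are hypothetical, and the variant set covers only the encodings listed above. This is a substring check under stated assumptions, not a complete implementation: a secret that is base64-encoded as part of a larger payload, or hex-encoded in uppercase, would not match.

```python
import base64
from typing import Dict, Optional, Set
from urllib.parse import quote


def secret_variants(secret: str) -> Set[str]:
    """Return the secret plus base64, URL-safe base64, hex, and URL-encoded forms."""
    raw = secret.encode("utf-8")
    return {
        secret,
        base64.b64encode(raw).decode("ascii"),
        base64.urlsafe_b64encode(raw).decode("ascii"),
        raw.hex(),
        quote(secret, safe=""),
    }


def find_known_secret(body: str, secrets: Dict[str, str]) -> Optional[str]:
    """Return the name of the first provisioned secret found in body, else None."""
    for name, value in secrets.items():
        if any(variant in body for variant in secret_variants(value)):
            return name
    return None
```

Returning the secret's name rather than its value keeps the credential itself out of block logs.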

Phase 2: Prompt injection detection

Goal: Prevent agents from exfiltrating system prompts or being jailbroken

Option A: Naive pattern-based detector

Naive injection detector — as sketched above
Inbound scanning by default — enabled for all routes unless explicitly disabled
Per-route config: inbound_detectors: [naive_injection_detection]
Actions:
- BLOCK: Credential + prompt disclosure detected
- WARN: Multiple jailbreak keywords or explicit prompt disclosure
- ALLOW: Single keywords or documentation phrases

Option B: LLM-based semantic detector

See section below on using a specialized LLM for prompt injection detection.

Phase 3: Hardening & tuning

Real-world false positive analysis from Phase 1 & 2
Rate limiting on DLP blocks
Audit/sampling mode for flagged responses
Additional encodings for known_secrets (GZIP, base32, etc.)

LLM-based prompt injection detection

Viability analysis

Tradeoff: Using an LLM to detect prompt injections is semantically more powerful than regex, but has latency and resource costs.

Requirements for bot-bottle:

Sub-100ms latency (add-on to HTTP proxy, can't block traffic significantly)
<1GB RAM footprint (runs in sidecar alongside mitmproxy)
Simple API (classify: safe/injection/suspicious)
Preferably quantized/distilled (not full-size models)

Feasibility: Marginal. Regex patterns are faster, but an LLM could catch sophisticated attacks.

Existing models

Purpose-built prompt injection detectors:

Rebuff.ai's Prompt Injection API (closed-source, commercial)
- Hosted detection service
- ~50ms per request
- Not viable (external dependency, adds latency)
Microsoft's Presidio + custom rules
- Entity recognition + PII detection
- Broader than prompt injection
- Would need custom training for jailbreak/disclosure patterns
HuggingFace models:
- roberta-large-openai-detector — detects GPT-2 text (not injections)
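- No off-the-shelf model validated for this proxy use case; published injection classifiers (e.g., Meta's Prompt Guard) would still need evaluation against our traffic and latency budget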

Training a custom model:

Data: Dataset of prompt injection attempts vs. legitimate responses (limited public datasets)
Architecture: Binary classifier (DistilBERT, ALBERT) fine-tuned on injection examples
Size: DistilBERT ~268MB in fp32, ~67MB with 8-bit quantization (acceptable footprint)
Latency: ~50-150ms per response on CPU (concerning for proxy)
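The size figures follow from simple arithmetic, assuming a classifier with roughly 67 million parameters (the count implied by ~268MB at 4 bytes per weight; the exact number depends on the classification head). Note that ~67MB corresponds to 8-bit weights, and 4-bit quantization would be about half that.

```python
PARAMS = 67_000_000  # assumed parameter count, see the note above


def model_size_mb(params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in MB (10^6 bytes), ignoring file overhead."""
    return params * bits_per_weight / 8 / 1e6


for label, bits in (("fp32", 32), ("int8", 8), ("int4", 4)):
    print(f"{label}: ~{model_size_mb(PARAMS, bits):.1f} MB")
```

Either quantized size fits comfortably inside the <1GB sidecar budget, so latency rather than footprint is the binding constraint.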

Recommendation

Phase 2a: Use naive pattern detector (regex-based, sketched above)

Fast (<5ms per response)
Low false positives with permissive rules
No external dependencies

Phase 2b (optional, if needed): Evaluate LLM approach

Collect real-world false negatives from pattern detector
If sophisticated attacks slip through, consider DistilBERT-based classifier
Quantize + run locally in sidecar
Benchmark against 100ms latency budget
Fall back to patterns if latency unacceptable
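A benchmark against the latency budget could look like the sketch below. The helper names are hypothetical, `detector` is any callable that takes a response body, and using the 95th percentile rather than the mean is an assumption about how we would want to judge the budget. It relies only on `time.perf_counter` and `statistics.quantiles` from the standard library.

```python
import statistics
import time


def p95_latency_ms(detector, samples, repeats=5):
    """Run detector over samples `repeats` times; return p95 latency in ms."""
    timings = []
    for _ in range(repeats):
        for body in samples:
            start = time.perf_counter()
            detector(body)
            timings.append((time.perf_counter() - start) * 1000)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    return statistics.quantiles(timings, n=20)[-1]


def within_budget(detector, samples, budget_ms=100.0):
    """True when the detector's p95 latency fits the budget."""
    return p95_latency_ms(detector, samples) <= budget_ms
```

Running the same harness over the pattern detector and any candidate classifier, with captured response bodies as `samples`, would give a like-for-like comparison on the target hardware.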

Why not jump to LLM:

Latency: 50-150ms adds significant overhead to every response
Complexity: Custom model training needed; no off-the-shelf solution
Overkill: Pattern detector catches obvious attacks; sophisticated attacks are rare
Unknown unknowns: Adversaries can evade LLM-based detectors via adversarial prompts

If we do build an LLM detector

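# Sketch of LLM-based detection
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class LLMPromptInjectionDetector:
    def __init__(self, classify, fallback):
        # classify(text) -> injection probability in [0.0, 1.0]. Expected to
        # wrap a quantized DistilBERT fine-tuned on injection examples (~67MB);
        # model loading is left to the caller because the framework is undecided.
        # fallback(text) -> result of the pattern detector.
        self.classify = classify
        self.fallback = fallback
        self.pool = ThreadPoolExecutor(max_workers=1)

    def scan_response(self, response_body, timeout_ms=100):
        """
        Returns: (verdict, confidence)
        - verdict: "safe", "suspicious", "injection"
        - confidence: 0.0-1.0
        On timeout, returns whatever the fallback detector returns instead.
        """
        # Classify only the first 2000 characters to bound inference time
        future = self.pool.submit(self.classify, response_body[:2000])
        try:
            # Stop waiting at timeout_ms to avoid a proxy bottleneck
            injection_score = future.result(timeout=timeout_ms / 1000)
        except FutureTimeout:
            # cancel() cannot stop inference that is already running; the
            # worker thread finishes in the background.
            future.cancel()
            # On timeout, fall back to pattern detector
            return self.fallback(response_body)

        if injection_score > 0.9:
            return ("injection", injection_score)
        elif injection_score > 0.7:
            return ("suspicious", injection_score)
        else:
            return ("safe", injection_score)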

Deployment questions:

Which LLM framework? (transformers, ONNX, TensorRT?)
How to handle out-of-memory on large responses?
How to update model if new jailbreak techniques emerge?
Should we ensemble: LLM + patterns for high-confidence blocks?

Open questions

Performance: How much latency does Python string-matching add? Benchmark against pipelock.
False positives: Will entropy detector trip on legitimate high-entropy traffic (e.g., binary API responses)? Need real-world testing.
Coverage: Are regex patterns sufficient, or do we need more sophisticated token detection (e.g., format validation)?
Upstream: If we build this, should we upstream it as an option to pipelock, or keep it bot-bottle-specific?
