From e6b3cd1824481e321917126aa8dfeea62c15b10b Mon Sep 17 00:00:00 2001
From: didericis
Date: Thu, 4 Jun 2026 14:02:59 -0400
Subject: [PATCH] docs: remove time estimates and add LLM-based detection analysis

- Remove all time estimates (2-3 weeks, 1-2 weeks, etc.)
- Add detailed analysis of using LLM for prompt injection detection
- Survey existing models (none purpose-built for this)
- Sketch DistilBERT fine-tuning approach (~67MB quantized)
- Analyze latency/footprint tradeoffs (50-150ms vs. <5ms for patterns)
- Recommend pattern-based Phase 2, with LLM as optional Phase 2b
- Include code sketch of LLM detector with timeout fallback
- List open questions for LLM deployment

Conclusion: Patterns are faster/simpler for now; LLM only if patterns miss sophisticated attacks in production.

Co-Authored-By: Claude Haiku 4.5
---
 docs/research/dlp-alternatives-to-pipelock.md | 107 +++++++++++++++++-
 1 file changed, 104 insertions(+), 3 deletions(-)

diff --git a/docs/research/dlp-alternatives-to-pipelock.md b/docs/research/dlp-alternatives-to-pipelock.md
index 71ad833..3f88694 100644
--- a/docs/research/dlp-alternatives-to-pipelock.md
+++ b/docs/research/dlp-alternatives-to-pipelock.md
@@ -289,7 +289,7 @@ egress:
 
 ## Implementation phases
 
-### Phase 1: Secret exfiltration detection (2-3 weeks)
+### Phase 1: Secret exfiltration detection
 **Goal:** Prevent credentials from leaking to upstream services
 
 - **Token patterns detector** — API keys, GitHub tokens, AWS credentials (regex-based)
@@ -300,9 +300,10 @@ egress:
 - **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
 - **Action:** Block immediately on token match; warn on entropy threshold (tuned low to avoid false positives)
 
-### Phase 2: Prompt injection detection (1-2 weeks)
+### Phase 2: Prompt injection detection
 **Goal:** Prevent agents from exfiltrating system prompts or being jailbroken
 
+#### Option A: Naive pattern-based detector
 - **Naive injection detector** — as sketched above
 - **Inbound scanning by default** — enabled for all routes unless explicitly disabled
 - **Per-route config:** `inbound_detectors: [naive_injection_detection]`
@@ -311,12 +312,112 @@ egress:
   - WARN: Multiple jailbreak keywords or explicit prompt disclosure
   - ALLOW: Single keywords or documentation phrases
 
-### Phase 3: Hardening & tuning (2-3 weeks, optional)
+#### Option B: LLM-based semantic detector
+See section below on using a specialized LLM for prompt injection detection.
+
+### Phase 3: Hardening & tuning
 - Real-world false positive analysis from Phase 1 & 2
 - Rate limiting on DLP blocks
 - Audit/sampling mode for flagged responses
 - Additional encodings for known_secrets (GZIP, base32, etc.)
 
+## LLM-based prompt injection detection
+
+### Viability analysis
+
+**Tradeoff:** Using an LLM to detect prompt injections is semantically more powerful than regex, but has latency and resource costs.
+
+**Requirements for bot-bottle:**
+- Sub-100ms latency (add-on to HTTP proxy, can't block traffic significantly)
+- <1GB RAM footprint (runs in sidecar alongside mitmproxy)
+- Simple API (classify: safe/injection/suspicious)
+- Preferably quantized/distilled (not full-size models)
+
+**Feasibility:** Marginal. Regex patterns are faster, but an LLM could catch sophisticated attacks.
+
+### Existing models
+
+**Purpose-built prompt injection detectors:**
+1. **Rebuff.ai's Prompt Injection API** (closed-source, commercial)
+   - Hosted detection service
+   - ~50ms per request
+   - Not viable (external dependency, adds latency)
+
+2. **Microsoft's Presidio** + custom rules
+   - Entity recognition + PII detection
+   - Broader than prompt injection
+   - Would need custom training for jailbreak/disclosure patterns
+
+3. **HuggingFace models:**
+   - `roberta-large-openai-detector` — detects GPT-2 text (not injections)
+   - No off-the-shelf model specifically for prompt injection
+
+**Training a custom model:**
+- **Data:** Dataset of prompt injection attempts vs. legitimate responses (limited public datasets)
+- **Architecture:** Binary classifier (DistilBERT, ALBERT) fine-tuned on injection examples
+- **Size:** DistilBERT ~268MB, quantized ~67MB (acceptable footprint)
+- **Latency:** ~50-150ms per response on CPU (concerning for proxy)
+
+### Recommendation
+
+**Phase 2a: Use naive pattern detector** (regex-based, sketched above)
+- Fast (<5ms per response)
+- Low false positives with permissive rules
+- No external dependencies
+
+**Phase 2b (optional, if needed): Evaluate LLM approach**
+- Collect real-world false negatives from pattern detector
+- If sophisticated attacks slip through, consider DistilBERT-based classifier
+- Quantize + run locally in sidecar
+- Benchmark against 100ms latency budget
+- Fall back to patterns if latency unacceptable
+
+**Why not jump to LLM:**
+1. Latency: 50-150ms adds significant overhead to every response
+2. Complexity: Custom model training needed; no off-the-shelf solution
+3. Overkill: Pattern detector catches obvious attacks; sophisticated attacks are rare
+4. Unknown unknowns: Adversaries can evade LLM-based detectors via adversarial prompts
+
+### If we do build an LLM detector
+
+```python
+# Sketch of LLM-based detection
+class LLMPromptInjectionDetector:
+    def __init__(self):
+        # Quantized DistilBERT, fine-tuned on injection examples
+        self.model = load_model("prompt-injection-classifier-q4")  # ~67MB
+        self.tokenizer = load_tokenizer("distilbert-base-uncased")
+
+    def scan_response(self, response_body, timeout_ms=100):
+        """
+        Returns: (verdict, confidence)
+        - verdict: "safe", "suspicious", "injection"
+        - confidence: 0.0-1.0
+        """
+        try:
+            # Timeout hard at 100ms to avoid proxy bottleneck
+            tokens = self.tokenizer.encode(response_body[:2000], truncation=True)
+            logits = self.model(tokens, timeout=timeout_ms)
+
+            injection_score = logits["injection_class"]
+
+            if injection_score > 0.9:
+                return ("injection", injection_score)
+            elif injection_score > 0.7:
+                return ("suspicious", injection_score)
+            else:
+                return ("safe", injection_score)
+        except TimeoutError:
+            # On timeout, fall back to pattern detector
+            return self.fallback_pattern_detector(response_body)
+```
+
+**Deployment questions:**
+1. Which LLM framework? (transformers, ONNX, TensorRT?)
+2. How to handle out-of-memory on large responses?
+3. How to update model if new jailbreak techniques emerge?
+4. Should we ensemble: LLM + patterns for high-confidence blocks?
+
 ## Open questions
 
 1. **Performance:** How much latency does Python string-matching add? Benchmark against pipelock.
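
Review note on the timeout fallback: the sketch in this patch passes `timeout=timeout_ms` to the model call and catches `TimeoutError`, which assumes the inference library supports per-call timeouts. That may not hold for whichever framework is chosen. One way to get the same behaviour with only the standard library is to run the classifier in a worker thread and stop waiting after the budget. The following is a minimal sketch under that assumption; `TimeboxedDetector`, `pattern_verdict`, `score_to_verdict`, the regex, and the fallback confidence of 0.5 are all hypothetical names and placeholder values introduced for illustration, not part of the patch.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical stand-in for the naive pattern detector; the real rules live elsewhere.
_PATTERNS = re.compile(
    r"ignore (all )?previous instructions|reveal your system prompt",
    re.IGNORECASE,
)


def pattern_verdict(text):
    """Cheap regex fallback. The 0.5 confidence is an arbitrary placeholder."""
    return ("suspicious", 0.5) if _PATTERNS.search(text) else ("safe", 0.0)


def score_to_verdict(score):
    """Map an injection score to the three verdicts, using the patch's thresholds."""
    if score > 0.9:
        return ("injection", score)
    if score > 0.7:
        return ("suspicious", score)
    return ("safe", score)


class TimeboxedDetector:
    def __init__(self, classify, timeout_ms=100):
        # classify: callable(str) -> float in [0, 1]; a hypothetical model wrapper.
        self._classify = classify
        self._timeout_s = timeout_ms / 1000.0
        # A single worker means a slow call also delays the calls queued behind it.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def scan_response(self, body):
        future = self._pool.submit(self._classify, body[:2000])
        try:
            return score_to_verdict(future.result(timeout=self._timeout_s))
        except FutureTimeout:
            # The worker keeps running to completion; we only stop waiting for it.
            future.cancel()
            return pattern_verdict(body)


if __name__ == "__main__":
    def slow_classifier(text):
        time.sleep(0.3)  # simulate inference that blows the latency budget
        return 0.99

    detector = TimeboxedDetector(slow_classifier, timeout_ms=50)
    print(detector.scan_response("Please ignore previous instructions"))
```

Note that a thread-based timeout bounds how long the proxy waits but does not free the CPU the model is using, so sustained slow inference would still need a separate backpressure or circuit-breaker decision.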