docs: remove time estimates and add LLM-based detection analysis

- Remove all time estimates (2-3 weeks, 1-2 weeks, etc.)
- Add detailed analysis of using LLM for prompt injection detection
- Survey existing models (none purpose-built for this)
- Sketch DistilBERT fine-tuning approach (~67MB quantized)
- Analyze latency/footprint tradeoffs (50-150ms vs. <5ms for patterns)
- Recommend pattern-based Phase 2, with LLM as optional Phase 2b
- Include code sketch of LLM detector with timeout fallback
- List open questions for LLM deployment

Conclusion: Patterns are faster/simpler for now; LLM only if patterns miss sophisticated attacks in production.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

@@ -289,7 +289,7 @@ egress:

 ## Implementation phases

-### Phase 1: Secret exfiltration detection (2-3 weeks)
+### Phase 1: Secret exfiltration detection
 **Goal:** Prevent credentials from leaking to upstream services

 - **Token patterns detector** — API keys, GitHub tokens, AWS credentials (regex-based)
@@ -300,9 +300,10 @@ egress:
 - **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
 - **Action:** Block immediately on token match; warn on entropy threshold (tuned low to avoid false positives)
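
The Phase 1 action rule (block on a token match, warn on an entropy threshold) could be sketched roughly as follows. This is a minimal illustration, not the project's detector: the two patterns cover only the commonly documented `AKIA` AWS access key ID prefix and the `ghp_` GitHub token prefix, and `scan_outbound`, `ENTROPY_THRESHOLD`, and `MIN_CANDIDATE_LEN` are hypothetical names and values chosen for the example.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; a real deployment would need a broader, tested set.
TOKEN_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}

# Hypothetical tuning knobs, not values taken from the design doc.
ENTROPY_THRESHOLD = 4.5  # bits per character
MIN_CANDIDATE_LEN = 32   # skip short strings, which are too noisy to score

# Long runs of base64/URL-safe characters are treated as entropy candidates.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_\-]{%d,}" % MIN_CANDIDATE_LEN)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan_outbound(body: str) -> tuple[str, str]:
    """Return (action, reason); action is "block", "warn", or "allow"."""
    # Known token formats block immediately.
    for name, pattern in TOKEN_PATTERNS.items():
        if pattern.search(body):
            return ("block", name)
    # High-entropy strings only warn, since they are a weaker signal.
    for match in CANDIDATE.finditer(body):
        if shannon_entropy(match.group()) >= ENTROPY_THRESHOLD:
            return ("warn", "high_entropy")
    return ("allow", "")
```

Keeping the entropy check as warn-only reflects the rule quoted above: a regex hit is specific enough to block, while entropy alone also fires on hashes and compressed data.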

-### Phase 2: Prompt injection detection (1-2 weeks)
+### Phase 2: Prompt injection detection
 **Goal:** Prevent agents from exfiltrating system prompts or being jailbroken

+#### Option A: Naive pattern-based detector
 - **Naive injection detector** — as sketched above
 - **Inbound scanning by default** — enabled for all routes unless explicitly disabled
 - **Per-route config:** `inbound_detectors: [naive_injection_detection]`
@@ -311,12 +312,112 @@ egress:
 - WARN: Multiple jailbreak keywords or explicit prompt disclosure
 - ALLOW: Single keywords or documentation phrases

-### Phase 3: Hardening & tuning (2-3 weeks, optional)
+#### Option B: LLM-based semantic detector
+See section below on using a specialized LLM for prompt injection detection.
+
+### Phase 3: Hardening & tuning
 - Real-world false positive analysis from Phase 1 & 2
 - Rate limiting on DLP blocks
 - Audit/sampling mode for flagged responses
 - Additional encodings for known_secrets (GZIP, base32, etc.)

+## LLM-based prompt injection detection
+
+### Viability analysis
+
+**Tradeoff:** Using an LLM to detect prompt injections is semantically more powerful than regex, but has latency and resource costs.
+
+**Requirements for bot-bottle:**
+- Sub-100ms latency (add-on to HTTP proxy, can't block traffic significantly)
+- <1GB RAM footprint (runs in sidecar alongside mitmproxy)
+- Simple API (classify: safe/injection/suspicious)
+- Preferably quantized/distilled (not full-size models)
+
+**Feasibility:** Marginal. Regex patterns are faster, but an LLM could catch sophisticated attacks.
+
+### Existing models
+
+**Purpose-built prompt injection detectors:**
+1. **Rebuff.ai's Prompt Injection API** (closed-source, commercial)
+   - Hosted detection service
+   - ~50ms per request
+   - Not viable (external dependency, adds latency)
+
+2. **Microsoft's Presidio** + custom rules
+   - Entity recognition + PII detection
+   - Broader than prompt injection
+   - Would need custom training for jailbreak/disclosure patterns
+
+3. **HuggingFace models:**
+   - `roberta-large-openai-detector` — detects GPT-2 text (not injections)
+   - No off-the-shelf model specifically for prompt injection
+
+**Training a custom model:**
+- **Data:** Dataset of prompt injection attempts vs. legitimate responses (limited public datasets)
+- **Architecture:** Binary classifier (DistilBERT, ALBERT) fine-tuned on injection examples
+- **Size:** DistilBERT ~268MB, quantized ~67MB (acceptable footprint)
+- **Latency:** ~50-150ms per response on CPU (concerning for proxy)
+
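
The size figures in the list above follow from simple arithmetic on the parameter count. The sketch below assumes roughly 67 million parameters, a round number chosen to match the doc's ~268MB figure (DistilBERT-base is usually quoted at about 66 million), and it ignores file-format overhead. `size_mb` is a hypothetical helper.

```python
# Back-of-envelope model size: parameters x bits per weight.
PARAMS = 67_000_000  # assumption: approximate DistilBERT-base parameter count


def size_mb(params: int, bits_per_weight: int) -> float:
    """Weight storage in megabytes (10^6 bytes), ignoring metadata overhead."""
    return params * bits_per_weight / 8 / 1_000_000


fp32_mb = size_mb(PARAMS, 32)  # full-precision weights
int8_mb = size_mb(PARAMS, 8)   # 8-bit quantized weights
int4_mb = size_mb(PARAMS, 4)   # 4-bit quantized weights

print(fp32_mb, int8_mb, int4_mb)
```

Under these assumptions, the ~67MB estimate corresponds to 8-bit weights; 4-bit quantization would be about half that.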
+### Recommendation
+
+**Phase 2a: Use naive pattern detector** (regex-based, sketched above)
+- Fast (<5ms per response)
+- Low false positives with permissive rules
+- No external dependencies
+
+**Phase 2b (optional, if needed): Evaluate LLM approach**
+- Collect real-world false negatives from pattern detector
+- If sophisticated attacks slip through, consider DistilBERT-based classifier
+- Quantize + run locally in sidecar
+- Benchmark against 100ms latency budget
+- Fall back to patterns if latency unacceptable
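
The "benchmark against 100ms latency budget" step could be checked with a small harness like the one below. It is a sketch under stated assumptions: `benchmark` is a hypothetical helper, `classify` stands in for whichever detector is being measured (pattern-based or model-based), and comparing the 95th percentile rather than the mean is a choice made here, not something the doc specifies.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark(classify: Callable[[str], object],
              samples: Sequence[str],
              budget_ms: float = 100.0) -> dict:
    """Time classify() on each sample and compare p95 latency to the budget."""
    if not samples:
        raise ValueError("need at least one sample")
    latencies_ms = []
    for body in samples:
        start = time.perf_counter()
        classify(body)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    # Simple nearest-rank estimate of the 95th percentile
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    p95 = latencies_ms[p95_index]
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": p95,
        "within_budget": p95 <= budget_ms,
    }
```

Running it on captured response bodies of realistic size matters more than the harness itself, since per-response latency depends on input length.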
+
+**Why not jump to LLM:**
+1. Latency: 50-150ms adds significant overhead to every response
+2. Complexity: Custom model training needed; no off-the-shelf solution
+3. Overkill: Pattern detector catches obvious attacks; sophisticated attacks are rare
+4. Unknown unknowns: Adversaries can evade LLM-based detectors via adversarial prompts
+
+### If we do build an LLM detector
+
+```python
+# Sketch of LLM-based detection
+from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
+
+
+class LLMPromptInjectionDetector:
+    def __init__(self, score_fn, fallback_pattern_detector, max_chars=2000):
+        # score_fn: callable(str) -> injection probability in [0.0, 1.0].
+        # Intended backend: quantized DistilBERT fine-tuned on injection
+        # examples (~67MB); tokenization and truncation live inside score_fn.
+        self.score_fn = score_fn
+        # Pattern detector used on timeout; returns (verdict, confidence)
+        self.fallback_pattern_detector = fallback_pattern_detector
+        self.max_chars = max_chars
+        self._pool = ThreadPoolExecutor(max_workers=1)
+
+    def scan_response(self, response_body, timeout_ms=100):
+        """
+        Returns: (verdict, confidence)
+        - verdict: "safe", "suspicious", "injection"
+        - confidence: injection score, 0.0-1.0
+        """
+        # Cap input length so inference cost stays bounded
+        future = self._pool.submit(self.score_fn, response_body[: self.max_chars])
+        try:
+            # Wait at most timeout_ms (default 100ms) to avoid proxy bottleneck
+            injection_score = future.result(timeout=timeout_ms / 1000)
+        except FutureTimeout:
+            # On timeout, fall back to pattern detector. The worker thread is
+            # not interrupted; cancel() only helps if the call has not started.
+            future.cancel()
+            return self.fallback_pattern_detector(response_body)
+
+        if injection_score > 0.9:
+            return ("injection", injection_score)
+        elif injection_score > 0.7:
+            return ("suspicious", injection_score)
+        else:
+            return ("safe", injection_score)
+```
+
+**Deployment questions:**
+1. Which LLM framework? (transformers, ONNX, TensorRT?)
+2. How to handle out-of-memory on large responses?
+3. How to update model if new jailbreak techniques emerge?
+4. Should we ensemble: LLM + patterns for high-confidence blocks?
+
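
One possible shape for the ensemble raised in deployment question 4 is sketched below. It is only an illustration of the idea, not a recommended policy: `combine` is a hypothetical function, the pattern verdicts reuse the doc's BLOCK/WARN/ALLOW levels in lowercase, and the rule "block only when both detectors agree at their highest level" is an assumption made for the example.

```python
# Map each detector's verdicts onto a shared 0-2 severity scale.
PATTERN_LEVEL = {"allow": 0, "warn": 1, "block": 2}
LLM_LEVEL = {"safe": 0, "suspicious": 1, "injection": 2}


def combine(pattern_verdict: str, llm_verdict: str) -> str:
    """Merge two verdicts: block on full agreement, warn on any signal."""
    p = PATTERN_LEVEL[pattern_verdict]
    m = LLM_LEVEL[llm_verdict]
    if p == 2 and m == 2:
        return "block"   # both detectors are at their highest level
    if p > 0 or m > 0:
        return "warn"    # at least one detector raised something
    return "allow"
```

This trades recall for precision on blocks: a pattern-only block is downgraded to a warning, which may or may not be acceptable depending on the false-positive data gathered in Phase 3.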
 ## Open questions

 1. **Performance:** How much latency does Python string-matching add? Benchmark against pipelock.