Add analysis of Google DeepMind's CaMeL (arXiv:2503.18813), which prevents prompt injections architecturally rather than detecting them.
Key findings:
- CaMeL operates at the agent execution layer (P-LLM/Q-LLM split + capability-based data flow tracking), not the network layer
- Not a replacement for pipelock/DLP: different threat surface
- Not viable today: research artifact, requires agent rearchitecture, doubles LLM costs, 7% utility loss on AgentDojo
- Worth watching: its capability model could complement bot-bottle's network controls if it matures into production software
Also clarifies pipelock's actual detection capabilities (no prompt injection detection) and adds naive detector sketch.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DLP alternatives to pipelock: per-route configuration and response handling
Question
Pipelock lacks support for per-route or per-host response scanning rules, making it impossible to skip DLP scanning for large binary downloads (e.g., .whl files) while keeping scanning enabled for other traffic on the same host. Should we replace pipelock with a purpose-built DLP/token-scanning proxy that supports granular per-route configuration?
Summary
Yes. Pipelock's flat, global configuration is fundamentally at odds with the per-route model bot-bottle is built on. A custom or configurable DLP proxy built atop mitmproxy (which we already use for egress) would let us:
- Skip DLP scanning selectively: e.g., scan responses from PyPI for credentials but skip scanning .whl file contents
- Configure scanning per-route: different rules for different hosts/paths without global toggles
- Reduce operational surface: one proxy (egress) instead of two (egress + pipelock)
- Target AI-specific threats: focus on credential exfiltration and prompt injection instead of generic DLP
Tradeoff: We'd need to maintain our own scanning logic. Pipelock provides out-of-the-box BIP-39 seed-phrase detection, entropy checks, and pluggable DLP rules. Building custom logic means we need to be explicit about what we're protecting against and keep that code auditable.
Current pipelock limitations
Issue 1: No per-route response scanning rules
Pipelock's response scanning is part of TLS interception — a global feature with no per-host knobs:
tls_interception:
  enabled: true
  passthrough_domains: [...]  # Can skip MITM, but not just response scanning
Status: Tested with pipelock v2.3.0. Confirmed that:
- The response_body_scanning config field doesn't exist
- No way to set per-host response size limits
- No way to skip scanning for specific file extensions
- tls_passthrough: true disables both request AND response scanning (we want request scanning to stay on)
Issue 2: Global configuration only
All of pipelock's scanning rules are global. If route A wants to skip .whl scanning and route B wants to skip .tar.gz, there's nowhere to express that distinction — the config is flat.
Issue 3: LLM prompt-specific false positives
Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.
Issue 4: No prompt injection detection
Important clarification: Pipelock does NOT detect prompt injections. It detects:
- Token patterns (regex)
- Entropy (random-looking strings)
- BIP-39 seed phrases (12+ word checksums)
But it cannot detect semantic attacks like:
- Attempts to exfiltrate system prompts
- Jailbreak attempts ("ignore previous instructions")
- Model output that reveals internal system details
This is a novel threat specific to LLM agents that pipelock wasn't designed for.
Replacement design: mitmproxy-based DLP addon
Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:
Architecture
Agent
↓ (HTTP_PROXY=http://egress:8080)
Egress (mitmproxy)
├─ Addon 1: Path allowlisting (current)
├─ Addon 2: Credential injection (current)
└─ Addon 3: DLP scanning (NEW)
├─ Config: per-route scanning rules from manifest
├─ Detectors: token patterns, prompt injection, entropy
└─ Action: block/warn based on route config
Per-route configuration in manifest
Routes separately configure outbound (request to upstream) and inbound (response from upstream) scanning:
egress:
  routes:
    - host: api.anthropic.com
      dlp:
        outbound_detectors: [token_patterns, known_secrets]  # default
        inbound_detectors: [naive_injection_detection]       # default
    - host: files.pythonhosted.org
      dlp:
        outbound_detectors: [token_patterns, known_secrets]
        inbound_detectors: false  # Skip response scanning (binary downloads)
    - host: internal-service.corp
      dlp:
        outbound_detectors: false
        inbound_detectors: false  # Trusted internal, no scanning
Detectors:
- token_patterns: API keys, GitHub tokens, AWS credentials, etc.
- known_secrets: Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
- naive_injection_detection: Semantic attacks on system prompt (see section below)
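The lookup implied by this manifest can be sketched as follows. This is a minimal illustration, not the existing addon code: the names DlpPolicy and resolve_policy are hypothetical, the manifest is assumed to be already parsed into a dict, and falling back to the default detector sets for an unlisted host is an assumption (the egress allowlist may reject such hosts before DLP ever runs).

```python
from dataclasses import dataclass

# Defaults mirror the "# default" comments in the manifest example.
DEFAULT_OUTBOUND = ("token_patterns", "known_secrets")
DEFAULT_INBOUND = ("naive_injection_detection",)


@dataclass(frozen=True)
class DlpPolicy:
    """Detector names to run on requests (outbound) and responses (inbound)."""
    outbound: tuple
    inbound: tuple


def _normalize(value, default):
    """Map a manifest value to a tuple of detector names."""
    if value is None:   # key omitted: use the default set
        return tuple(default)
    if value is False:  # explicit `false`: scanning disabled
        return ()
    return tuple(value)


def resolve_policy(manifest, host):
    """Return the DlpPolicy for `host` (hypothetical helper)."""
    for route in manifest.get("egress", {}).get("routes", []):
        if route.get("host") == host:
            dlp = route.get("dlp", {})
            return DlpPolicy(
                outbound=_normalize(dlp.get("outbound_detectors"), DEFAULT_OUTBOUND),
                inbound=_normalize(dlp.get("inbound_detectors"), DEFAULT_INBOUND),
            )
    # Assumption: unlisted hosts get the defaults rather than no scanning.
    return DlpPolicy(DEFAULT_OUTBOUND, DEFAULT_INBOUND)
```

With this shape, a response hook would call resolve_policy once per flow and skip body decoding entirely when policy.inbound is empty, which is what makes the .whl case cheap.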
Detector design
Three core detectors, each with tunable sensitivity:
- Token detector
  - Regex patterns for API keys (AWS AKIA, GitHub ghp_, etc.)
  - Anthropic/OpenAI API keys
  - OAuth tokens (Bearer patterns)
  - Action: Block immediately with no false-positive tolerance
- Entropy detector
  - Shannon entropy threshold (bits/char)
  - Flags high-entropy secrets (tunable per-route)
  - Current pipelock default: 4.5 bits/char
  - Action: Warn or block based on route config
- Prompt injection detector (phase 2)
  - Detect attempts to exfiltrate system prompts via LLM outputs
  - Pattern: responses containing "system prompt", "instructions", "directive" + credential
  - Action: Block or sample for audit
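The first two detectors can be sketched in a few lines. This is an illustration under assumptions: the function names, the two-pattern subset, and the min_len cutoff are hypothetical; only the AKIA/ghp_ prefixes and the 4.5 bits/char threshold come from the text above.

```python
import math
import re
from collections import Counter

# Illustrative subset of token patterns; a real list would be longer.
TOKEN_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}


def shannon_entropy(s):
    """Shannon entropy of `s` in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def find_tokens(text):
    """Names of token patterns that match anywhere in `text`."""
    return [name for name, pat in TOKEN_PATTERNS.items() if pat.search(text)]


def high_entropy_words(text, threshold=4.5, min_len=20):
    """Whitespace-delimited words of at least `min_len` chars above `threshold` bits/char."""
    return [
        w for w in text.split()
        if len(w) >= min_len and shannon_entropy(w) > threshold
    ]
```

One consequence worth keeping in mind when tuning: per-character entropy is bounded by log2 of the number of distinct characters, so a string needs at least 23 distinct characters to exceed 4.5 bits/char. Short or hex-only secrets can never trip that threshold, which is one reason the entropy check complements the token patterns rather than replacing them.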
Advantages over pipelock
| Aspect | Pipelock | Mitmproxy addon |
|---|---|---|
| Per-route rules | ❌ (global only) | ✅ (manifest-driven) |
| Response-specific config | ❌ (all-or-nothing) | ✅ (request_only, skip_extensions) |
| Request scanning overhead | ✅ (lightweight) | ~same |
| Maintenance burden | Low (third-party) | High (custom code) |
| Auditability | External Go codebase (harder to audit) | ✅ (in-repo) |
| AI-specific detection | Limited | ✅ (token patterns, prompt injection) |
| Code reuse | None | ✅ (egress addon framework) |
Disadvantages
- Maintenance responsibility — We own the security logic. Any bugs in detector regexes or entropy thresholds are our problem.
- Feature parity gap — Pipelock's BIP-39 detector is sophisticated. We'd need to decide: replicate it, skip it, or ship a simplified version.
- Performance — Custom Python detectors will be slower than pipelock's Go implementation. Benchmarking needed.
- Coverage breadth — Pipelock covers generic DLP (credit cards, SSNs, etc.). We'd focus narrowly on AI/credential exfil.
Alternative: Configurable pipelock fork
Rather than build from scratch, fork pipelock and add response_body_scanning config:
response_body_scanning:
  enabled: true
  skip_extensions: [".whl", ".tar.gz"]
  max_response_bytes: 104857600  # 100MB
Pros:
- Reuses existing detectors and maturity
- Lower maintenance burden
- Clear path to upstream (could be PR'd)
Cons:
- Still maintains a fork
- Pipelock's maintainers may not want per-host rules upstream
- Go code is farther from our codebase (harder to audit)
- Doesn't solve prompt-injection detection
Recommendation
Build the mitmproxy addon (phase 1: tokens + entropy; phase 2: prompt injection).
Rationale:
- Bot-bottle already owns the mitmproxy egress addon — extending it keeps security logic in-repo and auditable.
- Per-route DLP configuration aligns with bot-bottle's design (PRD 0017 is already per-route).
- Replacing pipelock reduces sidecar count and operational surface.
- AI-specific detectors (tokens, prompt injection) matter more than generic DLP for agent containment.
Fallback: If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.
Naive prompt injection detector design
Since pipelock doesn't detect prompt injections, we need a custom detector. Here's a permissive design that favors missing attacks over false positives:
What to detect
High confidence (block immediately):
- Response contains known credential pattern + "system prompt" phrase together
- Response contains both "instructions" and a token pattern
Medium confidence (warn):
- Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
- Multiple jailbreak keywords in single response
Ignore (too noisy):
- Single jailbreak keywords without additional context
- "system prompt" in documentation contexts
- Common phrases like "instructions provided"
Naive detector pseudocode
import re


class PromptInjectionDetector:
    # Phrases that suggest prompt exfiltration
    DISCLOSURE_PHRASES = [
        r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?)',
        r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
    ]
    # Phrases suggesting jailbreak attempts
    JAILBREAK_PHRASES = [
        r'(?i)(ignore\s+previous|forget\s+everything|disregard)',
        r'(?i)(from\s+now\s+on|pretend|act\s+as)',
        r'(?i)(bypass|circumvent|override)',
    ]
    TOKEN_PATTERNS = [
        r'AKIA[0-9A-Z]{16}',               # AWS
        r'ghp_[A-Za-z0-9_]{36}',           # GitHub
        r'sk_live_[A-Za-z0-9]{24}',        # Stripe
        r'Bearer\s+[A-Za-z0-9._-]{50,}',   # JWT-like tokens
    ]

    def scan_response(self, response_body):
        """Returns (severity, reason) or (None, None) if clean."""
        # Rule 1: Disclosure + token = HIGH confidence block
        disclosure_found = any(
            re.search(phrase, response_body)
            for phrase in self.DISCLOSURE_PHRASES
        )
        token_found = any(
            re.search(pattern, response_body)
            for pattern in self.TOKEN_PATTERNS
        )
        if disclosure_found and token_found:
            return ("BLOCK", "Prompt disclosure with embedded credential")

        # Rule 2: Two or more jailbreak phrase groups matched = WARN
        # (each entry in JAILBREAK_PHRASES counts at most once)
        jailbreak_count = sum(
            1 for phrase in self.JAILBREAK_PHRASES
            if re.search(phrase, response_body)
        )
        if jailbreak_count >= 2:
            return ("WARN", f"{jailbreak_count} jailbreak phrase groups matched")

        # Rule 3: Disclosure alone without tokens = WARN only if very explicit
        if disclosure_found and "system prompt:" in response_body.lower():
            return ("WARN", "Explicit system prompt disclosure")

        # Otherwise: clean
        return (None, None)
Why this is permissive
- Single keywords ignored — "ignore previous instructions" in a legitimate conversation doesn't trigger
- Context required — disclosure phrases need tokens or multiple jailbreak attempts
- Documentation exemption — "instructions provided" in a help section won't block
- Warn vs. block — Only block on high-confidence signals; warn on medium
- No entropy-based guessing — We don't try to be clever about detecting obfuscated prompts
False negatives this misses
This detector intentionally lets through:
- Prompt injections using novel phrasing we haven't seen
- Obfuscated jailbreak attempts ("behave differently", "role-play")
- Exfiltration via indirect methods ("describe the system", "what are your constraints")
- Sophisticated attacks that split the prompt across multiple exchanges
Rationale: Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.
Per-route configuration
Routes can enable/disable prompt injection scanning:
egress:
  routes:
    - host: api.anthropic.com
      dlp:
        enabled: true
        detectors: [tokens, prompt_injection]
    - host: internal-docs.corp
      dlp:
        enabled: true
        detectors: [tokens]  # Skip prompt injection (trusted internal)
Implementation phases
Phase 1: Secret exfiltration detection
Goal: Prevent credentials from leaking to upstream services
- Token patterns detector — API keys, GitHub tokens, AWS credentials (regex-based)
- Known secrets detector — Check if provisioned credentials appear in outbound traffic
- Secrets passed to cred-proxy or agent environment
- Multiple encodings (base64, hex, URL-encoded variants)
- Outbound scanning by default — enabled for all routes unless explicitly disabled
- Per-route config: outbound_detectors: [token_patterns, known_secrets]
- Action: Block immediately on token match; warn on entropy threshold (tuned conservatively to avoid false positives)
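The known-secrets check with multiple encodings could look roughly like the sketch below. The helper names and the `secrets` mapping (name to provisioned value) are hypothetical; how bot-bottle actually exposes provisioned credentials to the addon is not specified here.

```python
import base64
from urllib.parse import quote


def secret_variants(secret):
    """Encodings of `secret` that an exfiltrating request might plausibly contain."""
    raw = secret.encode("utf-8")
    return {
        secret,                                         # verbatim
        base64.b64encode(raw).decode("ascii"),          # standard base64
        base64.urlsafe_b64encode(raw).decode("ascii"),  # URL-safe base64
        raw.hex(),                                      # lowercase hex
        raw.hex().upper(),                              # uppercase hex
        quote(secret, safe=""),                         # percent-encoded
    }


def find_known_secrets(body, secrets):
    """Names of provisioned secrets that appear in `body` in any variant."""
    hits = []
    for name, value in secrets.items():
        if not value:
            continue  # never match on an empty secret
        if any(variant in body for variant in secret_variants(value)):
            hits.append(name)
    return hits
```

A limitation to be explicit about: base64 output depends on byte alignment, so this only catches a secret that was encoded on its own. A secret embedded in the middle of a larger base64-encoded payload produces different characters and would need the payload decoded first.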
Phase 2: Prompt injection detection
Goal: Prevent agents from exfiltrating system prompts or being jailbroken
Option A: Naive pattern-based detector
- Naive injection detector — as sketched above
- Inbound scanning by default — enabled for all routes unless explicitly disabled
- Per-route config: inbound_detectors: [naive_injection_detection]
- Actions:
  - BLOCK: Credential + prompt disclosure detected
  - WARN: Multiple jailbreak keywords or explicit prompt disclosure
  - ALLOW: Single keywords or documentation phrases
Option B: LLM-based semantic detector
See section below on using a specialized LLM for prompt injection detection.
Phase 3: Hardening & tuning
- Real-world false positive analysis from Phase 1 & 2
- Rate limiting on DLP blocks
- Audit/sampling mode for flagged responses
- Additional encodings for known_secrets (GZIP, base32, etc.)
LLM-based prompt injection detection
Viability analysis
Tradeoff: Using an LLM to detect prompt injections is semantically more powerful than regex, but has latency and resource costs.
Requirements for bot-bottle:
- Sub-100ms latency (add-on to HTTP proxy, can't block traffic significantly)
- <1GB RAM footprint (runs in sidecar alongside mitmproxy)
- Simple API (classify: safe/injection/suspicious)
- Preferably quantized/distilled (not full-size models)
Feasibility: Marginal. Regex patterns are faster, but an LLM could catch sophisticated attacks.
Existing models
Purpose-built prompt injection detectors:
- Rebuff.ai's Prompt Injection API (closed-source, commercial)
  - Hosted detection service
  - ~50ms per request
  - Not viable (external dependency, adds latency)
- Microsoft's Presidio + custom rules
  - Entity recognition + PII detection
  - Broader than prompt injection
  - Would need custom training for jailbreak/disclosure patterns
- HuggingFace models:
  - roberta-large-openai-detector: detects GPT-2 text (not injections)
  - No off-the-shelf model specifically for prompt injection
Training a custom model:
- Data: Dataset of prompt injection attempts vs. legitimate responses (limited public datasets)
- Architecture: Binary classifier (DistilBERT, ALBERT) fine-tuned on injection examples
- Size: DistilBERT ~268MB at fp32, ~67MB with 8-bit quantization (acceptable footprint)
- Latency: ~50-150ms per response on CPU (concerning for proxy)
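The size figures follow from simple arithmetic on the parameter count. The sketch below assumes roughly 66 million parameters for DistilBERT (the commonly cited figure) and counts only weight storage, ignoring tokenizer files and runtime overhead.

```python
DISTILBERT_PARAMS = 66_000_000  # approximate published parameter count


def model_size_mb(n_params, bits_per_weight):
    """Approximate weight storage in MB (10^6 bytes), ignoring file overhead."""
    return n_params * bits_per_weight / 8 / 1e6


for bits in (32, 8, 4):
    print(f"{bits:>2}-bit weights: ~{model_size_mb(DISTILBERT_PARAMS, bits):.0f} MB")
```

This gives about 264 MB at fp32, 66 MB at 8 bits, and 33 MB at 4 bits, so a footprint near 67MB corresponds to 8-bit weights. Either quantized size fits comfortably inside the <1GB sidecar budget; latency, not memory, is the binding constraint.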
Recommendation
Phase 2a: Use naive pattern detector (regex-based, sketched above)
- Fast (<5ms per response)
- Low false positives with permissive rules
- No external dependencies
Phase 2b (optional, if needed): Evaluate LLM approach
- Collect real-world false negatives from pattern detector
- If sophisticated attacks slip through, consider DistilBERT-based classifier
- Quantize + run locally in sidecar
- Benchmark against 100ms latency budget
- Fall back to patterns if latency unacceptable
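A small harness is enough to check a detector against the latency budget before committing to either approach. This is a sketch under assumptions: `scan_fn` stands in for whichever detector is under test, and judging the budget at the 95th percentile is a choice made here, not something the requirements above specify.

```python
import statistics
import time


def benchmark(scan_fn, samples, repeats=20):
    """Time scan_fn over each sample body; return latency stats in milliseconds."""
    timings_ms = []
    for _ in range(repeats):
        for body in samples:
            start = time.perf_counter()
            scan_fn(body)
            timings_ms.append((time.perf_counter() - start) * 1000)
    # n=100 yields 99 cut points; "inclusive" interpolates within the observed
    # range instead of extrapolating past the slowest measurement.
    cuts = statistics.quantiles(timings_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(timings_ms)}


def within_budget(report, budget_ms=100.0):
    """True if the 95th-percentile latency fits the budget."""
    return report["p95"] <= budget_ms
```

The sample bodies should come from recorded agent traffic with realistic size distribution, since both regex scanning and model inference scale with response length.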
Why not jump to LLM:
- Latency: 50-150ms adds significant overhead to every response
- Complexity: Custom model training needed; no off-the-shelf solution
- Overkill: Pattern detector catches obvious attacks; sophisticated attacks are rare
- Unknown unknowns: Adversaries can evade LLM-based detectors via adversarial prompts
If we do build an LLM detector
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


# Sketch of LLM-based detection. The model is injected so the sketch does not
# depend on any particular inference framework.
class LLMPromptInjectionDetector:
    def __init__(self, score_fn, fallback_fn, max_chars=2000):
        # score_fn: text -> injection probability in [0.0, 1.0], e.g. a
        # quantized DistilBERT fine-tuned on injection examples (~67MB)
        self.score_fn = score_fn
        # fallback_fn: text -> (verdict, confidence), e.g. an adapter
        # around the pattern detector
        self.fallback_fn = fallback_fn
        self.max_chars = max_chars
        self.pool = ThreadPoolExecutor(max_workers=1)

    def scan_response(self, response_body, timeout_ms=100):
        """
        Returns: (verdict, confidence)
        - verdict: "safe", "suspicious", "injection"
        - confidence: 0.0-1.0
        """
        # Truncate the input to bound inference time
        future = self.pool.submit(self.score_fn, response_body[:self.max_chars])
        try:
            # Stop waiting after timeout_ms to avoid a proxy bottleneck.
            # The abandoned call still runs to completion in the worker.
            injection_score = future.result(timeout=timeout_ms / 1000)
        except FutureTimeout:
            # On timeout, fall back to the pattern detector
            return self.fallback_fn(response_body)

        if injection_score > 0.9:
            return ("injection", injection_score)
        elif injection_score > 0.7:
            return ("suspicious", injection_score)
        else:
            return ("safe", injection_score)
Deployment questions:
- Which LLM framework? (transformers, ONNX, TensorRT?)
- How to handle out-of-memory on large responses?
- How to update model if new jailbreak techniques emerge?
- Should we ensemble: LLM + patterns for high-confidence blocks?
CaMeL: Architectural prompt injection defense
CaMeL (CApabilities for MachinE Learning) is a Google DeepMind research project that takes a fundamentally different approach to prompt injection. Rather than detecting injections at the network layer (like everything above), CaMeL prevents them by architecture — it makes prompt injections structurally unable to cause harm even when the underlying LLM is fully compromised.
How CaMeL works
CaMeL wraps the agent in a protective execution layer with three components:
- Privileged LLM (P-LLM): Sees ONLY the user's trusted query. Generates a plan as restricted Python code: which tools to call, in what order, with what data flow. Never sees untrusted content (tool outputs, external data).
- Quarantined LLM (Q-LLM): Processes untrusted data (parsing emails, extracting names from documents). Has NO tool access. Even if prompt-injected, it can only return parsed data; it cannot change which tools run or how they're called.
- CaMeL interpreter: Executes the P-LLM's plan, maintains a data flow graph tracking every value's provenance (where it came from, who can read it), and enforces capability-based security policies before each tool call.
Example flow for "Send Bob the document from our meeting notes":
P-LLM generates plan (trusted, no external data):
    notes = find_meeting_notes()
    doc_name = query_quarantined_llm(f"extract doc name from {notes}")
    doc = fetch_document(doc_name)
    address = query_quarantined_llm(f"extract Bob's email from {notes}")
    send_email(doc, address)
Interpreter executes:
    1. find_meeting_notes() → tags result as source:notes, readers:[user]
    2. Q-LLM extracts doc name → even if injected, Q-LLM can't call tools
    3. fetch_document() → tags result with source:drive
    4. Q-LLM extracts email → even if injected to return attacker@evil.com,
       the capability check on send_email() catches that `address` came from
       an untrusted source and blocks or prompts the user
The key insight: control flow is determined by the trusted P-LLM's code, not by data content. Untrusted data can never alter which tools run. And capabilities track provenance so that even manipulated data values get caught at tool-call time.
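The provenance idea can be illustrated with a toy model. To be clear, this is not CaMeL's API or implementation: Tagged, derive, PolicyViolation and the single-rule policy are all hypothetical names invented for this sketch, and the real interpreter tracks far richer capabilities (readers as well as sources) and can prompt the user instead of refusing outright.

```python
from dataclasses import dataclass

TRUSTED_SOURCES = frozenset({"user"})


@dataclass(frozen=True)
class Tagged:
    """A value plus the set of sources it was derived from."""
    value: object
    sources: frozenset


class PolicyViolation(Exception):
    """Raised when a tool call would act on data from untrusted sources."""


def derive(value, *inputs):
    """Build a Tagged whose provenance is the union of its inputs' provenance."""
    merged = frozenset().union(*(i.sources for i in inputs))
    return Tagged(value, merged)


def send_email(doc, address):
    """Tool wrapper: refuse recipients that depend on untrusted data."""
    untrusted = address.sources - TRUSTED_SOURCES
    if untrusted:
        raise PolicyViolation(
            f"recipient derived from untrusted sources: {sorted(untrusted)}"
        )
    return f"sent {doc.value!r} to {address.value}"


# The Q-LLM's output inherits the provenance of the notes it parsed, so an
# injected address is rejected no matter what the text of the notes says.
notes = Tagged("Bob's address is attacker@evil.com", frozenset({"notes"}))
address = derive("attacker@evil.com", notes)
doc = Tagged("plan.pdf", frozenset({"drive"}))
try:
    send_email(doc, address)
except PolicyViolation as exc:
    print("blocked:", exc)
```

The check never inspects the address string itself. The decision rests entirely on where the value came from, which is why this style of defense does not depend on recognizing injection phrasing.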
Performance
On the AgentDojo benchmark, CaMeL completes 77% of tasks with provable security guarantees, versus 84% for an undefended agent. The 7-percentage-point utility cost comes from the interpreter occasionally being too conservative (blocking legitimate operations where data provenance is ambiguous).
Integration with bot-bottle: assessment
CaMeL is NOT a replacement for pipelock or a network-layer DLP scanner. It operates at a completely different layer — it's an agent execution framework, not a proxy. It wouldn't help with the original problem (scanning .whl downloads for credentials).
However, CaMeL is deeply relevant to bot-bottle's broader security model:
| Layer | Current bot-bottle | CaMeL equivalent |
|---|---|---|
| Network egress | Pipelock (hostname allowlist + DLP) | N/A (doesn't operate here) |
| Credential injection | Egress addon (per-route auth) | N/A |
| Tool access control | None (agent has full permissions) | Capability-based policies |
| Data provenance | None | Data flow graph |
| Control flow integrity | None (agent decides everything) | P-LLM generates plan, interpreter enforces |
What CaMeL would add that bot-bottle lacks today:
- Data flow tracking — bot-bottle controls which hosts an agent can reach, but not what data flows to those hosts. CaMeL tracks provenance per-value.
- Tool-call policies — bot-bottle doesn't restrict which tools an agent calls or what arguments it passes. CaMeL enforces policies at every tool invocation.
- Separation of planning and execution — bot-bottle gives the agent full autonomy. CaMeL splits planning (trusted) from data processing (untrusted).
Why CaMeL is NOT viable for bot-bottle today:
- Research artifact, not production software. The README explicitly warns: "the interpreter implementation likely contains bugs...and might not be fully secure." Apache-2.0 licensed but no maintenance commitment.
- Requires restructuring the agent. CaMeL doesn't wrap an existing agent; it replaces the agent's execution model. Claude Code / Codex would need to be fundamentally rearchitected to generate CaMeL-compatible plans instead of directly calling tools. This is not a drop-in.
- LLM overhead. CaMeL requires two LLM calls per step (P-LLM for planning, Q-LLM for data parsing). For a coding agent that makes hundreds of tool calls per session, this doubles API costs and adds significant latency.
- Utility cost. A 7-percentage-point drop in task completion on AgentDojo. For a coding agent where correctness matters, even small degradation in capability could be unacceptable.
- Scope mismatch. CaMeL protects against prompt injection via untrusted data sources. Bot-bottle's primary threat model is credential exfiltration and sandbox escape, a different attack surface.
Verdict
Don't integrate CaMeL now. It solves a real problem (prompt injection via data flow manipulation) but at a layer bot-bottle doesn't currently operate at, and with maturity/integration costs that are too high.
Watch it for the future. If CaMeL matures into a production-ready library, its capability model could complement bot-bottle's network-layer controls — bot-bottle handles "which hosts can the agent reach" while CaMeL handles "what data can flow to those hosts." The combination would be defense-in-depth across both network and application layers.
For now, our phases stand: Phase 1 (outbound secret exfiltration via DLP addon) and Phase 2 (inbound prompt injection via naive pattern detector) address bot-bottle's immediate needs at the network layer where we already operate.
Open questions
- Performance: How much latency does Python string-matching add? Benchmark against pipelock.
- False positives: Will entropy detector trip on legitimate high-entropy traffic (e.g., binary API responses)? Need real-world testing.
- Coverage: Are regex patterns sufficient, or do we need more sophisticated token detection (e.g., format validation)?
- Upstream: If we build this, should we upstream it as an option to pipelock, or keep it bot-bottle-specific?
- CaMeL long-term: Monitor the project for production readiness. If it stabilizes, evaluate as a complementary application-layer defense alongside our network-layer DLP.