docs: accommodate PR feedback on detector architecture
test / unit (pull_request) Successful in 42s
test / integration (pull_request) Successful in 50s

Per feedback from PR 192:

- Restructure around outbound_detectors (requests to upstream) and
  inbound_detectors (responses from upstream)
- Rename to 'secret exfiltration' detection for Phase 1
- Add 'known_secrets' detector for provisioned credentials
- Make scanning enabled by default per detector type
- Clarify that multiple encodings of secrets should be checked

Phase 1 now focuses on preventing outbound credential leaks.
Phase 2 handles inbound prompt injection attacks.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
This commit is contained in:
2026-06-04 13:54:46 -04:00
parent d3c2d9e8f6
commit 49f77f2d1e
+163 -17
View File
@@ -41,6 +41,20 @@ All of pipelock's scanning rules are global. If route A wants to skip `.whl` sca
Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle disables this detector globally, sacrificing protection.
### Issue 4: No prompt injection detection
**Important clarification:** Pipelock does NOT detect prompt injections. It detects:
- Token patterns (regex)
- Entropy (random-looking strings)
- BIP-39 seed phrases (12+ word checksums)
But it cannot detect semantic attacks like:
- Attempts to exfiltrate system prompts
- Jailbreak attempts ("ignore previous instructions")
- Model output that reveals internal system details
This is a novel threat specific to LLM agents that pipelock wasn't designed for.
## Replacement design: mitmproxy-based DLP addon
Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:
@@ -61,25 +75,32 @@ Egress (mitmproxy)
### Per-route configuration in manifest
Routes separately configure **outbound** (request to upstream) and **inbound** (response from upstream) scanning:
```yaml
egress:
routes:
- host: api.anthropic.com
dlp:
enabled: true
detectors: [tokens, entropy]
outbound_detectors: [token_patterns, known_secrets] # default
inbound_detectors: [naive_injection_detection] # default
- host: files.pythonhosted.org
dlp:
enabled: true
request_only: true # Scan outbound, skip response
skip_extensions: [".whl", ".tar.gz"]
outbound_detectors: [token_patterns, known_secrets]
inbound_detectors: false # Skip response scanning (binary downloads)
- host: internal-service.corp
dlp:
enabled: false # Trusted internal, no scanning
outbound_detectors: false
inbound_detectors: false # Trusted internal, no scanning
```
**Detectors:**
- `token_patterns` — API keys, GitHub tokens, AWS credentials, etc.
- `known_secrets` — Secrets we provisioned (API keys, OAuth tokens passed via cred-proxy)
- `naive_injection_detection` — Semantic attacks on system prompt (see section below)
### Detector design
Three core detectors, each with tunable sensitivity:
@@ -154,22 +175,147 @@ response_body_scanning:
**Fallback:** If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.
## Naive prompt injection detector design
Since pipelock doesn't detect prompt injections, we need a custom detector. Here's a permissive design that favors missing attacks over false positives:
### What to detect
**High confidence (block immediately):**
1. Response contains known credential pattern + "system prompt" phrase together
2. Response contains both "instructions" and a token pattern
**Medium confidence (warn):**
1. Response contains prompt-disclosure phrases without credentials (might be innocent documentation)
2. Multiple jailbreak keywords in single response
**Ignore (too noisy):**
- Single jailbreak keywords without additional context
- "system prompt" in documentation contexts
- Common phrases like "instructions provided"
### Naive detector pseudocode
```python
class PromptInjectionDetector:
# Phrases that suggest prompt exfiltration
DISCLOSURE_PHRASES = [
r'(?i)(system\s+prompt|instructions\s+given|your\s+role\s+is|you\s+are\s+an?)',
r'(?i)(original\s+instructions|secret\s+instructions|hidden\s+rules)',
]
# Phrases suggesting jailbreak attempts
JAILBREAK_PHRASES = [
r'(?i)(ignore\s+previous|forget\s+everything|disregard)',
r'(?i)(from\s+now\s+on|pretend|act\s+as)',
r'(?i)(bypass|circumvent|override)',
]
TOKEN_PATTERNS = [
r'AKIA[0-9A-Z]{16}', # AWS
r'ghp_[A-Za-z0-9_]{36}', # GitHub
r'sk_live_[A-Za-z0-9]{24}', # Stripe
r'Bearer\s+[A-Za-z0-9._-]{50,}', # JWT-like tokens
]
def scan_response(self, response_body):
"""Returns (severity, reason) or (None, None) if clean."""
# Rule 1: Disclosure + token = HIGH confidence block
disclosure_found = any(
re.search(phrase, response_body)
for phrase in self.DISCLOSURE_PHRASES
)
token_found = any(
re.search(pattern, response_body)
for pattern in self.TOKEN_PATTERNS
)
if disclosure_found and token_found:
return ("BLOCK", "Prompt disclosure with embedded credential")
# Rule 2: Multiple jailbreak keywords = WARN
jailbreak_count = sum(
1 for phrase in self.JAILBREAK_PHRASES
if re.search(phrase, response_body)
)
if jailbreak_count >= 2:
return ("WARN", f"{jailbreak_count} jailbreak attempts detected")
# Rule 3: Disclosure alone without tokens = WARN only if very explicit
if disclosure_found and "system prompt:" in response_body.lower():
return ("WARN", "Explicit system prompt disclosure")
# Otherwise: clean
return (None, None)
```
### Why this is permissive
1. **Single keywords ignored** — "ignore previous instructions" in a legitimate conversation doesn't trigger
2. **Context required** — disclosure phrases need tokens or multiple jailbreak attempts
3. **Documentation exemption** — "instructions provided" in a help section won't block
4. **Warn vs. block** — Only block on high-confidence signals; warn on medium
5. **No entropy-based guessing** — We don't try to be clever about detecting obfuscated prompts
### False negatives this misses
This detector intentionally lets through:
- Prompt injections using novel phrasing we haven't seen
- Obfuscated jailbreak attempts ("behave differently", "role-play")
- Exfiltration via indirect methods ("describe the system", "what are your constraints")
- Sophisticated attacks that split the prompt across multiple exchanges
**Rationale:** Better to miss a sophisticated jailbreak than block legitimate agent output 100 times/day.
### Per-route configuration
Routes can enable/disable prompt injection scanning:
```yaml
egress:
routes:
- host: api.anthropic.com
dlp:
enabled: true
detectors: [tokens, prompt_injection]
- host: internal-docs.corp
dlp:
enabled: true
detectors: [tokens] # Skip prompt injection (trusted internal)
```
## Implementation phases
### Phase 1: MVP (2-3 weeks)
- Token detector (regex for API key patterns)
- Entropy detector (reuse pipelock thresholds)
- Per-route `dlp: {enabled, request_only, skip_extensions}` config
- Block on token match, warn on entropy hit
### Phase 1: Secret exfiltration detection (2-3 weeks)
**Goal:** Prevent credentials from leaking to upstream services
### Phase 2: Prompt injection (1-2 weeks)
- Pattern detector for system prompt exfiltration
- Integrates with phase 1 config
- **Token patterns detector** — API keys, GitHub tokens, AWS credentials (regex-based)
- **Known secrets detector** — Check if provisioned credentials appear in outbound traffic
- Secrets passed to cred-proxy or agent environment
- Multiple encodings (base64, hex, URL-encoded variants)
- **Outbound scanning by default** — enabled for all routes unless explicitly disabled
- **Per-route config:** `outbound_detectors: [token_patterns, known_secrets]`
- **Action:** Block immediately on token match; warn on entropy threshold (tuned low to avoid false positives)
### Phase 3: Hardening (optional)
- Custom entropy heuristics for LLM payloads
- Sampling/audit mode for high-entropy responses
### Phase 2: Prompt injection detection (1-2 weeks)
**Goal:** Prevent agents from exfiltrating system prompts or being jailbroken
- **Naive injection detector** — as sketched above
- **Inbound scanning by default** — enabled for all routes unless explicitly disabled
- **Per-route config:** `inbound_detectors: [naive_injection_detection]`
- **Actions:**
- BLOCK: Credential + prompt disclosure detected
- WARN: Multiple jailbreak keywords or explicit prompt disclosure
- ALLOW: Single keywords or documentation phrases
### Phase 3: Hardening & tuning (2-3 weeks, optional)
- Real-world false positive analysis from Phase 1 & 2
- Rate limiting on DLP blocks
- Audit/sampling mode for flagged responses
- Additional encodings for known_secrets (GZIP, base32, etc.)
## Open questions