docs: research document on DLP alternatives to pipelock
Investigates replacing pipelock with a custom mitmproxy-based DLP addon that supports per-route configuration, response-specific rules, and AI-specific threat detection (tokens, prompt injection). Recommends building the addon in-repo to align with bot-bottle's per-route design model and keep security logic auditable.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@@ -0,0 +1,179 @@
# DLP alternatives to pipelock: per-route configuration and response handling

## Question

Pipelock lacks support for per-route or per-host response scanning rules, making it impossible to skip DLP scanning for large binary downloads (e.g., `.whl` files) while keeping scanning enabled for other traffic on the same host. Should we replace pipelock with a purpose-built DLP/token-scanning proxy that supports granular per-route configuration?

## Summary

Yes. Pipelock's flat, global configuration is fundamentally at odds with the per-route model bot-bottle is built on. A custom or configurable DLP proxy built atop mitmproxy (which we already use for egress) would let us:

1. **Skip DLP scanning selectively**: e.g., scan responses from PyPI for credentials but skip scanning `.whl` file contents
2. **Configure scanning per-route**: different rules for different hosts/paths without global toggles
3. **Reduce operational surface**: one proxy (egress) instead of two (egress + pipelock)
4. **Target AI-specific threats**: focus on credential exfiltration and prompt injection instead of generic DLP

**Tradeoff:** We'd need to maintain our own scanning logic. Pipelock provides out-of-the-box BIP-39 seed-phrase detection, entropy checks, and pluggable DLP rules. Building custom logic means we need to be explicit about what we're protecting against and keep that code auditable.

## Current pipelock limitations

### Issue 1: No per-route response scanning rules

Pipelock's response scanning is part of TLS interception, a global feature with no per-host knobs:

```yaml
tls_interception:
  enabled: true
  passthrough_domains: [...]  # Can skip MITM, but not just response scanning
```

**Status:** Tested with pipelock v2.3.0. Confirmed that:

- `response_body_scanning` config field doesn't exist
- No way to set per-host response size limits
- No way to skip scanning for specific file extensions
- `tls_passthrough: true` disables both request AND response scanning (we want request scanning to stay on)

### Issue 2: Global configuration only

All of pipelock's scanning rules are global. If route A wants to skip `.whl` scanning and route B wants to skip `.tar.gz`, there's nowhere to express that distinction because the config is flat.

### Issue 3: LLM prompt-specific false positives

Pipelock's BIP-39 seed-phrase detector fires on any run of 12 or more words from the BIP-39 English wordlist that passes the checksum. The wordlist consists of 2048 common English words and a 12-word phrase carries only a 4-bit checksum, so ordinary prose in LLM prompts and responses trips it often. Bot-bottle disables this detector globally, giving up seed-phrase protection entirely.

## Replacement design: mitmproxy-based DLP addon

Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules.

### Architecture

```
Agent
  ↓ (HTTP_PROXY=http://egress:8080)
Egress (mitmproxy)
  ├─ Addon 1: Path allowlisting (current)
  ├─ Addon 2: Credential injection (current)
  └─ Addon 3: DLP scanning (NEW)
      ├─ Config: per-route scanning rules from manifest
      ├─ Detectors: token patterns, prompt injection, entropy
      └─ Action: block/warn based on route config
```
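
As a rough illustration of how Addon 3 could plug into the existing proxy, the sketch below shows a minimal mitmproxy addon that blocks an outbound request when a scan function reports a finding. The names `DlpAddon` and `scan_body`, and the single AWS-style pattern, are placeholders rather than bot-bottle's actual code. The mitmproxy import is deferred so the scanning logic can be exercised without a running proxy.

```python
import re
from typing import Optional

# Placeholder pattern: AWS access key IDs are "AKIA" followed by 16
# uppercase alphanumerics. A real detector set would be larger.
AWS_KEY_RE = re.compile(rb"AKIA[0-9A-Z]{16}")


def scan_body(body: bytes) -> Optional[str]:
    """Return a short reason string if the body looks sensitive, else None."""
    if AWS_KEY_RE.search(body):
        return "aws-access-key-id"
    return None


class DlpAddon:
    """Hypothetical Addon 3: scans outbound request bodies."""

    def request(self, flow) -> None:
        # mitmproxy calls this hook once the full request has been read.
        reason = scan_body(flow.request.content or b"")
        if reason is None:
            return
        # Deferred import: only needed when we actually block.
        from mitmproxy import http

        # Setting flow.response inside the request hook short-circuits the
        # request, so nothing is forwarded upstream.
        flow.response = http.Response.make(
            403,
            f"Blocked by DLP: {reason}".encode(),
            {"Content-Type": "text/plain"},
        )


# mitmproxy discovers addons through this module-level list.
addons = [DlpAddon()]
```

A `response` hook with the same shape would cover inbound bodies; whether it runs at all for a given flow is what the per-route configuration decides.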

### Per-route configuration in manifest

```yaml
egress:
  routes:
    - host: api.anthropic.com
      dlp:
        enabled: true
        detectors: [tokens, entropy]

    - host: files.pythonhosted.org
      dlp:
        enabled: true
        request_only: true  # Scan outbound, skip response
        skip_extensions: [".whl", ".tar.gz"]

    - host: internal-service.corp
      dlp:
        enabled: false  # Trusted internal, no scanning
```
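
One way the addon might turn these manifest entries into a scan decision is sketched below. The `DlpPolicy` dataclass, the `POLICIES` table, and `should_scan` are hypothetical names. Two behaviours are assumptions rather than anything the manifest format specifies: `skip_extensions` is applied to responses only, and hosts without a route entry fall back to full scanning.

```python
from dataclasses import dataclass
from typing import Dict, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DlpPolicy:
    """Mirror of one route's `dlp:` block (field names follow the manifest)."""
    enabled: bool = True
    request_only: bool = False
    skip_extensions: Tuple[str, ...] = ()


# Hypothetical in-memory form of the manifest above, keyed by host.
POLICIES: Dict[str, DlpPolicy] = {
    "api.anthropic.com": DlpPolicy(),
    "files.pythonhosted.org": DlpPolicy(
        request_only=True, skip_extensions=(".whl", ".tar.gz")
    ),
    "internal-service.corp": DlpPolicy(enabled=False),
}

# Assumption: hosts with no route entry get full scanning (fail closed).
DEFAULT_POLICY = DlpPolicy()


def should_scan(host: str, url_path: str, is_response: bool) -> bool:
    """Decide whether a request or response body should be scanned."""
    policy = POLICIES.get(host, DEFAULT_POLICY)
    if not policy.enabled:
        return False
    if is_response:
        if policy.request_only:
            return False
        # Match on the path only, ignoring any query string.
        path = urlsplit(url_path).path.lower()
        if path.endswith(policy.skip_extensions):
            return False
    return True
```

With this table, an upload to `files.pythonhosted.org` is still scanned while the `.whl` download coming back is not, which is the distinction pipelock cannot express.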

### Detector design

Three core detectors, each with tunable sensitivity:

1. **Token detector**
   - Regex patterns for API keys (AWS `AKIA`, GitHub `ghp_`, etc.)
   - Anthropic/OpenAI API keys
   - OAuth tokens (Bearer patterns)
   - Action: Block immediately on any match (patterns must be precise, since a false positive blocks legitimate traffic)

2. **Entropy detector**
   - Shannon entropy threshold (bits/char)
   - Flags high-entropy secrets (tunable per-route)
   - Current pipelock default: 4.5 bits/char
   - Action: Warn or block based on route config

3. **Prompt injection detector** (phase 2)
   - Detect attempts to exfiltrate system prompts via LLM outputs
   - Pattern: responses containing "system prompt", "instructions", "directive" + credential
   - Action: Block or sample for audit
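
The first two detectors can be sketched in a few lines of standard-library Python. The patterns below are illustrative only and would need checking against each provider's current key format. `TOKEN_PATTERNS`, `find_tokens`, `high_entropy_strings`, and the minimum candidate length of 20 are assumptions made for this example, while the 4.5 bits/char threshold is the pipelock default cited above.

```python
import math
import re
from collections import Counter
from typing import List

# Illustrative patterns only; real key formats vary and change over time.
TOKEN_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "bearer-token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
}

ENTROPY_THRESHOLD = 4.5  # bits/char, the pipelock default cited above
MIN_CANDIDATE_LEN = 20   # assumption: ignore shorter strings


def find_tokens(text: str) -> List[str]:
    """Return the names of all token patterns that match the text."""
    return [name for name, pat in TOKEN_PATTERNS.items() if pat.search(text)]


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per char."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def high_entropy_strings(
    text: str, threshold: float = ENTROPY_THRESHOLD
) -> List[str]:
    """Return token-like substrings whose entropy meets the threshold."""
    candidates = re.findall(r"[A-Za-z0-9+/=_-]{%d,}" % MIN_CANDIDATE_LEN, text)
    return [c for c in candidates if shannon_entropy(c) >= threshold]
```

A string of n characters can reach at most log2(n) bits/char, so at a 4.5 threshold nothing shorter than 23 characters can ever be flagged. That bound is worth keeping in mind when tuning the threshold per route.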

### Advantages over pipelock

| Aspect | Pipelock | Mitmproxy addon |
|--------|----------|-----------------|
| Per-route rules | ❌ (global only) | ✅ (manifest-driven) |
| Response-specific config | ❌ (all-or-nothing) | ✅ (`request_only`, `skip_extensions`) |
| Request scanning overhead | ✅ (lightweight, Go) | Likely higher (Python); needs benchmarking |
| Maintenance burden | Low (third-party) | High (custom code) |
| Auditability | External Go codebase (harder to audit) | ✅ (in-repo) |
| AI-specific detection | Limited | ✅ (token patterns, prompt injection) |
| Code reuse | None | ✅ (egress addon framework) |

### Disadvantages

1. **Maintenance responsibility**: We own the security logic. Any bugs in detector regexes or entropy thresholds are our problem.
2. **Feature parity gap**: Pipelock's BIP-39 detector is sophisticated. We'd need to decide: replicate it, skip it, or ship a simplified version.
3. **Performance**: Custom Python detectors will likely be slower than pipelock's Go implementation. Benchmarking is needed.
4. **Coverage breadth**: Pipelock covers generic DLP (credit cards, SSNs, etc.). We'd focus narrowly on AI/credential exfil.

## Alternative: Configurable pipelock fork

Rather than build from scratch, fork pipelock and add `response_body_scanning` config:

```yaml
response_body_scanning:
  enabled: true
  skip_extensions: [".whl", ".tar.gz"]
  max_response_bytes: 104857600  # 100 MiB
```

**Pros:**
- Reuses existing detectors and maturity
- Lower maintenance burden
- Clear path to upstream (could be PR'd)

**Cons:**
- We would still have to maintain a fork
- Pipelock's maintainers may not want per-host rules in an otherwise global config
- Go code is farther from our codebase (harder to audit)
- Doesn't solve prompt-injection detection

## Recommendation

**Build the mitmproxy addon** (phase 1: tokens + entropy; phase 2: prompt injection).

**Rationale:**
1. Bot-bottle already owns the mitmproxy egress addon, so extending it keeps security logic in-repo and auditable.
2. Per-route DLP configuration aligns with bot-bottle's design (PRD 0017 is already per-route).
3. Replacing pipelock reduces sidecar count and operational surface.
4. AI-specific detectors (tokens, prompt injection) matter more than generic DLP for agent containment.

**Fallback:** If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.

## Implementation phases

### Phase 1: MVP (2-3 weeks)
- Token detector (regex for API key patterns)
- Entropy detector (reuse pipelock thresholds)
- Per-route `dlp: {enabled, request_only, skip_extensions}` config
- Block on token match, warn on entropy hit

### Phase 2: Prompt injection (1-2 weeks)
- Pattern detector for system prompt exfiltration
- Integrates with phase 1 config

### Phase 3: Hardening (optional)
- Custom entropy heuristics for LLM payloads
- Sampling/audit mode for high-entropy responses
- Rate limiting on DLP blocks

## Open questions

1. **Performance:** How much latency does Python string-matching add? Benchmark against pipelock.
2. **False positives:** Will the entropy detector trip on legitimate high-entropy traffic (e.g., binary API responses)? This needs real-world testing.
3. **Coverage:** Are regex patterns sufficient, or do we need more sophisticated token detection (e.g., format validation)?
4. **Upstream:** If we build this, should we upstream it as an option to pipelock, or keep it bot-bottle-specific?
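
A possible starting point for the performance question is a micro-benchmark of a single regex pass over a large body. The sketch below measures only the regex cost in isolation, not proxy or TLS overhead, and makes no claim about how the result compares with pipelock. The combined pattern and the helper name `time_scan` are assumptions for illustration.

```python
import base64
import os
import re
import time

# Two illustrative token patterns combined into one alternation.
PATTERN = re.compile(rb"AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}")


def time_scan(payload: bytes, repeats: int = 5) -> float:
    """Return the best-of-N wall-clock seconds for one regex pass."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        PATTERN.search(payload)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # 768 KiB of random bytes encodes to 1 MiB of base64 text, used here
    # as a stand-in for a large response body.
    payload = base64.b64encode(os.urandom(768 * 1024))
    elapsed_ms = time_scan(payload) * 1e3
    print(f"{len(payload)} bytes scanned in {elapsed_ms:.2f} ms")
```

Running the same measurement inside the addon against recorded agent traffic would give a more representative number than synthetic payloads.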