Investigates replacing pipelock with a custom mitmproxy-based DLP addon that supports per-route configuration, response-specific rules, and AI-specific threat detection (tokens, prompt injection). Recommends building the addon in-repo to align with bot-bottle's per-route design model and keep security logic auditable. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
DLP alternatives to pipelock: per-route configuration and response handling
Question
Pipelock lacks support for per-route or per-host response scanning rules, making it impossible to skip DLP scanning for large binary downloads (e.g., .whl files) while keeping scanning enabled for other traffic on the same host. Should we replace pipelock with a purpose-built DLP/token-scanning proxy that supports granular per-route configuration?
Summary
Yes. Pipelock's flat, global configuration is fundamentally at odds with the per-route model bot-bottle is built on. A custom or configurable DLP proxy built atop mitmproxy (which we already use for egress) would let us:
- Skip DLP scanning selectively: e.g., scan responses from PyPI for credentials but skip scanning .whl file contents
- Configure scanning per-route: different rules for different hosts/paths without global toggles
- Reduce operational surface: one proxy (egress) instead of two (egress + pipelock)
- Target AI-specific threats: focus on credential exfiltration and prompt injection instead of generic DLP
Tradeoff: We'd need to maintain our own scanning logic. Pipelock provides out-of-the-box BIP-39 seed-phrase detection, entropy checks, and pluggable DLP rules. Building custom logic means we need to be explicit about what we're protecting against and keep that code auditable.
Current pipelock limitations
Issue 1: No per-route response scanning rules
Pipelock's response scanning is part of TLS interception — a global feature with no per-host knobs:
tls_interception:
  enabled: true
  passthrough_domains: [...]  # Can skip MITM, but not just response scanning
Status: Tested with pipelock v2.3.0. Confirmed that:
- response_body_scanning config field doesn't exist
- No way to set per-host response size limits
- No way to skip scanning for specific file extensions
- tls_passthrough: true disables both request AND response scanning (we want request scanning to stay on)
Issue 2: Global configuration only
All of pipelock's scanning rules are global. If route A wants to skip .whl scanning and route B wants to skip .tar.gz, there's nowhere to express that distinction — the config is flat.
Issue 3: LLM prompt-specific false positives
Pipelock's BIP-39 seed-phrase detector fires on any 12+ English words matching a checksum, which is common in LLM prompts/responses. Bot-bottle therefore disables this detector globally, giving up seed-phrase protection on every route.
Replacement design: mitmproxy-based DLP addon
Since bot-bottle already uses mitmproxy for egress (PRD 0017), we can extend the mitmproxy addon to do DLP scanning alongside egress rules:
Architecture
Agent
  ↓ (HTTP_PROXY=http://egress:8080)
Egress (mitmproxy)
  ├─ Addon 1: Path allowlisting (current)
  ├─ Addon 2: Credential injection (current)
  └─ Addon 3: DLP scanning (NEW)
       ├─ Config: per-route scanning rules from manifest
       ├─ Detectors: token patterns, prompt injection, entropy
       └─ Action: block/warn based on route config
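To make the "Addon 3" box concrete, here is a minimal sketch of how a DLP addon could hook into mitmproxy's request and response events. The class name `DlpAddon`, the `route_actions` mapping, and the single placeholder detector are hypothetical and not taken from the existing egress code; only the `request`/`response` hook names, `pretty_host`, `get_text`, and `http.Response.make` come from mitmproxy itself.

```python
import re

# Placeholder detector so the sketch stays self-contained; a real addon
# would run whichever detectors the route's manifest entry selects.
_AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan(text: str) -> list:
    """Return the names of detectors that fired on text."""
    return ["aws_access_key_id"] if _AWS_KEY.search(text) else []


class DlpAddon:
    """Hypothetical DLP addon: scans bodies and blocks or warns per host."""

    def __init__(self, route_actions=None):
        # host -> "block" | "warn"; hosts not listed default to "block".
        self.route_actions = route_actions or {}
        self.warnings = []

    def _handle(self, flow, direction, text):
        hits = scan(text or "")
        if not hits:
            return
        host = flow.request.pretty_host
        if self.route_actions.get(host, "block") == "warn":
            self.warnings.append((host, direction, hits))
            return
        # Imported lazily so the decision logic can be tested without mitmproxy.
        from mitmproxy import http

        # Setting flow.response replaces whatever would have been returned.
        flow.response = http.Response.make(
            403, b"Blocked by DLP policy\n", {"Content-Type": "text/plain"}
        )

    def request(self, flow):
        self._handle(flow, "request", flow.request.get_text(strict=False))

    def response(self, flow):
        self._handle(flow, "response", flow.response.get_text(strict=False))


# mitmproxy picks up addons from this module-level list.
addons = [DlpAddon()]
```

Keeping the decision logic in `_handle` and the mitmproxy import inside the blocking branch is a design choice for this sketch: it lets the warn path be exercised with stub flow objects in unit tests.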
Per-route configuration in manifest
egress:
  routes:
    - host: api.anthropic.com
      dlp:
        enabled: true
        detectors: [tokens, entropy]
    - host: files.pythonhosted.org
      dlp:
        enabled: true
        request_only: true  # Scan outbound, skip response
        skip_extensions: [".whl", ".tar.gz"]
    - host: internal-service.corp
      dlp:
        enabled: false  # Trusted internal, no scanning
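The manifest above could be resolved into a scan/skip decision roughly as follows. This is a sketch under assumptions: `DlpPolicy`, `load_policies`, and `should_scan` are hypothetical names, the manifest is assumed to be already parsed into a dict, hosts without a `dlp` block are assumed to default to full scanning, and `skip_extensions` is assumed to apply to responses only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DlpPolicy:
    """Per-route DLP settings, mirroring the manifest keys."""
    enabled: bool = True
    request_only: bool = False
    skip_extensions: tuple = ()
    detectors: tuple = ("tokens", "entropy")


# Assumption: routes without a dlp block get scanned in both directions.
DEFAULT_POLICY = DlpPolicy()


def load_policies(manifest: dict) -> dict:
    """Build a host -> DlpPolicy map from an already-parsed manifest dict."""
    policies = {}
    for route in manifest.get("egress", {}).get("routes", []):
        dlp = route.get("dlp", {})
        policies[route["host"]] = DlpPolicy(
            enabled=dlp.get("enabled", True),
            request_only=dlp.get("request_only", False),
            skip_extensions=tuple(dlp.get("skip_extensions", ())),
            detectors=tuple(dlp.get("detectors", DEFAULT_POLICY.detectors)),
        )
    return policies


def should_scan(policies: dict, host: str, path: str, direction: str) -> bool:
    """Decide whether to scan; direction is "request" or "response"."""
    policy = policies.get(host, DEFAULT_POLICY)
    if not policy.enabled:
        return False
    if direction == "response":
        if policy.request_only:
            return False
        # Match the extension against the path without its query string.
        bare_path = path.split("?", 1)[0].lower()
        if bare_path.endswith(policy.skip_extensions):
            return False
    return True
```

With the example manifest, a `.whl` download from files.pythonhosted.org would skip response scanning while the outbound request is still scanned, which is the case pipelock cannot express.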
Detector design
Three core detectors, each with tunable sensitivity:
- Token detector
  - Regex patterns for API keys (AWS AKIA, GitHub ghp_, etc.)
  - Anthropic/OpenAI API keys
  - OAuth tokens (Bearer patterns)
  - Action: Block immediately with no false-positive tolerance
- Entropy detector
  - Shannon entropy threshold (bits/char)
  - Flags high-entropy secrets (tunable per-route)
  - Current pipelock default: 4.5 bits/char
  - Action: Warn or block based on route config
- Prompt injection detector (phase 2)
  - Detect attempts to exfiltrate system prompts via LLM outputs
  - Pattern: responses containing "system prompt", "instructions", "directive" + credential
  - Action: Block or sample for audit
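The token and entropy detectors could look roughly like the sketch below. The regexes are illustrative assumptions rather than a vetted rule set: the exact lengths and character classes would need checking against each provider's documentation, and the candidate pattern and function names are hypothetical. The 4.5 bits/char default is the pipelock value quoted above.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; prefixes and lengths are assumptions to verify.
TOKEN_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
}

# Runs of token-like characters long enough to be worth measuring.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def find_tokens(text: str) -> list:
    """Return the names of all token patterns that match somewhere in text."""
    return [name for name, pat in TOKEN_PATTERNS.items() if pat.search(text)]


def shannon_entropy(s: str) -> float:
    """Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def find_high_entropy(text: str, threshold: float = 4.5) -> list:
    """Return candidate substrings whose entropy meets the threshold."""
    return [
        m.group(0)
        for m in CANDIDATE.finditer(text)
        if shannon_entropy(m.group(0)) >= threshold
    ]
```

One consequence of a 4.5 bits/char threshold: a string needs at least 23 distinct characters to reach it, since 2^4.5 is about 22.6. Short secrets and hex strings (at most 4 bits/char) can never trip this check, so the entropy detector complements the token regexes and cannot replace them.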
Advantages over pipelock
| Aspect | Pipelock | Mitmproxy addon |
|---|---|---|
| Per-route rules | ❌ (global only) | ✅ (manifest-driven) |
| Response-specific config | ❌ (all-or-nothing) | ✅ (request_only, skip_extensions) |
| Request scanning overhead | ✅ (lightweight) | Unbenchmarked (Python, likely slower) |
| Maintenance burden | Low (third-party) | High (custom code) |
| Auditability | Harder (external Go codebase) | ✅ (in-repo) |
| AI-specific detection | Limited | ✅ (token patterns, prompt injection) |
| Code reuse | None | ✅ (egress addon framework) |
Disadvantages
- Maintenance responsibility — We own the security logic. Any bugs in detector regexes or entropy thresholds are our problem.
- Feature parity gap — Pipelock's BIP-39 detector is sophisticated. We'd need to decide: replicate it, skip it, or ship a simplified version.
- Performance — Custom Python detectors will be slower than pipelock's Go implementation. Benchmarking needed.
- Coverage breadth — Pipelock covers generic DLP (credit cards, SSNs, etc.). We'd focus narrowly on AI/credential exfil.
Alternative: Configurable pipelock fork
Rather than build from scratch, fork pipelock and add response_body_scanning config:
response_body_scanning:
  enabled: true
  skip_extensions: [".whl", ".tar.gz"]
  max_response_bytes: 104857600  # 100 MiB
Pros:
- Reuses existing detectors and maturity
- Lower maintenance burden
- Clear path to upstream (could be PR'd)
Cons:
- We would still have to maintain a fork
- Pipelock's maintainers may not want per-host rules upstream
- Go code is farther from our codebase (harder to audit)
- Doesn't solve prompt-injection detection
Recommendation
Build the mitmproxy addon (phase 1: tokens + entropy; phase 2: prompt injection).
Rationale:
- Bot-bottle already owns the mitmproxy egress addon — extending it keeps security logic in-repo and auditable.
- Per-route DLP configuration aligns with bot-bottle's design (PRD 0017 is already per-route).
- Replacing pipelock reduces sidecar count and operational surface.
- AI-specific detectors (tokens, prompt injection) matter more than generic DLP for agent containment.
Fallback: If performance testing shows unacceptable latency in the Python addon, revisit the pipelock fork approach.
Implementation phases
Phase 1: MVP (2-3 weeks)
- Token detector (regex for API key patterns)
- Entropy detector (reuse pipelock thresholds)
- Per-route dlp: {enabled, request_only, skip_extensions} config
- Block on token match, warn on entropy hit
Phase 2: Prompt injection (1-2 weeks)
- Pattern detector for system prompt exfiltration
- Integrates with phase 1 config
Phase 3: Hardening (optional)
- Custom entropy heuristics for LLM payloads
- Sampling/audit mode for high-entropy responses
- Rate limiting on DLP blocks
Open questions
- Performance: How much latency does Python string-matching add? Benchmark against pipelock.
- False positives: Will entropy detector trip on legitimate high-entropy traffic (e.g., binary API responses)? Need real-world testing.
- Coverage: Are regex patterns sufficient, or do we need more sophisticated token detection (e.g., format validation)?
- Upstream: If we build this, should we upstream it as an option to pipelock, or keep it bot-bottle-specific?
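For the performance question, a first measurement of the Python side could be as simple as the sketch below. It times one compiled-regex pass over synthetic bodies of increasing size. The pattern, the body sizes, and the `time_scan` helper are assumptions chosen for illustration. It does not measure pipelock, so a real comparison would need the same payloads pushed through both proxies.

```python
import re
import time

# Illustrative combined pattern; a real run would use the production rule set.
PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b|\bghp_[A-Za-z0-9]{36}\b")


def time_scan(body: str, repeats: int = 5) -> float:
    """Return the best-of-N wall-clock seconds for one regex pass over body."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        PATTERN.search(body)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    filler = "lorem ipsum dolor sit amet "
    for size in (1_000, 100_000, 1_000_000):
        body = (filler * (size // len(filler) + 1))[:size]
        print(f"{size:>9} chars: {time_scan(body) * 1e3:.3f} ms")
```

Taking the best of several repeats reduces noise from scheduling and caching. The numbers only bound the regex cost, not the cost of mitmproxy buffering and decoding the body.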