docs: PRD 0053 — egress DLP addon (token, secret, injection detection)

Adds the product requirements document for replacing pipelock's DLP capability with a per-route mitmproxy addon. Covers three implementation chunks: token-pattern detection, known-secret detection, and naive prompt injection scanning. References the research in PR #192 and issue #195.
2026-06-05 00:34:55 +00:00
parent eafd1c1fb2
commit f145203eee
1 changed files with 291 additions and 0 deletions
@@ -0,0 +1,291 @@
+# PRD 0053: Egress DLP addon
+
+- **Status:** Draft
+- **Author:** claude
+- **Created:** 2026-06-05
+- **Issue:** #195
+
+## Summary
+
+With pipelock removed (PR #193), the egress proxy no longer performs DLP
+scanning on traffic to or from the agent. This PRD specifies a replacement
+directly inside the mitmproxy egress addon: per-route DLP detectors that
+scan outbound requests for credential leakage and inbound responses for
+prompt injection attempts. Configuration is expressed as a new `dlp` block
+on each `egress.routes` entry in the bottle manifest.
+
+The design follows the recommendation in the [DLP research document
+(PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192) and
+covers all three remaining implementation phases from that plan:
+
+1. Token pattern detection (Phase 1a)
+2. Known-secrets detection (Phase 1b)
+3. Naive prompt injection detection (Phase 2)
+
+## Problem
+
+Pipelock was removed because it could not support per-route response
+scanning, which ruled out selective DLP policies (e.g., skipping scans of
+`.whl` downloads while still scanning API calls). Removing it left the egress
+proxy with no DLP capability at all. The egress addon already holds per-route
+logic for path allowlisting and credential injection; DLP rules belong in the
+same place.
+
+## Goals / Success Criteria
+
+1. Outbound request bodies and headers are scanned for known token patterns
+   (AWS, GitHub, Anthropic, etc.) before the request reaches the upstream.
+   Matches are blocked immediately.
+2. Outbound request bodies are scanned for provisioned secrets that the
+   agent should not have direct access to. Matches are blocked immediately.
+3. Inbound response bodies are scanned for prompt disclosure and jailbreak
+   signals. High-confidence matches are blocked; medium-confidence matches
+   emit a log warning and are forwarded.
+4. DLP scanning is enabled by default on every route. Individual routes can
+   selectively disable outbound detectors, inbound detectors, or both via a
+   `dlp` block in the manifest.
+5. All detector logic lives in `egress_addon_core.py` (pure Python, no
+   mitmproxy dependency) and is covered by unit tests on the host.
+6. The `dlp` block is optional and backward-compatible: a route that omits
+   it entirely behaves as if all detectors are enabled.
+
+## Non-goals
+
+- LLM-based semantic prompt injection detection (explicitly deferred to a
+  potential Phase 2b per the research doc).
+- Entropy-based secret detection (excluded from scope; too many false
+  positives on binary API responses and compressed payloads).
+- BIP-39 seed-phrase detection.
+- Generic DLP (credit cards, SSNs, PII) — scope is narrow: AI/credential
+  exfil relevant to agent containment.
+- Changes to the cred-proxy sidecar.
+- Streaming response scanning (scan buffered response body only).
+
+## Design
+
+### Manifest schema — `dlp` block
+
+Each `egress.routes` entry gains an optional `dlp` key:
+
+```yaml
+egress:
+  routes:
+    - host: api.anthropic.com
+      # dlp omitted → all detectors on (default)
+
+    - host: files.pythonhosted.org
+      dlp:
+        inbound_detectors: false   # skip response scanning (binary downloads)
+
+    - host: internal-docs.corp
+      dlp:
+        outbound_detectors: false
+        inbound_detectors: false   # trusted internal, no scanning
+```
+
+`outbound_detectors` controls scanning of the *request* body + headers
+leaving the agent. `inbound_detectors` controls scanning of the *response*
+body arriving from the upstream.
+
+Valid values per field:
+- Omitted (or `null`) — default: all detectors active.
+- `false` — scanning disabled for this direction on this route.
+- A list of detector names — only the listed detectors run.
+
+Named outbound detectors: `token_patterns`, `known_secrets`.
+Named inbound detectors: `naive_injection_detection`.
+
+The manifest parser (`manifest_egress.py`) validates the `dlp` block and
+rejects unknown detector names.
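
The validation rules above could be sketched roughly as follows. This is a minimal illustration, not the actual `manifest_egress.py` code: the function names `parse_detector_field` and `parse_dlp_block` are hypothetical, and rejecting `true` is an assumption drawn from the list of valid values.

```python
# Detector names taken from the PRD text; everything else here is illustrative.
OUTBOUND = ("token_patterns", "known_secrets")
INBOUND = ("naive_injection_detection",)


def parse_detector_field(value, known, field):
    """Map one `dlp` field to None (all on), () (off), or a tuple of names."""
    if value is None:
        return None      # omitted or null: every detector runs
    if value is False:
        return ()        # explicit false: scanning disabled for this direction
    if isinstance(value, list) and all(isinstance(n, str) for n in value):
        unknown = [n for n in value if n not in known]
        if unknown:
            raise ValueError(f"{field}: unknown detector(s) {unknown}")
        return tuple(value)
    raise ValueError(f"{field}: expected false, null, or a list of names")


def parse_dlp_block(dlp):
    """Return (outbound, inbound) for a route's optional `dlp` mapping."""
    if dlp is None:
        dlp = {}
    if not isinstance(dlp, dict):
        raise ValueError("dlp: expected a mapping")
    extra = set(dlp) - {"outbound_detectors", "inbound_detectors"}
    if extra:
        raise ValueError(f"dlp: unknown key(s) {sorted(extra)}")
    return (
        parse_detector_field(dlp.get("outbound_detectors"), OUTBOUND, "outbound_detectors"),
        parse_detector_field(dlp.get("inbound_detectors"), INBOUND, "inbound_detectors"),
    )
```

For the `files.pythonhosted.org` route in the example manifest, `parse_dlp_block({"inbound_detectors": False})` yields `(None, ())`: outbound defaults stay on, inbound scanning is off.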
+
+### `EgressRoute` changes
+
+`EgressRoute` gains two new fields:
+
+```python
+@dataclass(frozen=True)
+class EgressRoute:
+    Host: str
+    PathAllowlist: tuple[str, ...] = ()
+    AuthScheme: str = ""
+    TokenRef: str = ""
+    Role: tuple[str, ...] = ()
+    OutboundDetectors: tuple[str, ...] | None = None   # None = all enabled
+    InboundDetectors: tuple[str, ...] | None = None    # None = all enabled
+```
+
+`None` means "use defaults" (all detectors active); an empty tuple `()` means
+"disabled" (manifest value `false`). A list of names in the manifest becomes
+a `tuple[str, ...]` holding exactly those detector names.
+
+`manifest_egress.py` uses `from_dict` to parse the new `dlp` block and
+populate these fields; unknown keys inside `dlp` are rejected.
+
+### `Route` changes in `egress_addon_core.py`
+
+The addon-side `Route` dataclass mirrors the manifest-side change:
+
+```python
+@dataclass(frozen=True)
+class Route:
+    host: str
+    path_allowlist: tuple[str, ...] = ()
+    auth_scheme: str = ""
+    token_env: str = ""
+    outbound_detectors: tuple[str, ...] | None = None
+    inbound_detectors: tuple[str, ...] | None = None
+```
+
+`parse_routes` / `_parse_one` grow the corresponding parsing logic.
+
+### Detector interface
+
+Each detector is a pure function:
+
+```python
+def scan(body: str | bytes, *, env: Mapping[str, str] | None = None) -> ScanResult | None:
+    ...
+```
+
+`ScanResult` carries:
+
+```python
+@dataclass(frozen=True)
+class ScanResult:
+    severity: str   # "block" or "warn"
+    reason: str
+```
+
+`scan` returns `None` if the body is clean, `ScanResult` otherwise.
+
+### Detector: `token_patterns`
+
+Regex patterns for well-known credential formats, applied to the outbound
+request body and `Authorization` header (before the addon strips it — the
+strip happens after DLP scanning so that the scan sees any credential the
+agent tried to smuggle):
+
+| Token type | Pattern |
+|------------|---------|
+| AWS access key | `AKIA[0-9A-Z]{16}` |
+| GitHub token (classic) | `ghp_[A-Za-z0-9_]{36}` |
+| GitHub fine-grained | `github_pat_[A-Za-z0-9_]{82}` |
+| Anthropic API key | `sk-ant-[A-Za-z0-9\-_]{93}` |
+| OpenAI API key | `sk-[A-Za-z0-9]{48}` |
+| Stripe live key | `sk_live_[A-Za-z0-9]{24}` |
+| Generic Bearer JWT | `Bearer\s+[A-Za-z0-9._\-]{50,}` |
+
+Action: `"block"` on any match. No tolerance — a credential in an outbound
+request is always a violation.
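
A minimal sketch of how the table above might be turned into a scanner. The regexes are copied from the table; the dictionary keys and the `find_token` helper are illustrative names, not part of the planned module.

```python
import re

# Patterns copied from the table above; the labels are illustrative.
TOKEN_PATTERNS = {
    "aws_access_key": r"AKIA[0-9A-Z]{16}",
    "github_classic": r"ghp_[A-Za-z0-9_]{36}",
    "github_fine_grained": r"github_pat_[A-Za-z0-9_]{82}",
    "anthropic_api_key": r"sk-ant-[A-Za-z0-9\-_]{93}",
    "openai_api_key": r"sk-[A-Za-z0-9]{48}",
    "stripe_live_key": r"sk_live_[A-Za-z0-9]{24}",
    "bearer_jwt": r"Bearer\s+[A-Za-z0-9._\-]{50,}",
}
COMPILED = {name: re.compile(pattern) for name, pattern in TOKEN_PATTERNS.items()}


def find_token(text):
    """Return the label of the first matching pattern, or None if clean."""
    if isinstance(text, bytes):
        # Undecodable bytes are replaced rather than raising on binary bodies.
        text = text.decode("utf-8", errors="replace")
    for name, pattern in COMPILED.items():
        if pattern.search(text):
            return name
    return None
```

Returning only the pattern label, rather than the matched text, keeps the credential itself out of any block reason or log line; whether the real detector does the same is a design choice this PRD leaves open.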
+
+### Detector: `known_secrets`
+
+At request time the egress addon has access to `os.environ`, which includes
+all `token_env` values declared by route auth blocks. The detector:
+
+1. Collects all `EGRESS_TOKEN_*` values from the environment (the naming
+   contract established by `manifest_egress.py`'s `TokenRef` rendering).
+2. For each secret value, derives encoded variants: raw, base64, URL-encoded,
+   hex.
+3. Scans the outbound request body for any variant.
+
+Action: `"block"` on match.
+
+This detector does **not** accept a custom detector name in the YAML — it
+is always named `known_secrets`. The environment is passed in via the `env`
+keyword argument to `scan`.
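
The three steps above could look roughly like this. It is a sketch under assumptions: `secret_variants` and `find_known_secret` are hypothetical helper names, and plain substring matching is assumed as the scan strategy.

```python
import base64
import urllib.parse


def secret_variants(secret):
    """Return the raw secret plus its base64, URL-encoded, and hex forms."""
    raw = secret.encode("utf-8")
    return {
        secret,
        base64.b64encode(raw).decode("ascii"),
        urllib.parse.quote(secret, safe=""),
        raw.hex(),
    }


def find_known_secret(body, env, prefix="EGRESS_TOKEN_"):
    """Return the name of the env var whose value appears in body, else None."""
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    for name, value in env.items():
        if not name.startswith(prefix) or not value:
            continue  # skip unrelated variables and empty values
        if any(variant in body for variant in secret_variants(value)):
            return name
    return None
```

One limitation of this approach: a secret that is base64-encoded as part of a larger string will generally not match, because the encoding depends on byte alignment within the surrounding data.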
+
+### Detector: `naive_injection_detection`
+
+Pattern-based inbound response scanner. Uses two tiers:
+
+**Tier 1 — BLOCK (credential + disclosure together):**
+- Response contains a token-pattern match (reuses `token_patterns` regex
+  set) AND a prompt-disclosure phrase (e.g., `system prompt`, `my instructions
+  are`, `hidden rules`).
+
+**Tier 2 — WARN (multiple jailbreak signals):**
+- Two or more jailbreak phrases detected (e.g., `ignore previous`,
+  `forget everything`, `pretend you are`, `act as`).
+- OR explicit prompt disclosure (`system prompt:`) without a credential.
+
+**Tier 3 — ALLOW:**
+- Single jailbreak keyword without additional context.
+- Common documentation phrases.
+
+See the research doc for the full phrase lists and pseudocode.
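
As a rough illustration of the tiering, the sketch below uses only the example phrases quoted above, not the research doc's full lists, and a two-pattern stand-in for the `token_patterns` regex set. The `classify_response` name and the case-insensitive substring matching are assumptions.

```python
import re

# Only the example phrases from this section; the full lists live in the research doc.
DISCLOSURE_PHRASES = ("system prompt", "my instructions are", "hidden rules")
JAILBREAK_PHRASES = ("ignore previous", "forget everything", "pretend you are", "act as")
# Stand-in for the full token_patterns regex set.
TOKEN_RE = re.compile(r"AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9_]{36}")


def classify_response(text):
    """Return "block", "warn", or None (allow) for a response body."""
    lowered = text.lower()
    has_token = TOKEN_RE.search(text) is not None
    has_disclosure = any(phrase in lowered for phrase in DISCLOSURE_PHRASES)
    jailbreak_hits = sum(1 for phrase in JAILBREAK_PHRASES if phrase in lowered)

    if has_token and has_disclosure:
        return "block"   # Tier 1: credential and disclosure together
    if jailbreak_hits >= 2 or "system prompt:" in lowered:
        return "warn"    # Tier 2: several jailbreak signals, or explicit disclosure
    return None          # Tier 3: single keyword or ordinary documentation text
```

Substring matching is deliberately crude here (for example, `act as` also matches inside `react as`), which is one reason a single hit falls into Tier 3 rather than triggering a warning.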
+
+### Wiring into `egress_addon.py`
+
+The existing `request` hook gains an outbound scan, and a new `response`
+hook is added for inbound scanning:
+
+```python
+def request(self, flow: http.HTTPFlow) -> None:
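+    # ... existing route matching + path-allowlist logic ...
+    # After the route decision, if action == "forward", scan first so the
+    # detectors see the agent's original Authorization header:
+    result = scan_outbound(route, flow.request, os.environ)
+    if result and result.severity == "block":
+        flow.response = http.Response.make(403, result.reason.encode(), ...)
+        return
+    # ... existing Authorization strip + credential injection ...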
+
+def response(self, flow: http.HTTPFlow) -> None:
+    route = match_route(self.routes, flow.request.pretty_host)
+    if route is None:
+        return  # already blocked at request time
+    result = scan_inbound(route, flow.response)
+    if result and result.severity == "block":
+        flow.response = http.Response.make(403, result.reason.encode(), ...)
+    elif result and result.severity == "warn":
+        sys.stderr.write(f"egress DLP warn: {result.reason}\n")
+```
+
+`scan_outbound` and `scan_inbound` are pure functions in
+`egress_addon_core.py` that dispatch to the per-route detector list.
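
The dispatch step might be sketched as below. The `run_detectors` helper, the registry dict keyed by detector name, and the "first block wins, otherwise first warn" precedence are all assumptions made for illustration; the PRD does not fix them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanResult:
    severity: str   # "block" or "warn"
    reason: str


def run_detectors(configured, registry, body, env=None):
    """Run the detectors selected for one direction of one route.

    configured: None (all registered), () (none), or a tuple of names.
    registry: dict mapping detector name to a scan(body, *, env) callable.
    Returns the first "block" result, else the first "warn", else None.
    """
    # Test against None explicitly: an empty tuple must mean "disabled",
    # so a truthiness check would wrongly re-enable every detector.
    names = tuple(registry) if configured is None else configured
    first_warn = None
    for name in names:
        result = registry[name](body, env=env or {})
        if result is None:
            continue
        if result.severity == "block":
            return result
        if first_warn is None:
            first_warn = result
    return first_warn
```

Under this shape, `scan_outbound` and `scan_inbound` would differ only in which registry and which route field they pass in.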
+
+### Ordering: auth strip vs. DLP scan
+
+The DLP outbound scan sees the *agent's original* `Authorization` header
+before the addon strips it. This ensures that a token the agent smuggled
+in the header is caught. The strip + optional re-injection still happens
+afterward, preserving the existing credential-injection security model.
+
+## Implementation chunks
+
+1. **Manifest `dlp` block + `EgressRoute` fields.**
+   `manifest_egress.py`: parse `dlp`, add `OutboundDetectors` /
+   `InboundDetectors` to `EgressRoute`. Extend
+   `tests/unit/test_manifest_egress.py` with `dlp` valid/invalid cases.
+   `egress_addon_core.py`: add `outbound_detectors` / `inbound_detectors`
+   to `Route`; update `_parse_one` and `parse_routes`; extend
+   `tests/unit/test_egress_addon_core.py`.
+
+2. **Token-patterns detector (Phase 1a).**
+   New module `bot_bottle/dlp_detectors.py` (host-importable) and
+   companion flat copy for the sidecar bundle. Add `TokenPatternsDetector`
+   with the regex set above. Wire `scan_outbound` into the `request` hook
+   in `egress_addon.py`. Unit tests in
+   `tests/unit/test_dlp_detectors.py`.
+
+3. **Known-secrets detector (Phase 1b).**
+   Add `KnownSecretsDetector` to `dlp_detectors.py`. Collect
+   `EGRESS_TOKEN_*` from env; derive encoded variants; scan request body.
+   Extend unit tests. Wire into `scan_outbound`.
+
+4. **Naive prompt injection detector (Phase 2).**
+   Add `NaiveInjectionDetector` to `dlp_detectors.py`. Wire
+   `scan_inbound` into the new `response` hook in `egress_addon.py`.
+   Extend unit tests. Activate PRD 0053 (`Status: Draft → Active`) in
+   this commit.
+
+## Open questions
+
+1. **Response body buffering:** mitmproxy's `response` hook already has
+   the full body for non-streaming responses. For streaming (chunked)
+   responses the body may be empty or incomplete at hook time. Scope for
+   now: log a warning and skip scanning on streaming responses; revisit
+   if needed.
+2. **Encoding breadth for `known_secrets`:** Start with raw + base64 +
+   URL-encoded + hex. Add gzip / base32 if real-world evasion attempts
+   appear.
+3. **`EGRESS_TOKEN_*` naming contract:** The detector relies on the
+   env-var naming convention from `manifest_egress.py`. If that contract
+   changes, the detector must be updated in lock-step.