docs: PRD 0053 — egress DLP addon (token, secret, injection detection)

Adds the product requirements document for replacing pipelock's DLP capability with a per-route mitmproxy addon. Covers three implementation chunks: token-pattern detection, known-secret detection, and naive prompt injection scanning. References the research in PR #192 and issue #195.
2026-06-05 00:34:55 +00:00
parent eafd1c1fb2
commit f145203eee
1 changed files with 291 additions and 0 deletions
@@ -0,0 +1,291 @@
 # PRD 0053: Egress DLP addon
 - **Status:** Draft
 - **Author:** claude
 - **Created:** 2026-06-05
 - **Issue:** #195
 ## Summary
 With pipelock removed (PR #193), the egress proxy no longer performs DLP
 scanning on traffic to or from the agent. This PRD specifies a replacement
 directly inside the mitmproxy egress addon: per-route DLP detectors that
 scan outbound requests for credential leakage and inbound responses for
 prompt injection attempts. Configuration is expressed as a new `dlp` block
 on each `egress.routes` entry in the bottle manifest.
 The design follows the recommendation in the [DLP research document
 (PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192) and
 covers all three remaining implementation phases from that plan:
 1. Token pattern detection (Phase 1a)
 2. Known-secrets detection (Phase 1b)
 3. Naive prompt injection detection (Phase 2)
 ## Problem
 Pipelock was removed because it could not support per-route response
 scanning, which ruled out selective DLP policies (e.g., skipping scans of
 `.whl` downloads while still scanning API calls). Removing it left the egress
 proxy with no DLP capability at all. The egress addon already holds per-route
 logic for path allowlisting and credential injection; DLP rules belong in the
 same place.
 ## Goals / Success Criteria
 1. Outbound request bodies and headers are scanned for known token patterns
   (AWS, GitHub, Anthropic, etc.) before the request reaches the upstream.
   Matches are blocked immediately.
 2. Outbound request bodies are scanned for provisioned secrets that the
   agent should not have direct access to. Matches are blocked immediately.
 3. Inbound response bodies are scanned for prompt disclosure and jailbreak
   signals. High-confidence matches are blocked; medium-confidence matches
   emit a log warning and are forwarded.
 4. DLP scanning is enabled by default on every route. Individual routes can
   selectively disable outbound detectors, inbound detectors, or both via a
   `dlp` block in the manifest.
 5. All detector logic lives in pure-Python modules with no mitmproxy
   dependency (`egress_addon_core.py` and `dlp_detectors.py`) and is covered
   by unit tests on the host.
 6. The `dlp` block is optional and backward-compatible: a route that omits
   it entirely behaves as if all detectors are enabled.
 ## Non-goals
 - LLM-based semantic prompt injection detection (explicitly deferred to a
  potential Phase 2b per the research doc).
 - Entropy-based secret detection (excluded from scope; too many false
  positives on binary API responses and compressed payloads).
 - BIP-39 seed-phrase detection.
 - Generic DLP (credit cards, SSNs, PII) — scope is narrow: AI/credential
  exfil relevant to agent containment.
 - Changes to the cred-proxy sidecar.
 - Streaming response scanning (scan buffered response body only).
 ## Design
 ### Manifest schema — `dlp` block
 Each `egress.routes` entry gains an optional `dlp` key:
 ```yaml
 egress:
  routes:
    - host: api.anthropic.com
      # dlp omitted → all detectors on (default)
    - host: files.pythonhosted.org
      dlp:
        inbound_detectors: false   # skip response scanning (binary downloads)
    - host: internal-docs.corp
      dlp:
        outbound_detectors: false
        inbound_detectors: false   # trusted internal, no scanning
 ```
 `outbound_detectors` controls scanning of the *request* body + headers
 leaving the agent. `inbound_detectors` controls scanning of the *response*
 body arriving from the upstream.
 Valid values per field:
 - Omitted (or `null`) — default: all detectors active.
 - `false` — scanning disabled for this direction on this route.
 - A list of detector names — only the listed detectors run.
 Named outbound detectors: `token_patterns`, `known_secrets`.
 Named inbound detectors: `naive_injection_detection`.
 The manifest parser (`manifest_egress.py`) validates the `dlp` block and
 rejects unknown detector names.
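The validation rules above could be sketched roughly as follows. This is a minimal illustration, not the actual `manifest_egress.py` code: the helper names `parse_detector_field` and `parse_dlp_block` are hypothetical, and treating `true` as an error and an empty list as "disabled" are assumptions the PRD does not spell out.

```python
from __future__ import annotations

from typing import Any

# Detector names come from the PRD; all function names below are hypothetical.
OUTBOUND_DETECTORS = ("token_patterns", "known_secrets")
INBOUND_DETECTORS = ("naive_injection_detection",)

_DLP_KEYS = {
    "outbound_detectors": OUTBOUND_DETECTORS,
    "inbound_detectors": INBOUND_DETECTORS,
}


def parse_detector_field(
    value: Any, known: tuple[str, ...], field: str
) -> tuple[str, ...] | None:
    """Map one `dlp` field to None (all on), () (off), or a tuple of names."""
    if value is None:
        return None
    if value is False:
        return ()
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        unknown = [v for v in value if v not in known]
        if unknown:
            raise ValueError(f"dlp.{field}: unknown detector(s) {unknown}")
        return tuple(value)
    raise ValueError(f"dlp.{field}: expected false, null, or a list of names")


def parse_dlp_block(
    dlp: Any,
) -> tuple[tuple[str, ...] | None, tuple[str, ...] | None]:
    """Return (outbound, inbound) for a route's optional `dlp` mapping."""
    if dlp is None:
        return (None, None)
    if not isinstance(dlp, dict):
        raise ValueError("dlp: expected a mapping")
    extra = set(dlp) - set(_DLP_KEYS)
    if extra:
        raise ValueError(f"dlp: unknown key(s) {sorted(extra)}")
    outbound, inbound = (
        parse_detector_field(dlp.get(key), known, key)
        for key, known in _DLP_KEYS.items()
    )
    return (outbound, inbound)
```

For the `files.pythonhosted.org` route in the example manifest, `parse_dlp_block({"inbound_detectors": False})` would yield `(None, ())`: outbound scanning stays at its default and inbound scanning is off.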
 ### `EgressRoute` changes
 `EgressRoute` gains two new fields:
 ```python
 @dataclass(frozen=True)
 class EgressRoute:
    Host: str
    PathAllowlist: tuple[str, ...] = ()
    AuthScheme: str = ""
    TokenRef: str = ""
    Role: tuple[str, ...] = ()
    OutboundDetectors: tuple[str, ...] | None = None   # None = all enabled
    InboundDetectors: tuple[str, ...] | None = None    # None = all enabled
 ```
 `None` means "use defaults" (all detectors active); an empty tuple `()`
 means "disabled"; a non-empty tuple lists the only detector names that run.
 `manifest_egress.py` uses `from_dict` to parse the new `dlp` block and
 populate these fields; unknown keys inside `dlp` are rejected.
 ### `Route` changes in `egress_addon_core.py`
 The addon-side `Route` dataclass mirrors the manifest-side change:
 ```python
 @dataclass(frozen=True)
 class Route:
    host: str
    path_allowlist: tuple[str, ...] = ()
    auth_scheme: str = ""
    token_env: str = ""
    outbound_detectors: tuple[str, ...] | None = None
    inbound_detectors: tuple[str, ...] | None = None
 ```
 `parse_routes` / `_parse_one` grow the corresponding parsing logic.
 ### Detector interface
 Each detector is a pure function:
 ```python
 def scan(body: str | bytes, *, env: Mapping[str, str] | None = None) -> ScanResult | None:
    ...
 ```
 `ScanResult` carries:
 ```python
 @dataclass(frozen=True)
 class ScanResult:
    severity: str   # "block" or "warn"
    reason: str
 ```
 `scan` returns `None` if the body is clean, `ScanResult` otherwise.
 ### Detector: `token_patterns`
 Regex patterns for well-known credential formats, applied to the outbound
 request body and `Authorization` header (before the addon strips it — the
 strip happens after DLP scanning so that the scan sees any credential the
 agent tried to smuggle):
 | Token type | Pattern |
 |------------|---------|
 | AWS access key | `AKIA[0-9A-Z]{16}` |
 | GitHub token (classic) | `ghp_[A-Za-z0-9_]{36}` |
 | GitHub fine-grained | `github_pat_[A-Za-z0-9_]{82}` |
 | Anthropic API key | `sk-ant-[A-Za-z0-9\-_]{93}` |
 | OpenAI API key | `sk-[A-Za-z0-9]{48}` |
 | Stripe live key | `sk_live_[A-Za-z0-9]{24}` |
 | Generic Bearer JWT | `Bearer\s+[A-Za-z0-9._\-]{50,}` |
 Action: `"block"` on any match. No tolerance — a credential in an outbound
 request is always a violation.
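As a rough sketch, the table above maps onto a small regex registry. The patterns are copied from the table; the function name `scan_token_patterns`, the reason format, and the local `ScanResult` copy are illustrative assumptions. The caller is assumed to pass the request body and the `Authorization` header value (for example, concatenated) as `body`.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanResult:
    severity: str  # "block" or "warn"
    reason: str


# Patterns copied from the PRD table, keyed by token type.
TOKEN_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "GitHub token (classic)": re.compile(r"ghp_[A-Za-z0-9_]{36}"),
    "GitHub fine-grained": re.compile(r"github_pat_[A-Za-z0-9_]{82}"),
    "Anthropic API key": re.compile(r"sk-ant-[A-Za-z0-9\-_]{93}"),
    "OpenAI API key": re.compile(r"sk-[A-Za-z0-9]{48}"),
    "Stripe live key": re.compile(r"sk_live_[A-Za-z0-9]{24}"),
    "Generic Bearer JWT": re.compile(r"Bearer\s+[A-Za-z0-9._\-]{50,}"),
}


def scan_token_patterns(body: str | bytes) -> ScanResult | None:
    """Return a "block" result for the first matching token type, else None."""
    if isinstance(body, bytes):
        # Undecodable bytes become U+FFFD, which no pattern can match.
        text = body.decode("utf-8", errors="replace")
    else:
        text = body
    for name, pattern in TOKEN_PATTERNS.items():
        if pattern.search(text):
            # Report only the token type; never echo the matched credential.
            return ScanResult("block", f"token_patterns: {name}")
    return None
```

Keeping the matched value out of `reason` matters here because the reason string ends up in the 403 body and in logs.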
 ### Detector: `known_secrets`
 At request time the egress addon has access to `os.environ`, which includes
 all `token_env` values declared by route auth blocks. The detector:
 1. Collects all `EGRESS_TOKEN_*` values from the environment (the naming
   contract established by `manifest_egress.py`'s `TokenRef` rendering).
 2. For each secret value, derives encoded variants: raw, base64, URL-encoded,
   hex.
 3. Scans the outbound request body for any variant.
 Action: `"block"` on match.
 This detector does **not** accept a custom detector name in the YAML — it
 is always named `known_secrets`. The environment is passed in via the `env`
 keyword argument to `scan`.
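The three steps above might look like the following sketch. The names `secret_variants` and `find_known_secret` are hypothetical, and the sketch returns the matching variable name rather than a `ScanResult` to stay short. It assumes standard base64, lowercase hex, and full percent-encoding; a secret embedded at a different offset inside a larger base64 blob would not be caught by this approach.

```python
from __future__ import annotations

import base64
from collections.abc import Mapping
from urllib.parse import quote


def secret_variants(secret: str) -> set[bytes]:
    """Derive the raw, base64, URL-encoded, and hex forms of one secret."""
    raw = secret.encode("utf-8")
    return {
        raw,
        base64.b64encode(raw),
        quote(secret, safe="").encode("ascii"),
        raw.hex().encode("ascii"),
    }


def find_known_secret(
    body: bytes, env: Mapping[str, str], prefix: str = "EGRESS_TOKEN_"
) -> str | None:
    """Return the name of the first env var whose value appears in body."""
    for name, value in env.items():
        # Skip unrelated variables, and empty values: b"" is "in" every body.
        if not name.startswith(prefix) or not value:
            continue
        if any(variant in body for variant in secret_variants(value)):
            return name
    return None
```

Returning the variable name instead of the value lets the block reason identify which secret leaked without repeating it.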
 ### Detector: `naive_injection_detection`
 Pattern-based inbound response scanner. Uses two tiers:
 **Tier 1 — BLOCK (credential + disclosure together):**
 - Response contains a token-pattern match (reuses `token_patterns` regex
  set) AND a prompt-disclosure phrase (e.g., `system prompt`, `my instructions
  are`, `hidden rules`).
 **Tier 2 — WARN (multiple jailbreak signals):**
 - Two or more jailbreak phrases detected (e.g., `ignore previous`,
  `forget everything`, `pretend you are`, `act as`).
 - OR explicit prompt disclosure (`system prompt:`) without a credential.
 **Tier 3 — ALLOW:**
 - Single jailbreak keyword without additional context.
 - Common documentation phrases.
 See the research doc for the full phrase lists and pseudocode.
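The tiering can be illustrated with a short sketch. It uses only the example phrases quoted above (the full lists live in the research doc), a single stand-in regex in place of the full `token_patterns` set, and it assumes "two or more jailbreak phrases" means two distinct phrases. The function name `scan_naive_injection` is hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanResult:
    severity: str  # "block" or "warn"
    reason: str


# Example phrases from the PRD text only; not the full research-doc lists.
DISCLOSURE_PHRASES = ("system prompt", "my instructions are", "hidden rules")
JAILBREAK_PHRASES = ("ignore previous", "forget everything", "pretend you are", "act as")
EXPLICIT_DISCLOSURE = "system prompt:"
# Stand-in for the token_patterns regex set (one pattern, for brevity).
TOKEN_RE = re.compile(r"AKIA[0-9A-Z]{16}")


def scan_naive_injection(body: str) -> ScanResult | None:
    """Classify a response body as block (tier 1), warn (tier 2), or clean."""
    lowered = body.lower()
    has_token = TOKEN_RE.search(body) is not None
    has_disclosure = any(p in lowered for p in DISCLOSURE_PHRASES)
    jailbreak_hits = sum(1 for p in JAILBREAK_PHRASES if p in lowered)

    # Tier 1: credential and prompt disclosure together.
    if has_token and has_disclosure:
        return ScanResult("block", "credential with prompt disclosure")
    # Tier 2: several jailbreak signals, or explicit disclosure alone.
    if jailbreak_hits >= 2:
        return ScanResult("warn", f"{jailbreak_hits} jailbreak phrases")
    if EXPLICIT_DISCLOSURE in lowered:
        return ScanResult("warn", "explicit prompt disclosure")
    # Tier 3: a single keyword or an ordinary documentation phrase.
    return None
```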
 ### Wiring into `egress_addon.py`
 One new mitmproxy hook, `response`, is added, and the existing `request` hook is extended:
 ```python
 def request(self, flow: http.HTTPFlow) -> None:
    # ... existing route match + path-allowlist decision ...
    # If action == "forward", scan before the Authorization strip so the
    # scan sees the header exactly as the agent sent it:
    result = scan_outbound(route, flow.request, os.environ)
    if result and result.severity == "block":
        flow.response = http.Response.make(403, result.reason.encode(), ...)
        return
    # ... existing auth strip + credential re-injection ...
 def response(self, flow: http.HTTPFlow) -> None:
    route = match_route(self.routes, flow.request.pretty_host)
    if route is None:
        return  # already blocked at request time
    result = scan_inbound(route, flow.response)
    if result and result.severity == "block":
        flow.response = http.Response.make(403, result.reason.encode(), ...)
    elif result and result.severity == "warn":
        sys.stderr.write(f"egress DLP warn: {result.reason}\n")
 ```
 `scan_outbound` and `scan_inbound` are pure functions in
 `egress_addon_core.py` that dispatch to the per-route detector list.
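One way the dispatch could work is sketched below. The helper name `run_detectors` and the registry shape are hypothetical, and the precedence rule (a `"block"` result short-circuits, otherwise the first `"warn"` is returned) is an assumption; the PRD does not say how multiple results are combined.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ScanResult:
    severity: str  # "block" or "warn"
    reason: str


Detector = Callable[[bytes], Optional[ScanResult]]


def run_detectors(
    configured: tuple[str, ...] | None,
    registry: dict[str, Detector],
    body: bytes,
) -> ScanResult | None:
    """Run the detectors enabled for one direction of one route.

    configured is the route field: None = every registered detector,
    () = none, otherwise only the named ones. Registry order is run order.
    """
    first_warn: ScanResult | None = None
    for name, detector in registry.items():
        if configured is not None and name not in configured:
            continue
        result = detector(body)
        if result is None:
            continue
        if result.severity == "block":
            return result  # a block always wins; stop scanning
        if first_warn is None:
            first_warn = result
    return first_warn
```

Under these assumptions, `scan_outbound` would call this with the route's `outbound_detectors` and the outbound registry, and `scan_inbound` would do the same for the inbound side.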
 ### Ordering: auth strip vs. DLP scan
 The DLP outbound scan sees the *agent's original* `Authorization` header
 before the addon strips it. This ensures that a token the agent smuggled
 in the header is caught. The strip + optional re-injection still happens
 afterward, preserving the existing credential-injection security model.
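The ordering can be made concrete with a small mitmproxy-free sketch. Everything here is hypothetical scaffolding: `process_outbound`, the plain-dict headers, and the `scan` callback stand in for the real addon flow, and `injected_auth` stands in for whatever header value the route's credential injection would produce.

```python
from __future__ import annotations

from typing import Callable, Optional


def process_outbound(
    headers: dict[str, str],
    body: bytes,
    scan: Callable[[bytes], Optional[str]],
    injected_auth: Optional[str] = None,
) -> tuple[int, dict[str, str]]:
    """Return (status, headers_to_forward); 403 means the request was blocked."""
    agent_auth = next(
        (v for k, v in headers.items() if k.lower() == "authorization"), ""
    )
    # 1. Scan first, while the agent-supplied header is still present.
    if scan(body + b"\n" + agent_auth.encode("utf-8")) is not None:
        return 403, {}
    # 2. Only then strip the agent's own Authorization header...
    forwarded = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    # 3. ...and optionally re-inject the route's credential.
    if injected_auth is not None:
        forwarded["Authorization"] = injected_auth
    return 200, forwarded
```

If the strip ran first, step 1 would never see a token the agent placed in the header, which is the case this ordering is meant to catch.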
 ## Implementation chunks
 1. **Manifest `dlp` block + `EgressRoute` fields.**
   `manifest_egress.py`: parse `dlp`, add `OutboundDetectors` /
   `InboundDetectors` to `EgressRoute`. Extend
   `tests/unit/test_manifest_egress.py` with `dlp` valid/invalid cases.
   `egress_addon_core.py`: add `outbound_detectors` / `inbound_detectors`
   to `Route`; update `_parse_one` and `parse_routes`; extend
   `tests/unit/test_egress_addon_core.py`.
 2. **Token-patterns detector (Phase 1a).**
   New module `bot_bottle/dlp_detectors.py` (host-importable) and
   companion flat copy for the sidecar bundle. Add `TokenPatternsDetector`
   with the regex set above. Wire `scan_outbound` into the `request` hook
   in `egress_addon.py`. Unit tests in
   `tests/unit/test_dlp_detectors.py`.
 3. **Known-secrets detector (Phase 1b).**
   Add `KnownSecretsDetector` to `dlp_detectors.py`. Collect
   `EGRESS_TOKEN_*` from env; derive encoded variants; scan request body.
   Extend unit tests. Wire into `scan_outbound`.
 4. **Naive prompt injection detector (Phase 2).**
   Add `NaiveInjectionDetector` to `dlp_detectors.py`. Wire
   `scan_inbound` into the new `response` hook in `egress_addon.py`.
   Extend unit tests. Activate PRD 0053 (`Status: Draft → Active`) in
   this commit.
 ## Open questions
 1. **Response body buffering:** mitmproxy's `response` hook already has
   the full body for non-streaming responses. For streamed responses
   (mitmproxy's `stream_large_bodies` option, or an addon setting
   `flow.response.stream`) the body is not buffered and is unavailable at
   hook time; chunked transfer encoding alone does not trigger this. Scope
   for now: log a warning and skip scanning on streamed responses; revisit
   if needed.
 2. **Encoding breadth for `known_secrets`:** Start with raw + base64 +
   URL-encoded + hex. Add GZIP / base32 if real-world evasion attempts
   appear.
 3. **`EGRESS_TOKEN_*` naming contract:** The detector relies on the
   env-var naming convention from `manifest_egress.py`. If that contract
   changes, the detector must be updated in lock-step.