From 9f3991164c874b48b55cc3c0f8a34b2aaa08cb1e Mon Sep 17 00:00:00 2001
From: claude
Date: Sat, 6 Jun 2026 17:40:58 +0000
Subject: [PATCH] docs(prd): PRD 0053 extended outbound DLP scan surfaces

---
 docs/prds/0053-extended-outbound-scan.md | 146 +++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 docs/prds/0053-extended-outbound-scan.md

diff --git a/docs/prds/0053-extended-outbound-scan.md b/docs/prds/0053-extended-outbound-scan.md
new file mode 100644
index 0000000..2cbcf4c
--- /dev/null
+++ b/docs/prds/0053-extended-outbound-scan.md
@@ -0,0 +1,146 @@
+# PRD 0053: Extended outbound DLP scan surfaces
+
+- **Status:** Draft
+- **Author:** claude
+- **Created:** 2026-06-06
+- **Issue:** #204
+
+## Summary
+
+The outbound DLP scan introduced in PRD 0052 covers only the request body and
+the `Authorization` header. An agent can exfiltrate a provisioned secret or
+credential token via several other request surfaces that are currently
+unscanned: arbitrary HTTP headers, URL query parameters, URL path segments,
+and DNS-level hostnames (DNS tunnelling). This PRD closes those gaps by
+extending the scan corpus to cover all four surfaces.
+
+## Problem
+
+After PRD 0052, `egress_addon.py` builds the DLP scan text as:
+
+```python
+auth_header = flow.request.headers.get("authorization", "")
+scan_text = body
+if auth_header:
+    scan_text = auth_header + "\n" + body
+```
+
+This leaves the following attack surfaces unscanned:
+
+| Surface | Example exfiltration vector |
+|---|---|
+| Other request headers | `X-Api-Key: `, `Cookie: token=` |
+| Query parameters | `GET /search?api_key=` |
+| URL path segments | `GET /proxy//endpoint` |
+| DNS hostname | `.attacker.com` (DNS tunnelling) |
+
+Each surface allows a determined agent to bypass the current scan and leak
+a provisioned `EGRESS_TOKEN_*` value or a known token pattern to an
+upstream attacker.
+
+## Goals / Success Criteria
+
+1. All four surfaces (headers, query params, path, hostname) are included in
+   the outbound DLP scan text for every route that has outbound scanning
+   enabled.
+2. A pure helper `build_outbound_scan_text(host, path, query, headers, body)`
+   in `egress_addon_core.py` assembles the scan corpus so the logic is fully
+   unit-testable without a mitmproxy dependency.
+3. Unit tests demonstrate that `scan_outbound` blocks a request when a known
+   token pattern or provisioned secret appears in each surface independently.
+4. No manifest schema changes: the `dlp` block's `outbound_detectors`
+   field continues to control which detectors run; all surfaces are scanned
+   by whichever detectors are active.
+5. The auth-strip ordering invariant from PRD 0052 is preserved: the
+   outbound scan sees the original `Authorization` header before the addon
+   strips it.
+
+## Non-goals
+
+- Scanning inbound response URLs or headers (inbound scan covers response
+  body only; response URL is the same as the outbound request URL and is
+  already scanned there).
+- Structured query-param parsing (treating `?k=v` as key/value pairs for
+  per-param matching); scanning the raw query string is sufficient.
+- Changes to the `dlp` block schema or detector names.
+- Scanning outbound request bodies for prompt injection (inbound only,
+  per PRD 0052 design).
+
+## Design
+
+### `build_outbound_scan_text` in `egress_addon_core.py`
+
+A new pure function assembles all request surfaces into a single
+newline-delimited string suitable for passing to `scan_outbound`:
+
+```python
+def build_outbound_scan_text(
+    host: str,
+    path: str,
+    query: str,
+    headers: typing.Mapping[str, str],
+    body: str,
+) -> str:
+    parts: list[str] = [host, path]
+    if query:
+        parts.append(query)
+    for name, value in headers.items():
+        parts.append(f"{name}: {value}")
+    if body:
+        parts.append(body)
+    return "\n".join(parts)
+```
+
+**Why hostname in the scan corpus?**
+DNS tunnelling encodes data into subdomain labels
+(`.attacker.com`). The mitmproxy `request` hook sees the
+`pretty_host` field before the TCP connection is fully established, so
+scanning it catches this vector. Both the `token_patterns` and
+`known_secrets` detectors handle encoded variants (raw, base64, URL-encoded,
+hex), so the existing encoding-variant logic in `_encoded_variants` already
+covers common DNS-tunnelling encodings.
+
+### `egress_addon.py` update
+
+The narrow scan-text construction is replaced with a call to
+`build_outbound_scan_text`, passing the path and query that the addon has
+already split from `flow.request.path` at the top of `request()`:
+
+```python
+# Build full scan corpus: hostname + path + query + all headers + body
+body = flow.request.get_text(strict=False) or ""
+scan_text = build_outbound_scan_text(
+    flow.request.pretty_host,
+    request_path,
+    query,
+    dict(flow.request.headers),
+    body,
+)
+dlp_result = scan_outbound(route, scan_text, os.environ)
+```
+
+The `Authorization` header is present in `flow.request.headers` at this
+point (the strip happens below on line 115), so the auth-strip ordering
+invariant is automatically preserved.
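The hostname rationale in the Design section relies on matching encoded forms of a secret anywhere in the assembled corpus. The sketch below illustrates that idea only. `encoded_variants` and `corpus_contains_secret` are hypothetical stand-ins written for this example, not the project's `_encoded_variants` or `scan_outbound`, and the sample secret and `attacker.example` hostname are made up.

```python
import base64
import binascii
import urllib.parse


def encoded_variants(secret: str) -> set[str]:
    """Hypothetical stand-in: raw, base64, URL-encoded, and hex forms."""
    raw = secret.encode("utf-8")
    return {
        secret,
        base64.b64encode(raw).decode("ascii"),
        urllib.parse.quote(secret, safe=""),
        binascii.hexlify(raw).decode("ascii"),
    }


def corpus_contains_secret(scan_text: str, secret: str) -> bool:
    """Return True if any encoded variant of `secret` occurs in the corpus."""
    return any(variant in scan_text for variant in encoded_variants(secret))


# Made-up secret, hex-encoded into a subdomain label (DNS-tunnelling style).
secret = "s3cr3t-value"
hex_label = binascii.hexlify(secret.encode("utf-8")).decode("ascii")

# Corpora laid out as host, path, then one header per line.
corpus = "\n".join([f"{hex_label}.attacker.example", "/ping", "Accept: */*"])
clean = "\n".join(["api.example.com", "/ping", "Accept: */*"])
```

Because the corpus is a single string, one substring check covers the hostname alongside every other surface, which is why no per-surface detector logic is needed in this sketch.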
+
+### Test additions
+
+`tests/unit/test_egress_addon_core.py` gains:
+
+- `TestBuildOutboundScanText`: verifies hostname, path, query, headers, and
+  body each appear in the assembled text; checks that empty query and body
+  are omitted.
+- `TestScanOutbound`: verifies `scan_outbound` blocks when a known token
+  pattern appears in each surface independently (hostname, path, query,
+  non-auth header, body), and returns `None` for a clean request.
+
+## Implementation
+
+Single commit:
+
+1. Add `build_outbound_scan_text` to `egress_addon_core.py` and its
+   `__all__`.
+2. Update `egress_addon.py` to import and call it.
+3. Add `TestBuildOutboundScanText` and `TestScanOutbound` to
+   `tests/unit/test_egress_addon_core.py`.
+4. Flip this PRD `Status: Draft → Active`.
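As a rough illustration of what `TestBuildOutboundScanText` might check, here is a minimal sketch. It assumes the standard-library `unittest` runner (the PRD does not say which test framework the repository uses) and copies the helper from the Design section so the example is self-contained. The test method names and sample values are invented for this example.

```python
import typing
import unittest


def build_outbound_scan_text(
    host: str,
    path: str,
    query: str,
    headers: typing.Mapping[str, str],
    body: str,
) -> str:
    # Copied from the Design section of this PRD for self-containment.
    parts: list[str] = [host, path]
    if query:
        parts.append(query)
    for name, value in headers.items():
        parts.append(f"{name}: {value}")
    if body:
        parts.append(body)
    return "\n".join(parts)


class TestBuildOutboundScanText(unittest.TestCase):
    def test_all_surfaces_present_in_order(self):
        text = build_outbound_scan_text(
            "api.example.com", "/v1/items", "q=1", {"X-Trace": "abc"}, "payload"
        )
        # One line per surface: host, path, query, each header, body.
        self.assertEqual(
            text.split("\n"),
            ["api.example.com", "/v1/items", "q=1", "X-Trace: abc", "payload"],
        )

    def test_empty_query_and_body_omitted(self):
        text = build_outbound_scan_text("api.example.com", "/", "", {}, "")
        # Only host and path remain when query, headers, and body are empty.
        self.assertEqual(text, "api.example.com\n/")
```

Saved as a test module, this can be run with `python -m unittest`. The real test file would import the helper from `egress_addon_core` instead of redefining it.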