feat(dlp): websocket scanning, response headers, extended encoding variants, sk-proj pattern (PRD 0053)

2026-06-06 17:59:36 +00:00
parent 76e38b24e6
commit 1ecef55fea
6 changed files with 300 additions and 33 deletions
@@ -57,14 +57,15 @@ upstream attacker.

 ## Non-goals

- Scanning inbound response URLs or headers (inbound scan covers response
-  body only; response URL is the same as the outbound request URL and is
-  already scanned there).
- Structured query-param parsing (treating `?k=v` as key/value pairs for
-  per-param matching) — scanning the raw query string is sufficient.
+- Raw UDP/DNS queries — these bypass the HTTP proxy entirely and require a
+  network-level DNS sinkhole (tracked separately in issue #205).
+- Structured query-param parsing — scanning the raw query string is
+  sufficient.
 - Changes to the `dlp` block schema or detector names.
 - Scanning outbound request bodies for prompt injection (inbound only,
  per PRD 0052 design).
+- LLM-based semantic detection or entropy-based secret scanning (deferred,
+  per PRD 0052 non-goals).

 ## Design

@@ -123,24 +124,47 @@ The `Authorization` header is present in `flow.request.headers` at this
 point (the strip happens below on line 115), so the auth-strip ordering
 invariant is automatically preserved.

-### Test additions
+### `build_inbound_scan_text` in `egress_addon_core.py`

-`tests/unit/test_egress_addon_core.py` gains:
+An analogous helper assembles the inbound response corpus (all response
+headers + body) for `scan_inbound`. The `response()` hook now passes this
+combined text instead of the body alone, closing the response-header
+injection vector.
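
As a rough illustration of the helper described above, the sketch below joins response headers and body into a single string. The signature, the `name: value` header rendering, and the newline separator are assumptions made for this example; the real `build_inbound_scan_text` may differ.

```python
from typing import Iterable, Tuple


def build_inbound_scan_text(
    headers: Iterable[Tuple[str, str]], body: str
) -> str:
    """Hypothetical sketch: join every response header and the body."""
    # Render each header as "name: value" so a payload hidden in a header
    # value ends up in the same corpus as the body.
    parts = [f"{name}: {value}" for name, value in headers]
    if body:  # skip an empty body so the corpus has no empty trailing entry
        parts.append(body)
    return "\n".join(parts)
```

With this shape, the `response()` hook would hand the returned string to `scan_inbound` in place of the body alone.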

- `TestBuildOutboundScanText` — verifies hostname, path, query, headers, and
-  body each appear in the assembled text; checks that empty query and body
-  are omitted.
- `TestScanOutbound` — verifies `scan_outbound` blocks when a known token
-  pattern appears in each surface independently (hostname, path, query,
-  non-auth header, body), and returns `None` for a clean request.
+### WebSocket frame scanning
+
+A new `websocket_message` hook in `EgressAddon` scans every frame after the
+HTTP 101 upgrade. Outbound frames (`from_client=True`) are scanned for
+credential patterns and known secrets; inbound frames are scanned for prompt
+injection. On a block, the entire WebSocket connection is killed via
+`flow.kill()`, since there is no HTTP response surface to write to after
+the upgrade.
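
The direction-based dispatch can be sketched as follows. `Message` and `Flow` are stand-in types written for this example so that it runs without mitmproxy (in mitmproxy itself the newest frame is typically read from `flow.websocket.messages[-1]`), and the two scanner callables are hypothetical placeholders for the real outbound and inbound scanners.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Message:  # stand-in for a proxied WebSocket frame
    content: bytes
    from_client: bool


@dataclass
class Flow:  # stand-in for the flow object passed to the hook
    messages: List[Message] = field(default_factory=list)
    killed: bool = False

    def kill(self) -> None:
        self.killed = True


# A scanner returns a block reason, or None when the text is clean.
Scanner = Callable[[str], Optional[str]]


def handle_websocket_message(
    flow: Flow, scan_outbound: Scanner, scan_inbound: Scanner
) -> Optional[str]:
    """Scan the newest frame and kill the connection on a block."""
    message = flow.messages[-1]
    text = message.content.decode("utf-8", errors="replace")
    # Client frames are checked for secrets, server frames for injection.
    scan = scan_outbound if message.from_client else scan_inbound
    reason = scan(text)
    if reason is not None:
        flow.kill()  # no HTTP response can be written after the upgrade
    return reason
```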
+
+### Extended encoding variants in `_encoded_variants`
+
+`_encoded_variants` gains the six encoding forms listed below:
+
+| Added encoding | Rationale |
+|---|---|
+| Standard base64 without padding | Common in log lines where `=` is stripped |
+| URL-safe base64 with padding | URL-safe alphabet (`-` and `_`) used by JWT / OAuth |
+| URL-safe base64 without padding | Same alphabet with padding stripped, as in JWT segments |
+| Hex uppercase | Complements existing hex-lowercase variant |
+| Base32 | TOTP seeds; some DNS-exfil channels use base32 subdomains |
+| gzip + base64 | Recognisable by the `H4sI` prefix; catches naive compress-then-encode |
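
A minimal sketch of how the six added forms in the table could be derived from a known secret using only the standard library. The function name and dictionary keys are illustrative and are not taken from `_encoded_variants` itself.

```python
import base64
import gzip
from typing import Dict


def added_variants(secret: str) -> Dict[str, str]:
    """Hypothetical sketch: the six added encodings of one secret."""
    raw = secret.encode("utf-8")
    std = base64.b64encode(raw).decode("ascii")
    url = base64.urlsafe_b64encode(raw).decode("ascii")
    # mtime=0 keeps the gzip header deterministic across runs.
    gz = gzip.compress(raw, mtime=0)
    return {
        "b64_nopad": std.rstrip("="),
        "b64url": url,
        "b64url_nopad": url.rstrip("="),
        "hex_upper": raw.hex().upper(),
        "base32": base64.b32encode(raw).decode("ascii"),
        "gzip_b64": base64.b64encode(gz).decode("ascii"),
    }
```

The gzip form only matches output produced with the same header fields and compression settings, so it is best read as a catch for the naive case rather than a general decompression scan.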
+
+### OpenAI project key pattern
+
+`TOKEN_PATTERNS` gains `sk-proj-[A-Za-z0-9_\-]{48,}` covering OpenAI's
+newer project-scoped API key format.
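
The pattern can be exercised in isolation as shown below. The regex is copied from the text above; the helper name and the synthetic test strings are made up for this example, and how `TOKEN_PATTERNS` stores or compiles its entries is not shown in this section.

```python
import re

# Regex copied from the PRD text; compiled standalone for illustration.
SK_PROJ = re.compile(r"sk-proj-[A-Za-z0-9_\-]{48,}")


def contains_project_key(text: str) -> bool:
    """Return True when the text holds an sk-proj- style key."""
    return SK_PROJ.search(text) is not None
```

A synthetic value such as `"sk-proj-" + "a" * 48` matches, while a 47-character suffix falls below the `{48,}` minimum and does not.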

 ## Implementation

-Single commit:
+Delivered across three commits on the same branch:

-1. Add `build_outbound_scan_text` to `egress_addon_core.py` and its
-   `__all__`.
-2. Update `egress_addon.py` to import and call it.
-3. Add `TestBuildOutboundScanText` and `TestScanOutbound` to
-   `tests/unit/test_egress_addon_core.py`.
-4. Flip this PRD `Status: Draft → Active`.
+1. **Outbound scan surfaces** — `build_outbound_scan_text`, `egress_addon.py`
+   `request()` rewrite, `TestBuildOutboundScanText`, `TestScanOutbound`.
+2. **Remaining gaps** — extended `_encoded_variants`, `sk-proj-` pattern,
+   `build_inbound_scan_text`, response-header scanning, `websocket_message`
+   hook, and matching unit tests.
+3. **PRD flip** — `Status: Draft → Active` (committed with the first
+   implementation commit; updated here to reflect final scope).