feat(dlp): websocket scanning, response headers, extended encoding variants, sk-proj pattern (PRD 0053)

2026-06-06 17:59:36 +00:00
committed by didericis
parent 76e38b24e6
commit 1ecef55fea
6 changed files with 300 additions and 33 deletions
+44 -20
@@ -57,14 +57,15 @@ upstream attacker.
## Non-goals
- Scanning inbound response URLs or headers (inbound scan covers response
body only; response URL is the same as the outbound request URL and is
already scanned there).
- Structured query-param parsing (treating `?k=v` as key/value pairs for
per-param matching) — scanning the raw query string is sufficient.
- Raw UDP/DNS queries — these bypass the HTTP proxy entirely and require a
network-level DNS sinkhole (tracked separately in issue #205).
- Structured query-param parsing — scanning the raw query string is
sufficient.
- Changes to the `dlp` block schema or detector names.
- Scanning outbound request bodies for prompt injection (inbound only,
per PRD 0052 design).
- LLM-based semantic detection or entropy-based secret scanning (deferred,
per PRD 0052 non-goals).
## Design
@@ -123,24 +124,47 @@ The `Authorization` header is present in `flow.request.headers` at this
point (the strip happens below on line 115), so the auth-strip ordering
invariant is automatically preserved.
### Test additions
### `build_inbound_scan_text` in `egress_addon_core.py`
`tests/unit/test_egress_addon_core.py` gains:
An analogous helper assembles the inbound response corpus (all response
headers + body) for `scan_inbound`. The `response()` hook now passes this
combined text instead of the body alone, closing the response-header
injection vector.
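A minimal sketch of what such a helper might look like, assuming headers arrive as `(name, value)` pairs and the body is already decoded to text. The signature and the newline-joined `name: value` layout are assumptions made for illustration, not taken from `egress_addon_core.py`:

```python
from typing import Iterable, Tuple


def build_inbound_scan_text(
    headers: Iterable[Tuple[str, str]], body: str
) -> str:
    """Join every response header and the body into one scannable string.

    Hypothetical signature: the real helper may accept different types.
    """
    parts = [f"{name}: {value}" for name, value in headers]
    if body:  # an empty body contributes nothing to the corpus
        parts.append(body)
    return "\n".join(parts)
```

With this shape, a prompt-injection string placed in a response header lands in the same text that `scan_inbound` already inspects for the body.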
- `TestBuildOutboundScanText` — verifies hostname, path, query, headers, and
body each appear in the assembled text; checks that empty query and body
are omitted.
- `TestScanOutbound` — verifies `scan_outbound` blocks when a known token
pattern appears in each surface independently (hostname, path, query,
non-auth header, body), and returns `None` for a clean request.
### WebSocket frame scanning
A new `websocket_message` hook in `EgressAddon` scans every frame after the
HTTP 101 upgrade. Outbound frames (`from_client=True`) are scanned for
credential patterns and known secrets; inbound frames are scanned for prompt
injection. When a frame is blocked, the entire WebSocket connection is killed via
`flow.kill()`, because there is no HTTP response surface to write to after the upgrade.
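The direction-based dispatch can be sketched as follows. The `Fake*` classes are stand-ins that mirror only the mitmproxy attributes used here (`flow.websocket.messages`, `message.from_client`, `message.content`, `flow.kill()`), so the example runs without mitmproxy installed. The scanner signatures (text in, block reason or `None` out) and the class name `EgressAddonSketch` are assumptions:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FakeMessage:  # stand-in for a mitmproxy WebSocket message
    content: bytes
    from_client: bool


@dataclass
class FakeWebSocket:
    messages: List[FakeMessage] = field(default_factory=list)


@dataclass
class FakeFlow:  # stand-in for the upgraded HTTP flow
    websocket: FakeWebSocket
    killed: bool = False

    def kill(self) -> None:
        self.killed = True


# Assumed scanner shape: returns a block reason, or None for clean text.
Scanner = Callable[[str], Optional[str]]


class EgressAddonSketch:
    def __init__(self, scan_outbound: Scanner, scan_inbound: Scanner) -> None:
        self.scan_outbound = scan_outbound
        self.scan_inbound = scan_inbound

    def websocket_message(self, flow) -> None:
        message = flow.websocket.messages[-1]  # the frame that fired the hook
        text = message.content.decode("utf-8", errors="replace")
        # Outbound frames get the secret scan, inbound frames the injection scan.
        scan = self.scan_outbound if message.from_client else self.scan_inbound
        if scan(text) is not None:
            flow.kill()  # no HTTP response to rewrite after the 101 upgrade
```

Decoding with `errors="replace"` is a choice made for this sketch so that binary frames are still scanned rather than raising.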
### Extended encoding variants in `_encoded_variants`
`_encoded_variants` gains the six encoding forms listed below, in addition to the 4 it already produced:
| Added encoding | Rationale |
|---|---|
| Standard base64 without padding | Common in log lines where `=` is stripped |
| URL-safe base64 with padding | JWT / OAuth standard alphabet |
| URL-safe base64 without padding | Same, padding stripped |
| Hex uppercase | Complements existing hex-lowercase variant |
| Base32 | TOTP seeds; some DNS-exfil channels use base32 subdomains |
| gzip + base64 | Recognisable by `H4sI` prefix; naive compression before encode |
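For illustration, the six added forms can be produced with the standard library alone. This is a sketch, not the body of `_encoded_variants`: the function name `added_encoded_variants`, the return order, and the use of `mtime=0` for reproducible gzip output are all assumptions.

```python
import base64
import gzip
from typing import List


def added_encoded_variants(secret: str) -> List[str]:
    """Return the six added encodings of `secret`, in table order (hypothetical helper)."""
    raw = secret.encode("utf-8")
    std = base64.b64encode(raw).decode("ascii")
    url = base64.urlsafe_b64encode(raw).decode("ascii")
    gz = gzip.compress(raw, mtime=0)  # fixed mtime keeps the output deterministic
    return [
        std.rstrip("="),                        # standard base64, padding stripped
        url,                                    # URL-safe base64, padded
        url.rstrip("="),                        # URL-safe base64, padding stripped
        raw.hex().upper(),                      # hex uppercase
        base64.b32encode(raw).decode("ascii"),  # base32
        base64.b64encode(gz).decode("ascii"),   # gzip + base64
    ]
```

The gzip header always begins with the bytes `1f 8b 08`, which is why the last variant starts with `H4sI` whatever the secret is.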
### OpenAI project key pattern
`TOKEN_PATTERNS` gains `sk-proj-[A-Za-z0-9_\-]{48,}` covering OpenAI's
newer project-scoped API key format.
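A quick check of how the pattern behaves at the length boundary, as a sketch. The keys are synthetic strings built for the example, and the names `SK_PROJ` and `contains_project_key` are hypothetical rather than part of `TOKEN_PATTERNS`:

```python
import re

# Same expression as the new TOKEN_PATTERNS entry.
SK_PROJ = re.compile(r"sk-proj-[A-Za-z0-9_\-]{48,}")


def contains_project_key(text: str) -> bool:
    """True when `text` contains something shaped like a project-scoped key."""
    return SK_PROJ.search(text) is not None


fake_key = "sk-proj-" + "A" * 48       # synthetic, exactly at the minimum length
too_short = "sk-proj-" + "A" * 47      # one character under the minimum
```

Here `contains_project_key(f"Authorization: Bearer {fake_key}")` matches, while `too_short` does not, because the `{48,}` quantifier requires at least 48 characters after the prefix.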
## Implementation
Single commit:
Delivered across three commits on the same branch:
1. Add `build_outbound_scan_text` to `egress_addon_core.py` and its
`__all__`.
2. Update `egress_addon.py` to import and call it.
3. Add `TestBuildOutboundScanText` and `TestScanOutbound` to
`tests/unit/test_egress_addon_core.py`.
4. Flip this PRD `Status: Draft → Active`.
1. **Outbound scan surfaces**: `build_outbound_scan_text`, `egress_addon.py`
`request()` rewrite, `TestBuildOutboundScanText`, `TestScanOutbound`.
2. **Remaining gaps**: extended `_encoded_variants`, `sk-proj-` pattern,
`build_inbound_scan_text`, response-header scanning, `websocket_message`
hook, and matching unit tests.
3. **PRD flip**: `Status: Draft → Active` (committed with the first
implementation commit; updated here to reflect final scope).