From 5265e25f9bbd39019d373ebb4e3a30877abc26ce Mon Sep 17 00:00:00 2001
From: claude
Date: Fri, 5 Jun 2026 00:52:57 +0000
Subject: [PATCH] docs: address PR #196 review; update research decisions and PRD
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Research doc: close open questions with decisions from review: hard
cutover on path_allowlist, drop glob (regex sufficient), stick with
Gateway API OR semantics for headers, case-insensitive method names.

PRD 0053: adopt Gateway API HTTPRoute match vocabulary (paths, methods,
headers) as the route schema replacement for path_allowlist. Add
MatchEntry / PathMatch / HeaderMatch types to EgressRoute design; cite
the route matching research doc; fold match restructure into chunk 1
alongside the dlp block.
---
 docs/prds/0053-egress-dlp-addon.md           | 182 ++++++++++++++++---
 docs/research/yaml-route-matching-formats.md |  51 +++---
 2 files changed, 179 insertions(+), 54 deletions(-)

diff --git a/docs/prds/0053-egress-dlp-addon.md b/docs/prds/0053-egress-dlp-addon.md
index 8112700..8718d63 100644
--- a/docs/prds/0053-egress-dlp-addon.md
+++ b/docs/prds/0053-egress-dlp-addon.md
@@ -11,12 +11,19 @@ With pipelock removed (PR #193), the egress proxy no longer performs DLP
 scanning on traffic to or from the agent. This PRD implements a replacement
 directly inside the mitmproxy egress addon: per-route DLP detectors that
 scan outbound requests for credential leakage and inbound responses for
-prompt injection attempts. Configuration is expressed as a new `dlp` block
-on each `egress.routes` entry in the bottle manifest.
+prompt injection attempts.
 
-The design follows the recommendation in the [DLP research document
-(PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192) and
-covers all three remaining implementation phases from that plan:
+The manifest route schema is also upgraded in this PRD from the flat
+`path_allowlist` field to a structured `matches` block modelled on the
+[Kubernetes Gateway API `HTTPRoute`](https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.HTTPRouteMatch)
+match vocabulary. This upgrade is a hard cutover, with no compatibility shim
+for the old format. The rationale and format survey are in the
+[YAML route matching formats research doc](https://gitea.dideric.is/didericis/bot-bottle/src/branch/main/docs/research/yaml-route-matching-formats.md).
+DLP detectors attach to the new `matches`-based routes directly.
+
+The design follows the recommendation in the
+[DLP research document (PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192)
+and covers all three remaining implementation phases from that plan:
 
 1. Token pattern detection (Phase 1a)
 2. Known-secrets detection (Phase 1b)
@@ -31,6 +38,11 @@ proxy with no DLP capability at all. The egress addon already holds per-route
 logic for path allowlisting and credential injection; DLP rules belong in the
 same place.
 
+The existing `path_allowlist` field is also limiting: it only supports path
+prefixes, with no way to express exact-path, regex, method, or header
+constraints. The Gateway API match vocabulary is a well-specified, widely
+deployed standard that covers all of these without inventing new syntax.
+
 ## Goals / Success Criteria
 
 1. Outbound request bodies and headers are scanned for known token patterns
@@ -46,8 +58,13 @@ same place.
    `dlp` block in the manifest.
 5. All detector logic lives in `egress_addon_core.py` (pure Python, no
    mitmproxy dependency) and is covered by unit tests on the host.
-6. Adding `dlp` configuration to a route that omits it entirely is
-   backward-compatible; the route behaves as if all detectors are enabled.
+6. Each route's `matches` block supports path (exact/prefix/regex), HTTP
+   method, and header predicates using Gateway API match semantics.
+7. The manifest change is a hard cutover: `path_allowlist` is removed with
+   no fallback, no deprecation alias, and no loud exception for old-format
+   manifests. Old manifests that use `path_allowlist` will fail validation
+   at load time with an unknown-key error (same as any other unrecognised
+   key today).
 
 ## Non-goals
 
@@ -60,12 +77,96 @@ same place.
   exfil relevant to agent containment.
 - Changes to the cred-proxy sidecar.
 - Streaming response scanning (scan buffered response body only).
+- Glob-style path matching: regex covers every case glob would handle
+  without adding a third path-matching language.
 
 ## Design
 
+### Route matching: Gateway API `matches` vocabulary
+
+The existing `path_allowlist` field is replaced by a `matches` list. The
+vocabulary mirrors Kubernetes Gateway API `HTTPRouteMatch` (see the
+[route matching research doc](https://gitea.dideric.is/didericis/bot-bottle/src/branch/main/docs/research/yaml-route-matching-formats.md)
+for a full format survey and rationale). Gateway API was chosen because it
+is spec-backed, implementation-tested across multiple proxies, and its
+`{type, value}` pattern is consistent and schema-validatable.
+
+**AND/OR semantics** (same as Gateway API):
+- Predicates *within* a single `matches` entry are ANDed.
+- Multiple entries in the `matches` list are ORed: the route matches if
+  any entry matches.
+
+```yaml
+egress:
+  routes:
+    # Bare route: all traffic to this host is forwarded (no path/method/header
+    # constraints). Equivalent to the old path_allowlist-omitted case.
+    - host: api.anthropic.com
+      auth:
+        scheme: Bearer
+        token_ref: EGRESS_TOKEN_0
+
+    # Two match entries (OR): GET/HEAD on /packages/** OR POST on /upload
+    - host: files.pythonhosted.org
+      matches:
+        - paths:
+            - type: prefix
+              value: /packages/
+          methods: [GET, HEAD]
+        - paths:
+            - type: exact
+              value: /upload
+          methods: [POST]
+      dlp:
+        inbound_detectors: false   # skip response scanning (binary downloads)
+
+    # Header + regex path: only JSON API requests on versioned endpoints
+    - host: internal-api.corp
+      matches:
+        - paths:
+            - type: regex
+              value: "^/v[0-9]+/"
+          headers:
+            - name: Content-Type
+              type: exact
+              value: application/json
+      dlp:
+        outbound_detectors: false
+        inbound_detectors: false
+```
+
+#### Path matching types
+
+| `type` | Semantics |
+|--------|-----------|
+| `exact` | Full path must equal `value` exactly |
+| `prefix` | Path must start with `value` at a segment boundary (value `/api/v1` matches `/api/v1` and `/api/v1/foo`, rejects `/api/v10`) |
+| `regex` | RE2 regex; rejected at load time if pattern fails to compile. Use for wildcard needs: `/api/[^/]+/data` instead of glob |
+
+`type` defaults to `prefix` when omitted (preserves the semantics of the
+old `path_allowlist`).
+
+#### Method matching
+
+`methods` is a list of HTTP method names, case-insensitive at parse time:
+`get`, `GET`, and `Get` are all accepted and stored as uppercase internally.
+An absent or empty `methods` list means all methods are permitted.
+
+#### Header matching
+
+`headers` is a list of `{name, value, type}` objects. ALL listed headers
+must match (AND semantics). To OR on header values, use multiple `matches`
+entries.
+
+| `type` | Semantics |
+|--------|-----------|
+| `exact` | Header value equals `value` (default when `type` omitted) |
+| `regex` | Header value matches RE2 regex |
+
 ### Manifest schema: `dlp` block
 
-Each `egress.routes` entry gains an optional `dlp` key:
+Each `egress.routes` entry gains an optional `dlp` key alongside `matches`
+and `auth`:
 
 ```yaml
 egress:
@@ -100,13 +201,34 @@ rejects unknown detector names.
 
 ### `EgressRoute` changes
 
-`EgressRoute` gains two new fields:
+`EgressRoute` replaces `PathAllowlist` with `Matches` and gains two new
+DLP fields. `MatchEntry` captures one AND-predicate block:
 
 ```python
+@dataclass(frozen=True)
+class PathMatch:
+    type: str   # "exact" | "prefix" | "regex"
+    value: str
+
+
+@dataclass(frozen=True)
+class HeaderMatch:
+    name: str
+    value: str
+    type: str = "exact"   # "exact" | "regex"
+
+
+@dataclass(frozen=True)
+class MatchEntry:
+    paths: tuple[PathMatch, ...] = ()       # empty = match any path
+    methods: tuple[str, ...] = ()           # empty = match any method (uppercase)
+    headers: tuple[HeaderMatch, ...] = ()   # empty = match any headers
+
+
 @dataclass(frozen=True)
 class EgressRoute:
     Host: str
-    PathAllowlist: tuple[str, ...] = ()
+    Matches: tuple[MatchEntry, ...] = ()   # empty = match all requests
     AuthScheme: str = ""
     TokenRef: str = ""
     Role: tuple[str, ...] = ()
@@ -114,28 +236,30 @@ class EgressRoute:
     InboundDetectors: tuple[str, ...] | None = None   # None = all enabled
 ```
 
-`None` means "use defaults" (all active); an empty `tuple[str, ...]` means
-"disabled". Named detectors use `tuple[str, ...]` with the detector name.
-
-`manifest_egress.py` uses `from_dict` to parse the new `dlp` block and
-populate these fields; unknown keys inside `dlp` are rejected.
+`manifest_egress.py`'s `from_dict` parses the new `matches` block and `dlp`
+block; `path_allowlist` is no longer a recognised key and will be rejected
+by the unknown-key check.
 
 ### `Route` changes in `egress_addon_core.py`
 
-The addon-side `Route` dataclass mirrors the manifest-side change:
+The addon-side `Route` and its helper types mirror the manifest-side changes.
+`match_route` is extended to evaluate the `Matches` list:
 
 ```python
 @dataclass(frozen=True)
 class Route:
     host: str
-    path_allowlist: tuple[str, ...] = ()
+    matches: tuple[MatchEntry, ...] = ()
     auth_scheme: str = ""
     token_env: str = ""
     outbound_detectors: tuple[str, ...] | None = None
     inbound_detectors: tuple[str, ...] | None = None
 ```
 
-`parse_routes` / `_parse_one` grow the corresponding parsing logic.
+`decide()` feeds through `match_route` (unchanged host lookup) then
+evaluates the match entries in order; if the route has no `matches` entries
+all requests pass. Path `prefix` type uses segment-boundary checking
+(`/api/v1` matches `/api/v1/foo` but not `/api/v10`).
 
 ### Detector interface
 
@@ -212,7 +336,7 @@ Pattern-based inbound response scanner. Uses two tiers:
 - Single jailbreak keyword without additional context.
 - Common documentation phrases.
 
-See the research doc for the full phrase lists and pseudocode.
+See the DLP research doc for the full phrase lists and pseudocode.
 
 ### Wiring into `egress_addon.py`
 
@@ -220,7 +344,7 @@ Two new mitmproxy hooks are added alongside the existing `request` hook:
 
 ```python
 def request(self, flow: http.HTTPFlow) -> None:
-    # ... existing path-allowlist + auth-injection logic ...
+    # ... existing match + auth-injection logic ...
     # After route decision, if action == "forward":
     result = scan_outbound(route, flow.request, os.environ)
     if result and result.severity == "block":
@@ -250,20 +374,20 @@ afterward, preserving the existing credential-injection security model.
 
 ## Implementation chunks
 
-1. **Manifest `dlp` block + `EgressRoute` fields.**
-   `manifest_egress.py`: parse `dlp`, add `OutboundDetectors` /
-   `InboundDetectors` to `EgressRoute`. Extend
-   `tests/unit/test_manifest_egress.py` with `dlp` valid/invalid cases.
-   `egress_addon_core.py`: add `outbound_detectors` / `inbound_detectors`
-   to `Route`; update `_parse_one` and `parse_routes`; extend
-   `tests/unit/test_egress_addon_core.py`.
+1. **New `matches` block + `EgressRoute` / `Route` restructure.**
+   Remove `path_allowlist` from `manifest_egress.py` and `egress_addon_core.py`.
+   Add `MatchEntry`, `PathMatch`, `HeaderMatch` types. Parse `matches` in
+   `EgressRoute.from_dict` and `_parse_one`; unknown-key rejection handles
+   old `path_allowlist` manifests. Add `OutboundDetectors` / `InboundDetectors`
+   to `EgressRoute` and `Route`; parse `dlp` block. Extend
+   `tests/unit/test_manifest_egress.py` and `tests/unit/test_egress_addon_core.py`
+   with match and dlp valid/invalid cases.
 
 2. **Token-patterns detector (Phase 1a).**
    New module `bot_bottle/dlp_detectors.py` (host-importable) and
    companion flat copy for the sidecar bundle. Add `TokenPatternsDetector`
    with the regex set above. Wire `scan_outbound` into the `request` hook
-   in `egress_addon.py`. Unit tests in
-   `tests/unit/test_dlp_detectors.py`.
+   in `egress_addon.py`. Unit tests in `tests/unit/test_dlp_detectors.py`.
 
 3. **Known-secrets detector (Phase 1b).**
    Add `KnownSecretsDetector` to `dlp_detectors.py`. Collect
diff --git a/docs/research/yaml-route-matching-formats.md b/docs/research/yaml-route-matching-formats.md
index 2af2bb7..17b8ee3 100644
--- a/docs/research/yaml-route-matching-formats.md
+++ b/docs/research/yaml-route-matching-formats.md
@@ -407,23 +407,20 @@ egress:
 All predicates within `match` are ANDed. A list of `paths` entries is ORed
 (first match wins, same as the current `path_allowlist` semantics).
 
-### 2. Path type enum (`exact` | `prefix` | `glob` | `regex`)
+### 2. Path type enum (`exact` | `prefix` | `regex`)
 
-Use four named types rather than inferring from the value's syntax. This
+Use three named types rather than inferring from the value's syntax. This
 avoids the ambiguity that plagues `.gitignore` and `nginx location` patterns
 where the same string can mean different things depending on leading characters.
 
 - `prefix`: mirrors current `path_allowlist` semantics.
-- `glob`: adopts ALB-style `*` (single segment) and `**` (multi-segment,
-  covering `/api/*/data` and `/api/**/data`). Simpler for operators than
-  writing regex.
-- `regex`: RE2 for advanced cases. Reject at load time if the pattern fails
-  to compile.
+- `regex`: RE2 for wildcard and advanced cases. Reject at load time if the
+  pattern fails to compile. Covers every case glob would handle:
+  `/api/[^/]+/data` is the `/api/*/data` equivalent.
 
-**Glob semantics decision needed:** ALB's `*` matches across `/`; most
-shell-glob conventions treat `*` as intra-segment and `**` as cross-segment.
-The shell convention (`*` = no slash, `**` = slash-crossing) is less
-surprising to operators and avoids accidental over-matching.
+Glob-style syntax is not included: it adds a third path-matching language
+on top of prefix and regex without meaningful operator benefit, since regex
+is already required for any non-trivial wildcard.
 
 ### 3. Header matching as a list of `{name, value, type}` objects
 
@@ -469,18 +466,22 @@ predicates.
 
 ---
 
-## Open questions
+## Decisions
 
-1. **Backward compatibility:** `path_allowlist` is the current field. If
-   adopting a `match`/`matches` structure, keep `path_allowlist` as a
-   deprecated alias? Or treat this as a breaking manifest version bump?
-2. **Glob segment semantics:** adopt shell convention (`*` = intra-segment,
-   `**` = cross-segment) or ALB convention (`*` = anything including `/`)?
-   The shell convention is safer; ALB's is simpler.
-3. **Header value OR:** Gateway API requires a separate match entry to OR
-   header values. ALB allows multiple values in one condition. Which is
-   less surprising for bot-bottle operators? The ALB approach is more
-   concise for the common case (e.g., `Content-Type: [application/json,
-   application/x-www-form-urlencoded]`).
-4. **Case sensitivity on method names:** normalize to uppercase at parse
-   time (fail on unrecognised values) or case-insensitively?
+The open questions raised during research were resolved in PR #196 review:
+
+1. **Backward compatibility:** Hard cutover. The new `matches` structure
+   replaces `path_allowlist` entirely with no compatibility shim and no
+   fallback parsing for the old format. Manifests using `path_allowlist`
+   must be migrated.
+
+2. **Glob support:** Dropped. Not strictly necessary: `regex` covers every
+   case glob would handle. Fewer path-matching languages to document and
+   validate.
+
+3. **Header value OR:** Stick with Gateway API. OR across header values
+   requires a separate entry in the `matches` list, not multiple values
+   inside one `headers` block.
+
+4. **Method name case:** Case-insensitive at parse time. `get`, `GET`, and
+   `Get` are all accepted and normalised to uppercase internally.
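
Reviewer note: the matching rules settled in this patch (segment-boundary `prefix`, case-insensitive methods, AND within one `matches` entry, OR across entries, empty `matches` passes everything) can be sketched in Python as below. This is only an illustration under stated assumptions, not the addon's actual code. The dataclass shapes follow the PRD; the helper names `path_matches`, `header_matches`, `entry_matches`, and `route_matches` are hypothetical. Python's `re` stands in for RE2, and the unanchored `re.search` and the case-insensitive header-name lookup are assumptions the patch does not specify.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PathMatch:
    type: str   # "exact" | "prefix" | "regex"
    value: str


@dataclass(frozen=True)
class HeaderMatch:
    name: str
    value: str
    type: str = "exact"   # "exact" | "regex"


@dataclass(frozen=True)
class MatchEntry:
    paths: tuple[PathMatch, ...] = ()      # empty = match any path
    methods: tuple[str, ...] = ()          # empty = match any method (uppercase)
    headers: tuple[HeaderMatch, ...] = ()  # empty = match any headers


def path_matches(pm: PathMatch, path: str) -> bool:
    """Hypothetical helper: evaluate one path predicate."""
    if pm.type == "exact":
        return path == pm.value
    if pm.type == "prefix":
        # Segment boundary: "/api/v1" covers "/api/v1" and "/api/v1/foo",
        # but not "/api/v10".
        base = pm.value.rstrip("/")
        return path == base or path.startswith(base + "/")
    if pm.type == "regex":
        # Assumption: unanchored search, with `re` standing in for RE2.
        return re.search(pm.value, path) is not None
    raise ValueError(f"unknown path match type: {pm.type!r}")


def header_matches(hm: HeaderMatch, headers: dict[str, str]) -> bool:
    """Hypothetical helper: header names are compared case-insensitively."""
    actual = {k.lower(): v for k, v in headers.items()}.get(hm.name.lower())
    if actual is None:
        return False
    if hm.type == "regex":
        return re.search(hm.value, actual) is not None
    return actual == hm.value


def entry_matches(entry: MatchEntry, method: str, path: str,
                  headers: dict[str, str]) -> bool:
    """Predicates inside one entry are ANDed; listed paths are ORed."""
    if entry.methods and method.upper() not in entry.methods:
        return False
    if entry.paths and not any(path_matches(p, path) for p in entry.paths):
        return False
    return all(header_matches(h, headers) for h in entry.headers)


def route_matches(matches: tuple[MatchEntry, ...], method: str, path: str,
                  headers: dict[str, str]) -> bool:
    """Entries are ORed; a route with no entries matches every request."""
    if not matches:
        return True
    return any(entry_matches(e, method, path, headers) for e in matches)


# The files.pythonhosted.org route from the PRD example, as parsed values.
PYPI_MATCHES = (
    MatchEntry(paths=(PathMatch("prefix", "/packages/"),), methods=("GET", "HEAD")),
    MatchEntry(paths=(PathMatch("exact", "/upload"),), methods=("POST",)),
)
```

With these definitions, `route_matches(PYPI_MATCHES, "get", "/packages/ab/pkg.whl", {})` is true and the same path with `POST` is not, which is the OR-of-AND behaviour the decisions above describe. Uppercasing the method at comparison time keeps the sketch tolerant of unnormalised input; the PRD normalises at parse time instead.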