From 5265e25f9bbd39019d373ebb4e3a30877abc26ce Mon Sep 17 00:00:00 2001
From: claude <eric+claude@dideric.is>
Date: Fri, 5 Jun 2026 00:52:57 +0000
Subject: [PATCH] docs: address PR #196 review; update research decisions and
 PRD
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Research doc: close open questions with decisions from review — hard
cutover on path_allowlist, drop glob (regex sufficient), stick with
Gateway API OR semantics for headers, case-insensitive method names.

PRD 0053: adopt Gateway API HTTPRoute match vocabulary (paths, methods,
headers) as the route schema replacement for path_allowlist. Add
MatchEntry / PathMatch / HeaderMatch types to EgressRoute design; cite
the route matching research doc; fold match restructure into chunk 1
alongside the dlp block.
---
 docs/prds/0053-egress-dlp-addon.md           | 182 ++++++++++++++++---
 docs/research/yaml-route-matching-formats.md |  51 +++---
 2 files changed, 179 insertions(+), 54 deletions(-)
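
The matching semantics the PRD hunk below describes (predicate groups ANDed within one `matches` entry, entries ORed, segment-boundary `prefix`, bare route when `matches` is empty) can be sketched as follows. This is a minimal illustration, not the addon's code: `path_matches`, `header_matches`, `entry_matches`, and `request_allowed` are hypothetical names, the stdlib `re` module stands in for RE2, multiple `paths` in one entry are assumed to be ORed (as the research doc states), and header names are assumed to compare case-insensitively.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PathMatch:
    type: str   # "exact" | "prefix" | "regex"
    value: str


@dataclass(frozen=True)
class HeaderMatch:
    name: str
    value: str
    type: str = "exact"   # "exact" | "regex"


@dataclass(frozen=True)
class MatchEntry:
    paths: tuple[PathMatch, ...] = ()     # empty = match any path
    methods: tuple[str, ...] = ()         # empty = match any method (uppercase)
    headers: tuple[HeaderMatch, ...] = () # empty = match any headers


def path_matches(pm: PathMatch, path: str) -> bool:
    if pm.type == "exact":
        return path == pm.value
    if pm.type == "prefix":
        # Segment boundary: /api/v1 matches /api/v1 and /api/v1/foo, not /api/v10.
        base = pm.value.rstrip("/")
        return path == base or path.startswith(base + "/")
    if pm.type == "regex":
        # Assumption: stdlib `re` stands in for RE2 in this sketch.
        return re.search(pm.value, path) is not None
    return False


def header_matches(hm: HeaderMatch, headers: dict[str, str]) -> bool:
    # Assumption: header names are compared case-insensitively.
    wanted = hm.name.lower()
    actual = next((v for k, v in headers.items() if k.lower() == wanted), None)
    if actual is None:
        return False
    if hm.type == "regex":
        return re.search(hm.value, actual) is not None
    return actual == hm.value


def entry_matches(entry: MatchEntry, method: str, path: str,
                  headers: dict[str, str]) -> bool:
    # Groups are ANDed; an empty group imposes no constraint.
    if entry.paths and not any(path_matches(p, path) for p in entry.paths):
        return False
    if entry.methods and method.upper() not in entry.methods:
        return False
    return all(header_matches(h, headers) for h in entry.headers)


def request_allowed(matches: tuple[MatchEntry, ...], method: str, path: str,
                    headers: dict[str, str]) -> bool:
    # Entries are ORed; an empty list is the bare-route case (everything passes).
    return not matches or any(
        entry_matches(e, method, path, headers) for e in matches
    )
```

With the `files.pythonhosted.org` example from the PRD, a `GET` under `/packages/` and a `POST` to `/upload` would pass under this sketch, while a `POST` under `/packages/` would not.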

diff --git a/docs/prds/0053-egress-dlp-addon.md b/docs/prds/0053-egress-dlp-addon.md
index 8112700..8718d63 100644
--- a/docs/prds/0053-egress-dlp-addon.md
+++ b/docs/prds/0053-egress-dlp-addon.md
@@ -11,12 +11,19 @@ With pipelock removed (PR #193), the egress proxy no longer performs DLP
 scanning on traffic to or from the agent. This PRD implements a replacement
 directly inside the mitmproxy egress addon: per-route DLP detectors that
 scan outbound requests for credential leakage and inbound responses for
-prompt injection attempts. Configuration is expressed as a new `dlp` block
-on each `egress.routes` entry in the bottle manifest.
+prompt injection attempts.
 
-The design follows the recommendation in the [DLP research document
-(PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192) and
-covers all three remaining implementation phases from that plan:
+The manifest route schema is also upgraded in this PRD from the flat
+`path_allowlist` field to a structured `matches` block modelled on the
+[Kubernetes Gateway API `HTTPRoute`](https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.HTTPRouteMatch)
+match vocabulary. This upgrade is a hard cutover — no compatibility shim
+for the old format. The rationale and format survey are in the
+[YAML route matching formats research doc](https://gitea.dideric.is/didericis/bot-bottle/src/branch/main/docs/research/yaml-route-matching-formats.md).
+DLP detectors attach to the new `matches`-based routes directly.
+
+The design follows the recommendation in the
+[DLP research document (PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192)
+and covers all three remaining implementation phases from that plan:
 
 1. Token pattern detection (Phase 1a)
 2. Known-secrets detection (Phase 1b)
@@ -31,6 +38,11 @@ proxy with no DLP capability at all. The egress addon already holds per-route
 logic for path allowlisting and credential injection; DLP rules belong in the
 same place.
 
+The existing `path_allowlist` field is also limiting: it only supports path
+prefixes, with no way to express exact-path, regex, method, or header
+constraints. The Gateway API match vocabulary is a well-specified, widely
+deployed standard that covers all of these without inventing new syntax.
+
 ## Goals / Success Criteria
 
 1. Outbound request bodies and headers are scanned for known token patterns
@@ -46,8 +58,13 @@ same place.
    `dlp` block in the manifest.
 5. All detector logic lives in `egress_addon_core.py` (pure Python, no
    mitmproxy dependency) and is covered by unit tests on the host.
-6. Adding `dlp` configuration to a route that omits it entirely is
-   backward-compatible — the route behaves as if all detectors are enabled.
+6. Each route's `matches` block supports path (exact/prefix/regex), HTTP
+   method, and header predicates using Gateway API match semantics.
+7. The manifest change is a hard cutover: `path_allowlist` is removed with
+   no fallback, no deprecation alias, and no dedicated error for old-format
+   manifests. Old manifests that use `path_allowlist` fail validation at
+   load time with the generic unknown-key error (same as any other
+   unrecognised key today).
 
 ## Non-goals
 
@@ -60,12 +77,96 @@ same place.
   exfil relevant to agent containment.
 - Changes to the cred-proxy sidecar.
 - Streaming response scanning (scan buffered response body only).
+- Glob-style path matching — regex covers every case glob would handle
+  without adding a third path-matching language.
 
 ## Design
 
+### Route matching: Gateway API `matches` vocabulary
+
+The existing `path_allowlist` field is replaced by a `matches` list. The
+vocabulary mirrors Kubernetes Gateway API `HTTPRouteMatch` (see the
+[route matching research doc](https://gitea.dideric.is/didericis/bot-bottle/src/branch/main/docs/research/yaml-route-matching-formats.md)
+for a full format survey and rationale). Gateway API was chosen because it
+is spec-backed and implementation-tested across multiple proxies, and
+because its `{type, value}` pattern is consistent and schema-validatable.
+
+**AND/OR semantics** (same as Gateway API):
+- Predicate groups (`paths`, `methods`, `headers`) *within* a single `matches` entry are ANDed.
+- Multiple entries in the `matches` list are ORed — the route matches if
+  any entry matches.
+
+```yaml
+egress:
+  routes:
+    # Bare route — all traffic to this host is forwarded (no path/method/header
+    # constraints). Equivalent to the old path_allowlist-omitted case.
+    - host: api.anthropic.com
+      auth:
+        scheme: Bearer
+        token_ref: EGRESS_TOKEN_0
+
+    # Two match entries (OR): GET/HEAD under /packages/ OR POST on /upload
+    - host: files.pythonhosted.org
+      matches:
+        - paths:
+            - type: prefix
+              value: /packages/
+          methods: [GET, HEAD]
+        - paths:
+            - type: exact
+              value: /upload
+          methods: [POST]
+      dlp:
+        inbound_detectors: false   # skip response scanning (binary downloads)
+
+    # Header + regex path: only JSON API requests to versioned endpoints
+    - host: internal-api.corp
+      matches:
+        - paths:
+            - type: regex
+              value: "^/v[0-9]+/"
+          headers:
+            - name: Content-Type
+              type: exact
+              value: application/json
+      dlp:
+        outbound_detectors: false
+        inbound_detectors: false
+```
+
+#### Path matching types
+
+| `type` | Semantics |
+|--------|-----------|
+| `exact` | Full path must equal `value` exactly |
+| `prefix` | Path must start with `value` at a segment boundary (value `/api/v1` matches `/api/v1` and `/api/v1/foo`, rejects `/api/v10`) |
+| `regex` | RE2 regex; rejected at load time if pattern fails to compile. Use for wildcard needs: `/api/[^/]+/data` instead of glob |
+
+`type` defaults to `prefix` when omitted (preserves the semantics of the
+old `path_allowlist`).
+
+#### Method matching
+
+`methods` is a list of HTTP method names, case-insensitive at parse time —
+`get`, `GET`, and `Get` are all accepted and stored as uppercase internally.
+An absent or empty `methods` list means all methods are permitted.
+
+#### Header matching
+
+`headers` is a list of `{name, value, type}` objects. ALL listed headers
+must match (AND semantics). To OR on header values, use multiple `matches`
+entries.
+
+| `type` | Semantics |
+|--------|-----------|
+| `exact` | Header value equals `value` (default when `type` omitted) |
+| `regex` | Header value matches RE2 regex |
+
 ### Manifest schema — `dlp` block
 
-Each `egress.routes` entry gains an optional `dlp` key:
+Each `egress.routes` entry gains an optional `dlp` key alongside `matches`
+and `auth`:
 
 ```yaml
 egress:
@@ -100,13 +201,34 @@ rejects unknown detector names.
 
 ### `EgressRoute` changes
 
-`EgressRoute` gains two new fields:
+`EgressRoute` replaces `PathAllowlist` with `Matches` and gains two new
+DLP fields. `MatchEntry` captures one AND-predicate block:
 
 ```python
+@dataclass(frozen=True)
+class PathMatch:
+    type: str   # "exact" | "prefix" | "regex"
+    value: str
+
+
+@dataclass(frozen=True)
+class HeaderMatch:
+    name: str
+    value: str
+    type: str = "exact"   # "exact" | "regex"
+
+
+@dataclass(frozen=True)
+class MatchEntry:
+    paths: tuple[PathMatch, ...] = ()     # empty = match any path
+    methods: tuple[str, ...] = ()         # empty = match any method (uppercase)
+    headers: tuple[HeaderMatch, ...] = () # empty = match any headers
+
+
 @dataclass(frozen=True)
 class EgressRoute:
     Host: str
-    PathAllowlist: tuple[str, ...] = ()
+    Matches: tuple[MatchEntry, ...] = ()  # empty = match all requests
     AuthScheme: str = ""
     TokenRef: str = ""
     Role: tuple[str, ...] = ()
@@ -114,28 +236,30 @@ class EgressRoute:
     InboundDetectors: tuple[str, ...] | None = None    # None = all enabled
 ```
 
-`None` means "use defaults" (all active); an empty `tuple[str, ...]` means
-"disabled". Named detectors use `tuple[str, ...]` with the detector name.
-
-`manifest_egress.py` uses `from_dict` to parse the new `dlp` block and
-populate these fields; unknown keys inside `dlp` are rejected.
+`manifest_egress.py`'s `from_dict` parses the new `matches` block and `dlp`
+block; `path_allowlist` is no longer a recognised key and will be rejected
+by the unknown-key check.
 
 ### `Route` changes in `egress_addon_core.py`
 
-The addon-side `Route` dataclass mirrors the manifest-side change:
+The addon-side `Route` and its helper types mirror the manifest-side changes.
+Request matching is extended to evaluate the `matches` list:
 
 ```python
 @dataclass(frozen=True)
 class Route:
     host: str
-    path_allowlist: tuple[str, ...] = ()
+    matches: tuple[MatchEntry, ...] = ()
     auth_scheme: str = ""
     token_env: str = ""
     outbound_detectors: tuple[str, ...] | None = None
     inbound_detectors: tuple[str, ...] | None = None
 ```
 
-`parse_routes` / `_parse_one` grow the corresponding parsing logic.
+`decide()` first calls `match_route` (host lookup, unchanged), then
+evaluates the route's match entries in order; a route with no `matches`
+entries passes all requests. The `prefix` path type uses segment-boundary
+checking (`/api/v1` matches `/api/v1/foo` but not `/api/v10`).
 
 ### Detector interface
 
@@ -212,7 +336,7 @@ Pattern-based inbound response scanner. Uses two tiers:
 - Single jailbreak keyword without additional context.
 - Common documentation phrases.
 
-See the research doc for the full phrase lists and pseudocode.
+See the DLP research doc for the full phrase lists and pseudocode.
 
 ### Wiring into `egress_addon.py`
 
@@ -220,7 +344,7 @@ Two new mitmproxy hooks are added alongside the existing `request` hook:
 
 ```python
 def request(self, flow: http.HTTPFlow) -> None:
-    # ... existing path-allowlist + auth-injection logic ...
+    # ... existing match + auth-injection logic ...
     # After route decision, if action == "forward":
     result = scan_outbound(route, flow.request, os.environ)
     if result and result.severity == "block":
@@ -250,20 +374,20 @@ afterward, preserving the existing credential-injection security model.
 
 ## Implementation chunks
 
-1. **Manifest `dlp` block + `EgressRoute` fields.**
-   `manifest_egress.py`: parse `dlp`, add `OutboundDetectors` /
-   `InboundDetectors` to `EgressRoute`. Extend
-   `tests/unit/test_manifest_egress.py` with `dlp` valid/invalid cases.
-   `egress_addon_core.py`: add `outbound_detectors` / `inbound_detectors`
-   to `Route`; update `_parse_one` and `parse_routes`; extend
-   `tests/unit/test_egress_addon_core.py`.
+1. **New `matches` block + `EgressRoute` / `Route` restructure.**
+   Remove `path_allowlist` from `manifest_egress.py` and `egress_addon_core.py`.
+   Add `MatchEntry`, `PathMatch`, `HeaderMatch` types. Parse `matches` in
+   `EgressRoute.from_dict` and `_parse_one`; unknown-key rejection handles
+   old `path_allowlist` manifests. Add `OutboundDetectors` / `InboundDetectors`
+   to `EgressRoute` and `Route`; parse `dlp` block. Extend
+   `tests/unit/test_manifest_egress.py` and `tests/unit/test_egress_addon_core.py`
+   with match and dlp valid/invalid cases.
 
 2. **Token-patterns detector (Phase 1a).**
    New module `bot_bottle/dlp_detectors.py` (host-importable) and
    companion flat copy for the sidecar bundle. Add `TokenPatternsDetector`
    with the regex set above. Wire `scan_outbound` into the `request` hook
-   in `egress_addon.py`. Unit tests in
-   `tests/unit/test_dlp_detectors.py`.
+   in `egress_addon.py`. Unit tests in `tests/unit/test_dlp_detectors.py`.
 
 3. **Known-secrets detector (Phase 1b).**
    Add `KnownSecretsDetector` to `dlp_detectors.py`. Collect
diff --git a/docs/research/yaml-route-matching-formats.md b/docs/research/yaml-route-matching-formats.md
index 2af2bb7..17b8ee3 100644
--- a/docs/research/yaml-route-matching-formats.md
+++ b/docs/research/yaml-route-matching-formats.md
@@ -407,23 +407,20 @@ egress:
 All predicates within `match` are ANDed. A list of `paths` entries is
 ORed (first match wins — same as the current `path_allowlist` semantics).
 
-### 2. Path type enum (`exact` | `prefix` | `glob` | `regex`)
+### 2. Path type enum (`exact` | `prefix` | `regex`)
 
-Use four named types rather than inferring from the value's syntax. This
+Use three named types rather than inferring from the value's syntax. This
 avoids the ambiguity that plagues `.gitignore` and `nginx location` patterns
 where the same string can mean different things depending on leading characters.
 
 - `prefix`: mirrors current `path_allowlist` semantics.
-- `glob`: adopts ALB-style `*` (single segment) and `**` (multi-segment,
-  covering `/api/*/data` and `/api/**/data`). Simpler for operators than
-  writing regex.
-- `regex`: RE2 for advanced cases. Reject at load time if the pattern fails
-  to compile.
+- `regex`: RE2 for wildcard and advanced cases. Reject at load time if the
+  pattern fails to compile. Covers every case glob would handle —
+  `/api/[^/]+/data` is the `/api/*/data` equivalent.
 
-**Glob semantics decision needed:** ALB's `*` matches across `/`; most
-shell-glob conventions treat `*` as intra-segment and `**` as cross-segment.
-The shell convention (`*` = no slash, `**` = slash-crossing) is less
-surprising to operators and avoids accidental over-matching.
+Glob-style syntax is not included: it adds a third path-matching language
+on top of prefix and regex without meaningful operator benefit, since regex
+is already required for any non-trivial wildcard.
 
 ### 3. Header matching as a list of `{name, value, type}` objects
 
@@ -469,18 +466,22 @@ predicates.
 
 ---
 
-## Open questions
+## Decisions
 
-1. **Backward compatibility:** `path_allowlist` is the current field. If
-   adopting a `match`/`matches` structure, keep `path_allowlist` as a
-   deprecated alias? Or treat this as a breaking manifest version bump?
-2. **Glob segment semantics:** adopt shell convention (`*` = intra-segment,
-   `**` = cross-segment) or ALB convention (`*` = anything including `/`)?
-   The shell convention is safer; ALB's is simpler.
-3. **Header value OR:** Gateway API requires a separate match entry to OR
-   header values. ALB allows multiple values in one condition. Which is
-   less surprising for bot-bottle operators? The ALB approach is more
-   concise for the common case (e.g., `Content-Type: [application/json,
-   application/x-www-form-urlencoded]`).
-4. **Case sensitivity on method names:** normalize to uppercase at parse
-   time (fail on unrecognised values) or case-insensitively?
+The open questions raised during research were resolved in PR #196 review:
+
+1. **Backward compatibility:** Hard cutover. The new `matches` structure
+   replaces `path_allowlist` entirely with no compatibility shim and no
+   fallback parsing for the old format. Manifests using `path_allowlist`
+   must be migrated.
+
+2. **Glob support:** Dropped. Glob is not strictly necessary: `regex` covers
+   every case it would handle, and leaves fewer path-matching languages to
+   document and validate.
+
+3. **Header value OR:** Stick with Gateway API. OR across header values
+   requires a separate entry in the `matches` list, not multiple values
+   inside one `headers` block.
+
+4. **Method name case:** Case-insensitive at parse time. `get`, `GET`, and
+   `Get` are all accepted and normalised to uppercase internally.