docs: address PR #196 review; update research decisions and PRD
Research doc: close open questions with decisions from review — hard cutover on path_allowlist, drop glob (regex sufficient), stick with Gateway API OR semantics for headers, case-insensitive method names. PRD 0053: adopt Gateway API HTTPRoute match vocabulary (paths, methods, headers) as the route schema replacement for path_allowlist. Add MatchEntry / PathMatch / HeaderMatch types to EgressRoute design; cite the route matching research doc; fold match restructure into chunk 1 alongside the dlp block.
@@ -11,12 +11,19 @@ With pipelock removed (PR #193), the egress proxy no longer performs DLP
|
|||||||
scanning on traffic to or from the agent. This PRD implements a replacement
|
scanning on traffic to or from the agent. This PRD implements a replacement
|
||||||
directly inside the mitmproxy egress addon: per-route DLP detectors that
|
directly inside the mitmproxy egress addon: per-route DLP detectors that
|
||||||
scan outbound requests for credential leakage and inbound responses for
|
scan outbound requests for credential leakage and inbound responses for
|
||||||
prompt injection attempts. Configuration is expressed as a new `dlp` block
|
prompt injection attempts.
|
||||||
on each `egress.routes` entry in the bottle manifest.
|
|
||||||
|
|
||||||
The design follows the recommendation in the [DLP research document
|
The manifest route schema is also upgraded in this PRD from the flat
|
||||||
(PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192) and
|
`path_allowlist` field to a structured `matches` block modelled on the
|
||||||
covers all three remaining implementation phases from that plan:
|
[Kubernetes Gateway API `HTTPRoute`](https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.HTTPRouteMatch)
|
||||||
|
match vocabulary. This upgrade is a hard cutover — no compatibility shim
|
||||||
|
for the old format. The rationale and format survey are in the
|
||||||
|
[YAML route matching formats research doc](https://gitea.dideric.is/didericis/bot-bottle/src/branch/main/docs/research/yaml-route-matching-formats.md).
|
||||||
|
DLP detectors attach to the new `matches`-based routes directly.
|
||||||
|
|
||||||
|
The design follows the recommendation in the
|
||||||
|
[DLP research document (PR #192)](https://gitea.dideric.is/didericis/bot-bottle/pulls/192)
|
||||||
|
and covers all three remaining implementation phases from that plan:
|
||||||
|
|
||||||
1. Token pattern detection (Phase 1a)
|
1. Token pattern detection (Phase 1a)
|
||||||
2. Known-secrets detection (Phase 1b)
|
2. Known-secrets detection (Phase 1b)
|
||||||
@@ -31,6 +38,11 @@ proxy with no DLP capability at all. The egress addon already holds per-route
|
|||||||
logic for path allowlisting and credential injection; DLP rules belong in the
|
logic for path allowlisting and credential injection; DLP rules belong in the
|
||||||
same place.
|
same place.
|
||||||
|
|
||||||
|
The existing `path_allowlist` field is also limiting: it only supports path
|
||||||
|
prefixes, with no way to express exact-path, regex, method, or header
|
||||||
|
constraints. The Gateway API match vocabulary is a well-specified, widely
|
||||||
|
deployed standard that covers all of these without inventing new syntax.
|
||||||
|
|
||||||
## Goals / Success Criteria
|
## Goals / Success Criteria
|
||||||
|
|
||||||
1. Outbound request bodies and headers are scanned for known token patterns
|
1. Outbound request bodies and headers are scanned for known token patterns
|
||||||
@@ -46,8 +58,13 @@ same place.
|
|||||||
`dlp` block in the manifest.
|
`dlp` block in the manifest.
|
||||||
5. All detector logic lives in `egress_addon_core.py` (pure Python, no
|
5. All detector logic lives in `egress_addon_core.py` (pure Python, no
|
||||||
mitmproxy dependency) and is covered by unit tests on the host.
|
mitmproxy dependency) and is covered by unit tests on the host.
|
||||||
6. Adding `dlp` configuration to a route that omits it entirely is
|
6. Each route's `matches` block supports path (exact/prefix/regex), HTTP
|
||||||
backward-compatible — the route behaves as if all detectors are enabled.
|
method, and header predicates using Gateway API match semantics.
|
||||||
|
7. The manifest change is a hard cutover: `path_allowlist` is removed with
   no fallback, no deprecation alias, and no dedicated error for old-format
   manifests. Old manifests that use `path_allowlist` fail validation
   at load time with an unknown-key error (the same as any other unrecognised
   key today).
|
||||||
|
|
||||||
## Non-goals
|
## Non-goals
|
||||||
|
|
||||||
@@ -60,12 +77,96 @@ same place.
|
|||||||
exfil relevant to agent containment.
|
exfil relevant to agent containment.
|
||||||
- Changes to the cred-proxy sidecar.
|
- Changes to the cred-proxy sidecar.
|
||||||
- Streaming response scanning (scan buffered response body only).
|
- Streaming response scanning (scan buffered response body only).
|
||||||
|
- Glob-style path matching — regex covers every case glob would handle
|
||||||
|
without adding a third path-matching language.
|
||||||
|
|
||||||
## Design
|
## Design
|
||||||
|
|
||||||
|
### Route matching: Gateway API `matches` vocabulary
|
||||||
|
|
||||||
|
The existing `path_allowlist` field is replaced by a `matches` list. The
vocabulary mirrors Kubernetes Gateway API `HTTPRouteMatch` (see the
[route matching research doc](https://gitea.dideric.is/didericis/bot-bottle/src/branch/main/docs/research/yaml-route-matching-formats.md)
for a full format survey and rationale). Gateway API was chosen because it
is spec-backed, implementation-tested across multiple proxies, and its
`{type, value}` pattern is consistent and schema-validatable.
|
||||||
|
|
||||||
|
**AND/OR semantics** (same as Gateway API):
- Predicates *within* a single `matches` entry are ANDed.
- Multiple entries in the `matches` list are ORed: the route matches if
  any entry matches.
|
||||||
|
|
||||||
|
```yaml
egress:
  routes:
    # Bare route: all traffic to this host is forwarded (no path/method/header
    # constraints). Equivalent to the old path_allowlist-omitted case.
    - host: api.anthropic.com
      auth:
        scheme: Bearer
        token_ref: EGRESS_TOKEN_0

    # Two match entries (OR): GET/HEAD on paths under /packages/ OR POST on /upload
    - host: files.pythonhosted.org
      matches:
        - paths:
            - type: prefix
              value: /packages/
          methods: [GET, HEAD]
        - paths:
            - type: exact
              value: /upload
          methods: [POST]
      dlp:
        inbound_detectors: false   # skip response scanning (binary downloads)

    # Header + regex path: only JSON requests on versioned endpoints
    - host: internal-api.corp
      matches:
        - paths:
            - type: regex
              value: "^/v[0-9]+/"
          headers:
            - name: Content-Type
              type: exact
              value: application/json
      dlp:
        outbound_detectors: false
        inbound_detectors: false
```
|
||||||
|
|
||||||
|
#### Path matching types
|
||||||
|
|
||||||
|
| `type` | Semantics |
|--------|-----------|
| `exact` | Full path must equal `value` exactly |
| `prefix` | Path must start with `value` at a segment boundary (value `/api/v1` matches `/api/v1` and `/api/v1/foo`, rejects `/api/v10`) |
| `regex` | RE2 regex; rejected at load time if the pattern fails to compile. Use for wildcard needs: `/api/[^/]+/data` instead of glob |
|
||||||
|
|
||||||
|
`type` defaults to `prefix` when omitted (preserves the semantic of the
|
||||||
|
old `path_allowlist`).
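
The three path types can be sketched as below. This is an illustration only: `path_matches` is a hypothetical helper, not a function named in the PRD; Python's `re` module stands in for RE2; and unanchored `re.search` semantics for `regex` are an assumption, since the PRD does not say whether a pattern must cover the whole path.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PathMatch:
    type: str   # "exact" | "prefix" | "regex"
    value: str


def path_matches(rule: PathMatch, path: str) -> bool:
    """Return True if `path` satisfies `rule` (hypothetical helper)."""
    if rule.type == "exact":
        return path == rule.value
    if rule.type == "prefix":
        # Segment-boundary prefix: "/api/v1" covers "/api/v1" and
        # "/api/v1/...", but not "/api/v10".
        base = rule.value.rstrip("/")
        return path == base or path.startswith(base + "/")
    if rule.type == "regex":
        # Assumption: stdlib `re` used in place of RE2, unanchored search.
        return re.search(rule.value, path) is not None
    raise ValueError(f"unknown path match type: {rule.type!r}")
```

Stripping a trailing slash from the prefix value before comparing means `/packages/` and `/packages` behave identically, which follows the Gateway API `PathPrefix` convention.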
|
||||||
|
|
||||||
|
#### Method matching
|
||||||
|
|
||||||
|
`methods` is a list of HTTP method names, case-insensitive at parse time —
|
||||||
|
`get`, `GET`, and `Get` are all accepted and stored as uppercase internally.
|
||||||
|
An absent or empty `methods` list means all methods are permitted.
|
||||||
|
|
||||||
|
#### Header matching
|
||||||
|
|
||||||
|
`headers` is a list of `{name, value, type}` objects. ALL listed headers
|
||||||
|
must match (AND semantics). To OR on header values, use multiple `matches`
|
||||||
|
entries.
|
||||||
|
|
||||||
|
| `type` | Semantics |
|--------|-----------|
| `exact` | Header value equals `value` (default when `type` omitted) |
| `regex` | Header value matches RE2 regex |
|
||||||
|
|
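
The header rules above could be evaluated roughly as follows. `headers_match` is a hypothetical helper; header names are compared case-insensitively (as HTTP requires), while case-sensitive comparison of values and the use of Python's `re` in place of RE2 are assumptions of this sketch.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class HeaderMatch:
    name: str
    value: str
    type: str = "exact"  # "exact" | "regex"


def headers_match(rules: Iterable[HeaderMatch], headers: Mapping[str, str]) -> bool:
    """Return True only if every rule matches (AND semantics)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for rule in rules:
        actual = lowered.get(rule.name.lower())
        if actual is None:
            return False  # a required header is missing
        if rule.type == "exact":
            if actual != rule.value:
                return False
        elif rule.type == "regex":
            # Assumption: stdlib `re` used in place of RE2.
            if re.search(rule.value, actual) is None:
                return False
        else:
            raise ValueError(f"unknown header match type: {rule.type!r}")
    return True
```

Under `exact`, a request carrying `Content-Type: application/json; charset=utf-8` would not match a rule whose value is `application/json`; a `regex` rule is one way to tolerate parameters.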
||||||
### Manifest schema — `dlp` block
|
### Manifest schema — `dlp` block
|
||||||
|
|
||||||
Each `egress.routes` entry gains an optional `dlp` key:
|
Each `egress.routes` entry gains an optional `dlp` key alongside `matches`
|
||||||
|
and `auth`:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
egress:
|
egress:
|
||||||
@@ -100,13 +201,34 @@ rejects unknown detector names.
|
|||||||
|
|
||||||
### `EgressRoute` changes
|
### `EgressRoute` changes
|
||||||
|
|
||||||
`EgressRoute` gains two new fields:
|
`EgressRoute` replaces `PathAllowlist` with `Matches` and gains two new
|
||||||
|
DLP fields. `MatchEntry` captures one AND-predicate block:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
|
@dataclass(frozen=True)
class PathMatch:
    type: str   # "exact" | "prefix" | "regex"
    value: str


@dataclass(frozen=True)
class HeaderMatch:
    name: str
    value: str
    type: str = "exact"  # "exact" | "regex"


@dataclass(frozen=True)
class MatchEntry:
    paths: tuple[PathMatch, ...] = ()      # empty = match any path
    methods: tuple[str, ...] = ()          # empty = match any method (uppercase)
    headers: tuple[HeaderMatch, ...] = ()  # empty = match any headers
|
||||||
|
|
||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class EgressRoute:
|
class EgressRoute:
|
||||||
Host: str
|
Host: str
|
||||||
PathAllowlist: tuple[str, ...] = ()
|
Matches: tuple[MatchEntry, ...] = () # empty = match all requests
|
||||||
AuthScheme: str = ""
|
AuthScheme: str = ""
|
||||||
TokenRef: str = ""
|
TokenRef: str = ""
|
||||||
Role: tuple[str, ...] = ()
|
Role: tuple[str, ...] = ()
|
||||||
@@ -114,28 +236,30 @@ class EgressRoute:
|
|||||||
InboundDetectors: tuple[str, ...] | None = None # None = all enabled
|
InboundDetectors: tuple[str, ...] | None = None # None = all enabled
|
||||||
```
|
```
|
||||||
|
|
||||||
`None` means "use defaults" (all active); an empty `tuple[str, ...]` means
|
`manifest_egress.py`'s `from_dict` parses the new `matches` block and `dlp`
|
||||||
"disabled". Named detectors use `tuple[str, ...]` with the detector name.
|
block; `path_allowlist` is no longer a recognised key and will be rejected
|
||||||
|
by the unknown-key check.
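
A rough sketch of how the hard cutover and load-time validation might look. The helper names and the `ALLOWED_ROUTE_KEYS` set are hypothetical and deliberately incomplete; the actual `from_dict` in `manifest_egress.py` is not shown in this change, and Python's `re.compile` stands in for an RE2 compile check.

```python
import re

# Illustrative subset of route keys; the real schema may allow more.
ALLOWED_ROUTE_KEYS = frozenset({"host", "matches", "auth", "dlp"})
PATH_TYPES = frozenset({"exact", "prefix", "regex"})


def check_route_keys(raw: dict) -> None:
    """Reject any key the schema does not know, including `path_allowlist`."""
    unknown = sorted(set(raw) - ALLOWED_ROUTE_KEYS)
    if unknown:
        raise ValueError(f"unknown route key(s): {', '.join(unknown)}")


def parse_path(raw: dict) -> tuple[str, str]:
    """Return (type, value) for one `paths` item, validating at load time."""
    kind = raw.get("type", "prefix")  # `type` defaults to prefix when omitted
    if kind not in PATH_TYPES:
        raise ValueError(f"unknown path match type: {kind!r}")
    value = raw["value"]
    if kind == "regex":
        try:
            re.compile(value)  # assumption: stdlib `re` in place of RE2
        except re.error as exc:
            raise ValueError(f"invalid regex {value!r}: {exc}") from exc
    return kind, value
```

Because `path_allowlist` is simply absent from the allowed set, old manifests fail through the same code path as any other unrecognised key, with no special case.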
|
||||||
`manifest_egress.py` uses `from_dict` to parse the new `dlp` block and
|
|
||||||
populate these fields; unknown keys inside `dlp` are rejected.
|
|
||||||
|
|
||||||
### `Route` changes in `egress_addon_core.py`
|
### `Route` changes in `egress_addon_core.py`
|
||||||
|
|
||||||
The addon-side `Route` dataclass mirrors the manifest-side change:
|
The addon-side `Route` and its helper types mirror the manifest-side changes.
|
||||||
|
`match_route` is extended to evaluate the `Matches` list:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class Route:
|
class Route:
|
||||||
host: str
|
host: str
|
||||||
path_allowlist: tuple[str, ...] = ()
|
matches: tuple[MatchEntry, ...] = ()
|
||||||
auth_scheme: str = ""
|
auth_scheme: str = ""
|
||||||
token_env: str = ""
|
token_env: str = ""
|
||||||
outbound_detectors: tuple[str, ...] | None = None
|
outbound_detectors: tuple[str, ...] | None = None
|
||||||
inbound_detectors: tuple[str, ...] | None = None
|
inbound_detectors: tuple[str, ...] | None = None
|
||||||
```
|
```
|
||||||
|
|
||||||
`parse_routes` / `_parse_one` grow the corresponding parsing logic.
|
`decide()` feeds through `match_route` (unchanged host lookup) then
|
||||||
|
evaluates the match entries in order; if the route has no `matches` entries
|
||||||
|
all requests pass. Path `prefix` type uses segment-boundary checking
|
||||||
|
(`/api/v1` matches `/api/v1/foo` but not `/api/v10`).
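
The evaluation order described above can be sketched as follows. `entry_matches` and `request_allowed` are hypothetical names (the PRD only names `decide()` and `match_route`), the dataclasses are repeated so the example is self-contained, and Python's `re` again stands in for RE2. Multiple `paths` inside one entry are treated as ORed, as in the research doc.

```python
import re
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PathMatch:
    type: str
    value: str


@dataclass(frozen=True)
class HeaderMatch:
    name: str
    value: str
    type: str = "exact"


@dataclass(frozen=True)
class MatchEntry:
    paths: tuple[PathMatch, ...] = ()
    methods: tuple[str, ...] = ()
    headers: tuple[HeaderMatch, ...] = ()


def _path_ok(rule: PathMatch, path: str) -> bool:
    if rule.type == "exact":
        return path == rule.value
    if rule.type == "prefix":
        base = rule.value.rstrip("/")
        return path == base or path.startswith(base + "/")
    return re.search(rule.value, path) is not None  # "regex"


def _header_ok(rule: HeaderMatch, headers: Mapping[str, str]) -> bool:
    actual = {k.lower(): v for k, v in headers.items()}.get(rule.name.lower())
    if actual is None:
        return False
    if rule.type == "regex":
        return re.search(rule.value, actual) is not None
    return actual == rule.value


def entry_matches(entry: MatchEntry, method: str, path: str,
                  headers: Mapping[str, str]) -> bool:
    """Predicates inside one entry are ANDed; empty predicates match anything."""
    return (
        (not entry.paths or any(_path_ok(p, path) for p in entry.paths))
        and (not entry.methods or method.upper() in entry.methods)
        and all(_header_ok(h, headers) for h in entry.headers)
    )


def request_allowed(matches: tuple[MatchEntry, ...], method: str, path: str,
                    headers: Mapping[str, str]) -> bool:
    """Entries are ORed; a route with no entries passes every request."""
    if not matches:
        return True
    return any(entry_matches(e, method, path, headers) for e in matches)
```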
|
||||||
|
|
||||||
### Detector interface
|
### Detector interface
|
||||||
|
|
||||||
@@ -212,7 +336,7 @@ Pattern-based inbound response scanner. Uses two tiers:
|
|||||||
- Single jailbreak keyword without additional context.
|
- Single jailbreak keyword without additional context.
|
||||||
- Common documentation phrases.
|
- Common documentation phrases.
|
||||||
|
|
||||||
See the research doc for the full phrase lists and pseudocode.
|
See the DLP research doc for the full phrase lists and pseudocode.
|
||||||
|
|
||||||
### Wiring into `egress_addon.py`
|
### Wiring into `egress_addon.py`
|
||||||
|
|
||||||
@@ -220,7 +344,7 @@ Two new mitmproxy hooks are added alongside the existing `request` hook:
|
|||||||
|
|
||||||
```python
|
```python
|
||||||
def request(self, flow: http.HTTPFlow) -> None:
|
def request(self, flow: http.HTTPFlow) -> None:
|
||||||
# ... existing path-allowlist + auth-injection logic ...
|
# ... existing match + auth-injection logic ...
|
||||||
# After route decision, if action == "forward":
|
# After route decision, if action == "forward":
|
||||||
result = scan_outbound(route, flow.request, os.environ)
|
result = scan_outbound(route, flow.request, os.environ)
|
||||||
if result and result.severity == "block":
|
if result and result.severity == "block":
|
||||||
@@ -250,20 +374,20 @@ afterward, preserving the existing credential-injection security model.
|
|||||||
|
|
||||||
## Implementation chunks
|
## Implementation chunks
|
||||||
|
|
||||||
1. **Manifest `dlp` block + `EgressRoute` fields.**
|
1. **New `matches` block + `EgressRoute` / `Route` restructure.**
|
||||||
`manifest_egress.py`: parse `dlp`, add `OutboundDetectors` /
|
Remove `path_allowlist` from `manifest_egress.py` and `egress_addon_core.py`.
|
||||||
`InboundDetectors` to `EgressRoute`. Extend
|
Add `MatchEntry`, `PathMatch`, `HeaderMatch` types. Parse `matches` in
|
||||||
`tests/unit/test_manifest_egress.py` with `dlp` valid/invalid cases.
|
`EgressRoute.from_dict` and `_parse_one`; unknown-key rejection handles
|
||||||
`egress_addon_core.py`: add `outbound_detectors` / `inbound_detectors`
|
old `path_allowlist` manifests. Add `OutboundDetectors` / `InboundDetectors`
|
||||||
to `Route`; update `_parse_one` and `parse_routes`; extend
|
to `EgressRoute` and `Route`; parse `dlp` block. Extend
|
||||||
`tests/unit/test_egress_addon_core.py`.
|
`tests/unit/test_manifest_egress.py` and `tests/unit/test_egress_addon_core.py`
|
||||||
|
with match and dlp valid/invalid cases.
|
||||||
|
|
||||||
2. **Token-patterns detector (Phase 1a).**
|
2. **Token-patterns detector (Phase 1a).**
|
||||||
New module `bot_bottle/dlp_detectors.py` (host-importable) and
|
New module `bot_bottle/dlp_detectors.py` (host-importable) and
|
||||||
companion flat copy for the sidecar bundle. Add `TokenPatternsDetector`
|
companion flat copy for the sidecar bundle. Add `TokenPatternsDetector`
|
||||||
with the regex set above. Wire `scan_outbound` into the `request` hook
|
with the regex set above. Wire `scan_outbound` into the `request` hook
|
||||||
in `egress_addon.py`. Unit tests in
|
in `egress_addon.py`. Unit tests in `tests/unit/test_dlp_detectors.py`.
|
||||||
`tests/unit/test_dlp_detectors.py`.
|
|
||||||
|
|
||||||
3. **Known-secrets detector (Phase 1b).**
|
3. **Known-secrets detector (Phase 1b).**
|
||||||
Add `KnownSecretsDetector` to `dlp_detectors.py`. Collect
|
Add `KnownSecretsDetector` to `dlp_detectors.py`. Collect
|
||||||
|
|||||||
@@ -407,23 +407,20 @@ egress:
|
|||||||
All predicates within `match` are ANDed. A list of `paths` entries is
|
All predicates within `match` are ANDed. A list of `paths` entries is
|
||||||
ORed (first match wins — same as the current `path_allowlist` semantics).
|
ORed (first match wins — same as the current `path_allowlist` semantics).
|
||||||
|
|
||||||
### 2. Path type enum (`exact` | `prefix` | `glob` | `regex`)
|
### 2. Path type enum (`exact` | `prefix` | `regex`)
|
||||||
|
|
||||||
Use four named types rather than inferring from the value's syntax. This
|
Use three named types rather than inferring from the value's syntax. This
|
||||||
avoids the ambiguity that plagues `.gitignore` and `nginx location` patterns
|
avoids the ambiguity that plagues `.gitignore` and `nginx location` patterns
|
||||||
where the same string can mean different things depending on leading characters.
|
where the same string can mean different things depending on leading characters.
|
||||||
|
|
||||||
- `prefix`: mirrors current `path_allowlist` semantics.
|
- `prefix`: mirrors current `path_allowlist` semantics.
|
||||||
- `glob`: adopts ALB-style `*` (single segment) and `**` (multi-segment,
|
- `regex`: RE2 for wildcard and advanced cases. Reject at load time if the
|
||||||
covering `/api/*/data` and `/api/**/data`). Simpler for operators than
|
pattern fails to compile. Covers every case glob would handle —
|
||||||
writing regex.
|
`/api/[^/]+/data` is the `/api/*/data` equivalent.
|
||||||
- `regex`: RE2 for advanced cases. Reject at load time if the pattern fails
|
|
||||||
to compile.
|
|
||||||
|
|
||||||
**Glob semantics decision needed:** ALB's `*` matches across `/`; most
|
Glob-style syntax is not included: it adds a third path-matching language
|
||||||
shell-glob conventions treat `*` as intra-segment and `**` as cross-segment.
|
on top of prefix and regex without meaningful operator benefit, since regex
|
||||||
The shell convention (`*` = no slash, `**` = slash-crossing) is less
|
is already required for any non-trivial wildcard.
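
As a quick check of the glob-to-regex equivalence, the sketch below uses Python's `re` as a stand-in for RE2. The single-segment pattern comes from the text above; the `^`/`$` anchors and the multi-segment pattern for `/api/**/data` are illustrative additions, and the latter assumes `**` may match zero or more segments.

```python
import re

# `/api/*/data`: exactly one path segment between /api/ and /data.
SINGLE_SEGMENT = re.compile(r"^/api/[^/]+/data$")

# `/api/**/data`: zero or more segments (assumed `**` semantics).
MULTI_SEGMENT = re.compile(r"^/api/(?:[^/]+/)*data$")


def matches_single(path: str) -> bool:
    return SINGLE_SEGMENT.search(path) is not None


def matches_multi(path: str) -> bool:
    return MULTI_SEGMENT.search(path) is not None
```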
|
||||||
surprising to operators and avoids accidental over-matching.
|
|
||||||
|
|
||||||
### 3. Header matching as a list of `{name, value, type}` objects
|
### 3. Header matching as a list of `{name, value, type}` objects
|
||||||
|
|
||||||
@@ -469,18 +466,22 @@ predicates.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Open questions
|
## Decisions
|
||||||
|
|
||||||
1. **Backward compatibility:** `path_allowlist` is the current field. If
|
The open questions raised during research were resolved in PR #196 review:
|
||||||
adopting a `match`/`matches` structure, keep `path_allowlist` as a
|
|
||||||
deprecated alias? Or treat this as a breaking manifest version bump?
|
1. **Backward compatibility:** Hard cutover. The new `matches` structure
|
||||||
2. **Glob segment semantics:** adopt shell convention (`*` = intra-segment,
|
replaces `path_allowlist` entirely with no compatibility shim and no
|
||||||
`**` = cross-segment) or ALB convention (`*` = anything including `/`)?
|
fallback parsing for the old format. Manifests using `path_allowlist`
|
||||||
The shell convention is safer; ALB's is simpler.
|
must be migrated.
|
||||||
3. **Header value OR:** Gateway API requires a separate match entry to OR
|
|
||||||
header values. ALB allows multiple values in one condition. Which is
|
2. **Glob support:** Dropped. Not strictly necessary — `regex` covers every
|
||||||
less surprising for bot-bottle operators? The ALB approach is more
|
case glob would handle. Fewer path-matching languages to document and
|
||||||
concise for the common case (e.g., `Content-Type: [application/json,
|
validate.
|
||||||
application/x-www-form-urlencoded]`).
|
|
||||||
4. **Case sensitivity on method names:** normalize to uppercase at parse
|
3. **Header value OR:** Stick with Gateway API. OR across header values
|
||||||
time (fail on unrecognised values) or case-insensitively?
|
requires a separate entry in the `matches` list, not multiple values
|
||||||
|
inside one `headers` block.
|
||||||
|
|
||||||
|
4. **Method name case:** Case-insensitive at parse time. `get`, `GET`, and
|
||||||
|
`Get` are all accepted and normalised to uppercase internally.
|
||||||
|
|||||||