docs: research on YAML route matching formats (paths, headers, methods)
2026-06-05 00:41:19 +00:00
parent f145203eee
commit 035ed430ba
@@ -0,0 +1,486 @@
# YAML route matching formats: paths, headers, and methods
## Question
Bot-bottle's egress manifest currently supports exact-host matching and
a flat list of path prefixes (`path_allowlist`). As the DLP work (PRD 0053)
and future route hardening evolve, we may want more expressive matching:
glob-style path patterns (`/api/*/data`), header predicates (Content-Type,
Accept), and per-method rules (GET allowed, POST blocked). What established
YAML-based formats exist for declaring this kind of route matching, and
which design choices should bot-bottle adopt?
## Summary
Four formats stand out as well-designed, widely deployed references:
**Kubernetes Gateway API `HTTPRoute`**, **Envoy `RouteConfiguration`**,
**AWS ALB listener rules**, and **Traefik dynamic routing**. A fifth,
Istio `VirtualService`, is worth noting but is largely superseded by
Gateway API for new designs.
**Recommendation for bot-bottle:** adopt the Gateway API `HTTPRoute`
match vocabulary as the primary model. It is the most carefully designed of
the four, has a published spec, covers paths, headers, and methods (glob-style
paths only via regex), and its match object nests naturally into a YAML route
block alongside bot-bottle's existing `host`, `path_allowlist`, and `auth`
fields. Envoy's format is more powerful but far more verbose and harder to
validate by hand; ALB rules use a flat predicate list that does not
compose well; Traefik uses string expressions rather than structured YAML.
## Current bot-bottle route schema
```yaml
egress:
  routes:
    - host: api.github.com
      path_allowlist:
        - /repos/myorg/
      auth:
        scheme: Bearer
        token_ref: EGRESS_TOKEN_0
```
Matching today: exact host + path-prefix list. No method or header
awareness.
---
## Format 1: Kubernetes Gateway API `HTTPRoute`
**Spec:** [gateway.networking.k8s.io/v1](https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.HTTPRouteMatch)
**Maturity:** GA (v1.0+, 2023). Backed by SIG Network; shipping in GKE,
EKS, AKS, Istio, Envoy Gateway, Cilium, Traefik v3.
### Match object
```yaml
rules:
  - matches:
      - path:
          type: Exact            # Exact | PathPrefix | RegularExpression
          value: /api/v1/data
        headers:
          - name: Content-Type
            type: Exact          # Exact | RegularExpression
            value: application/json
        queryParams:
          - name: version
            type: Exact
            value: "2"
        method: GET              # GET | POST | PUT | DELETE | PATCH | …
```
A `matches` entry is a logical AND across all predicates within it. Multiple
entries in the `matches` list are ORed: the rule fires if any entry matches.
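The AND-within-an-entry, OR-across-entries rule can be sketched in a few lines of Python. This is an illustration only, not code from any Gateway API implementation: `MatchEntry`, `entry_matches`, and `rule_matches` are hypothetical names, and the sketch handles only exact path, exact header, and method predicates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MatchEntry:
    """One entry of a `matches` list, reduced to exact predicates for brevity."""
    path_exact: Optional[str] = None
    method: Optional[str] = None
    # Header names are stored lowercase; values are compared exactly.
    headers: Dict[str, str] = field(default_factory=dict)


def entry_matches(entry: MatchEntry, method: str, path: str,
                  headers: Dict[str, str]) -> bool:
    """AND: every predicate that is set on the entry must hold."""
    if entry.path_exact is not None and path != entry.path_exact:
        return False
    if entry.method is not None and method != entry.method:
        return False
    lowered = {name.lower(): value for name, value in headers.items()}
    return all(lowered.get(name) == value for name, value in entry.headers.items())


def rule_matches(entries: List[MatchEntry], method: str, path: str,
                 headers: Dict[str, str]) -> bool:
    """OR: the rule fires as soon as any entry matches."""
    return any(entry_matches(entry, method, path, headers) for entry in entries)


entries = [
    MatchEntry(path_exact="/api/v1/data", method="GET"),
    MatchEntry(path_exact="/api/v1/data", method="POST",
               headers={"content-type": "application/json"}),
]
print(rule_matches(entries, "GET", "/api/v1/data", {}))   # True
print(rule_matches(entries, "POST", "/api/v1/data", {}))  # False
```

The second call fails because the POST entry also requires a `Content-Type` header, and the GET entry does not apply to POST.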
### Path matching
| `type` | Semantics |
|--------|-----------|
| `Exact` | Full path must equal `value` (no trailing-slash equivalence) |
| `PathPrefix` | Path must start with `value`; `/api` matches `/api/v1` but not `/apiv1` |
| `RegularExpression` | Regex match; the dialect (RE2, PCRE, ...) and anchoring are implementation-specific |
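The difference between a segment-aware prefix and a plain string prefix is easy to get wrong, so here is a minimal Python sketch of the `PathPrefix` row. The function name `path_prefix_matches` is hypothetical, and dropping empty segments (which also collapses repeated slashes) is a simplification made for this sketch.

```python
def path_prefix_matches(prefix: str, path: str) -> bool:
    """Compare element by element: `/api` matches `/api/v1` but not `/apiv1`."""
    # Empty segments are dropped, so a trailing slash on either side is ignored.
    prefix_segments = [segment for segment in prefix.split("/") if segment]
    path_segments = [segment for segment in path.split("/") if segment]
    return path_segments[:len(prefix_segments)] == prefix_segments


print(path_prefix_matches("/api", "/api/v1"))  # True
print(path_prefix_matches("/api", "/apiv1"))   # False
print("/apiv1".startswith("/api"))             # True: the naive check over-matches
```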
**Glob-style paths (`/api/*/data`):** Gateway API does not define a glob
type. The closest equivalent is `RegularExpression`:
`/api/[^/]+/data` replaces `/api/*/data`. Because regex support is
implementation-specific, check the target implementation's dialect and
anchoring before relying on it.
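Anchoring matters for a pattern like `/api/[^/]+/data`. The sketch below uses Python's `re` module purely as a stand-in (it is not RE2, and no Gateway API implementation is being quoted) to show how the same pattern behaves when matched against the whole path versus searched for anywhere in it. The helper names are hypothetical.

```python
import re

# Regex replacement for the glob /api/*/data, as suggested above.
PATTERN = re.compile(r"/api/[^/]+/data")


def matches_anchored(path: str) -> bool:
    """The whole path must match the pattern."""
    return PATTERN.fullmatch(path) is not None


def matches_unanchored(path: str) -> bool:
    """The pattern may match anywhere inside the path."""
    return PATTERN.search(path) is not None


print(matches_anchored("/api/users/data"))             # True
print(matches_anchored("/api/a/b/data"))               # False: `[^/]+` stays in one segment
print(matches_anchored("/v2/api/users/data/extra"))    # False
print(matches_unanchored("/v2/api/users/data/extra"))  # True: unanchored over-matches
```

If an implementation searches rather than full-matches, writing the pattern as `^/api/[^/]+/data$` makes the intent explicit.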
### Header matching
```yaml
headers:
  - name: Content-Type
    type: Exact
    value: application/json
  - name: X-Request-Id
    type: RegularExpression
    value: "[0-9a-f]{8}-.*"
```
All `headers` entries must match (AND semantics). A request that lacks a
listed header does not match: v1 has no "header absent" type, although
implementations may add one as an extension. Header names are compared
case-insensitively.
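A hedged Python sketch of these header semantics follows. `headers_match` is a hypothetical helper, the predicates are plain dicts shaped like the YAML above, and treating `RegularExpression` as a full match using Python's `re` is an assumption of this sketch rather than something the spec mandates.

```python
import re
from typing import Dict, List


def headers_match(predicates: List[dict], request_headers: Dict[str, str]) -> bool:
    """AND over all predicates; a missing header is a non-match."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    for predicate in predicates:
        actual = lowered.get(predicate["name"].lower())
        if actual is None:
            return False  # header absent: cannot satisfy any predicate type
        if predicate.get("type", "Exact") == "RegularExpression":
            # Assumption: the regex must cover the whole header value.
            if re.fullmatch(predicate["value"], actual) is None:
                return False
        elif actual != predicate["value"]:
            return False
    return True


predicates = [
    {"name": "Content-Type", "type": "Exact", "value": "application/json"},
    {"name": "X-Request-Id", "type": "RegularExpression", "value": "[0-9a-f]{8}-.*"},
]
print(headers_match(predicates, {"content-type": "application/json",
                                 "x-request-id": "deadbeef-1234"}))   # True
print(headers_match(predicates, {"content-type": "application/json"}))  # False
```

Note that an exact match on `Content-Type` rejects values that carry parameters, such as `application/json; charset=utf-8`.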
### Method matching
```yaml
method: GET
```
Single method per match entry. To allow GET and POST, use two match
entries (OR semantics at the matches level):
```yaml
matches:
  - path:
      type: PathPrefix
      value: /api/v1
    method: GET
  - path:
      type: PathPrefix
      value: /api/v1
    method: POST
```
### Strengths / weaknesses
**Strengths:** spec-backed, implementation-tested, composable AND/OR
semantics, explicit about what is not supported (no glob, no header-absent),
good field naming (`type` + `value` pattern is consistent throughout).
**Weaknesses:** verbosity when expressing OR across methods; regex is
the only path wildcard mechanism; no body matching.
---
## Format 2: Envoy `RouteConfiguration`
**Spec:** [envoy.config.route.v3.RouteMatch](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#config-route-v3-routematch)
**Maturity:** Widely deployed (Istio data plane, AWS App Mesh, solo.io
Gloo). Defined in protobuf; YAML is the human-readable rendering.
### Match object
```yaml
match:
  path: /exact/path                # exact match
  # OR
  prefix: /api/                    # prefix match
  # OR
  safe_regex:
    google_re2: {}
    regex: "/api/v[0-9]+/.*"
  # OR
  path_separated_prefix: /api/v1   # prefix with segment boundary enforcement
  headers:
    - name: content-type
      string_match:
        exact: application/json
        # OR
        prefix: text/
        # OR
        safe_regex:
          google_re2: {}
          regex: "application/(json|xml)"
      invert_match: false          # negate the predicate
    - name: x-custom-header
      present_match: true          # just check presence
  query_parameters:
    - name: version
      string_match:
        exact: "2"
```
Method is matched via a pseudo-header:
```yaml
headers:
  - name: ":method"
    string_match:
      exact: GET
```
`HeaderMatcher` has no OR combinator of its own, so matching several
methods in one route is usually done with a regex on `:method` (the
alternative is one route per method):
```yaml
headers:
  - name: ":method"
    string_match:
      safe_regex:
        google_re2: {}
        regex: "GET|POST"
```
### Path matching
| Field | Semantics |
|-------|-----------|
| `prefix` | Path starts with value (any suffix allowed) |
| `path` | Exact match |
| `safe_regex` | RE2 regex (Google RE2 safety guarantees) |
| `path_separated_prefix` | Like `prefix` but only matches at segment boundaries (`/api/v1` won't match `/api/v10`) |
| `connect_matcher` | CONNECT method only |
Glob (`/api/*/data`): use `safe_regex`: `/api/[^/]+/data`.
### Strengths / weaknesses
**Strengths:** most expressive format surveyed; `invert_match`, `present_match`,
regex alternation, pseudo-header method matching; handles nearly every edge case.
**Weaknesses:** very verbose; protobuf-origin field names are not
self-evident; OR across header values needs a regex or duplicated routes;
hard to validate in a lightweight schema check; not appropriate as a
user-facing YAML format without a wrapping DSL.
---
## Format 3: AWS ALB Listener Rules
**Spec:** [AWS Elastic Load Balancing API — Conditions](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html#rule-condition-types)
**Maturity:** GA, widely used in AWS infrastructure-as-code (CloudFormation,
Terraform `aws_lb_listener_rule`).
### Match object (snake_case YAML rendering of the rule-condition structure)
```yaml
conditions:
  - field: path-pattern
    path_pattern_config:
      values:
        - /api/*
        - /health
  - field: http-header
    http_header_config:
      http_header_name: Content-Type
      values:
        - application/json
        - application/x-www-form-urlencoded
  - field: http-request-method
    http_request_method_config:
      values:
        - GET
        - POST
  - field: host-header
    host_header_config:
      values:
        - "*.example.com"
        - api.example.com
  - field: query-string
    query_string_config:
      values:
        - key: version
          value: "2"
```
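All conditions in a rule are ANDed. Multiple values within a single
condition are ORed. A rule allows at most five match evaluations (condition
values) in total and at most three per condition, so the block above, which
lists every condition type for illustration, would have to be split across
several rules in practice.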
### Path matching
ALB natively supports glob patterns in `path-pattern`:
- `*` matches any sequence of characters (including `/`).
- `?` matches any single character.
This is the only surveyed format with first-class glob support. `/api/*/data`
is valid and unambiguous. No regex support.
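To make the "`*` crosses `/`" behaviour concrete, the following Python sketch translates an ALB-style pattern into a regular expression. It is an approximation written for this document, not AWS code: `alb_glob_matches` is a hypothetical name, and the only semantics modelled are the two wildcards described above plus case-sensitive literal matching.

```python
import re


def alb_glob_matches(pattern: str, path: str) -> bool:
    """ALB-style glob: `*` = zero or more characters, `?` = exactly one character."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")   # crosses `/` boundaries
        elif char == "?":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.fullmatch("".join(parts), path) is not None


print(alb_glob_matches("/api/*/data", "/api/users/data"))  # True
print(alb_glob_matches("/api/*/data", "/api/a/b/data"))    # True: `*` spans two segments
print(alb_glob_matches("/img/?.png", "/img/ab.png"))       # False: `?` is one character
```

The second result is the over-matching risk discussed later in this document: an operator who expects `*` to stay within one segment gets a broader rule than intended.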
### Header matching
Header conditions match against the header value. Multiple values are ORed.
The header name is fixed per condition block; to AND two header predicates,
add two separate `http-header` conditions. Case-insensitive matching on
values.
### Method matching
```yaml
- field: http-request-method
  http_request_method_config:
    values:
      - GET
      - POST
```
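Multiple values are ORed (GET or POST). Each method name may be up to 40
characters and is compared case-sensitively; the number of methods is bounded
by the per-rule limit on match evaluations.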
### Strengths / weaknesses
**Strengths:** first-class glob path matching (the only format surveyed
with `*` and `?`); multi-value OR within a condition block is concise for
the common case; method matching is a flat list, easy to write.
**Weaknesses:** maximum of five match evaluations per rule; no regex; no header-absent
predicate; no request-body matching; the `field` + `*_config` naming is
awkward (the field name is a string enum that determines which sibling key
is relevant — a schema-validation anti-pattern); tied to AWS semantics
(target groups, priority integers).
---
## Format 4: Traefik Dynamic Routing
**Spec:** [Traefik Router Rule syntax](https://doc.traefik.io/traefik/routing/routers/#rule)
**Maturity:** GA, widely deployed in Kubernetes (IngressRoute CRD) and
Docker-Compose setups. Traefik v3 aligns with Gateway API for Kubernetes
routes but keeps its own expression syntax for the `rule` field.
### Match expression (string, embedded in YAML)
```yaml
http:
  routers:
    my-router:
      rule: >
        Host(`api.example.com`) &&
        PathPrefix(`/api/v1`) &&
        (Method(`GET`) || Method(`POST`)) &&
        Header(`Content-Type`, `application/json`)
      service: my-service
```
`&&` = AND, `||` = OR, `!` = NOT. Parentheses for grouping.
Available matchers:
| Matcher | Example |
|---------|---------|
| `Host` | `Host("api.example.com")` |
| `HostRegexp` | `HostRegexp(".*\.example\.com")` |
| `Path` | `Path("/exact/path")` |
| `PathPrefix` | `PathPrefix("/api/v1")` |
| `PathRegexp` | `PathRegexp("/api/v[0-9]+/.*")` |
| `Method` | `Method("GET")` (one value per call in v3; combine with `\|\|`) |
| `Header` | `Header("Content-Type", "application/json")` |
| `HeaderRegexp` | `HeaderRegexp("Accept", "application/.*")` |
| `Query` | `Query("version", "2")` |
| `QueryRegexp` | `QueryRegexp("id", "[0-9]+")` |
| `ClientIP` | `ClientIP("10.0.0.0/8")` |
Glob paths: not supported directly. Use `PathRegexp` instead.
### Strengths / weaknesses
**Strengths:** the most concise format for complex boolean combinations
(AND/OR/NOT in a single line); regex variants for the host, path, header,
and query matchers; Traefik v3 supports this inside Kubernetes CRDs. Note
that v3 matchers such as `Method` take a single value, so a multi-method
rule is written `Method("GET") || Method("POST")`; the multi-argument form
`Method("GET", "POST")` is v2 syntax.
**Weaknesses:** the rule is a *string* embedded in YAML, not a structured
object — it cannot be validated with JSON Schema and is harder to generate
programmatically; no structured round-trip; no glob, only regex.
---
## Comparison table
| | Gateway API | Envoy | AWS ALB | Traefik |
|---|---|---|---|---|
| **Path: exact** | ✅ `Exact` | ✅ `path` | ✅ exact value | ✅ `Path()` |
| **Path: prefix** | ✅ `PathPrefix` | ✅ `prefix` / `path_separated_prefix` | ✅ (via glob `/*`) | ✅ `PathPrefix()` |
| **Path: glob** (`/a/*/b`) | ❌ (use regex) | ❌ (use regex) | ✅ native | ❌ (use regex) |
| **Path: regex** | ✅ `RegularExpression` | ✅ `safe_regex` | ❌ | ✅ `PathRegexp()` |
| **Header: exact** | ✅ | ✅ | ✅ | ✅ |
| **Header: regex** | ✅ | ✅ | ❌ | ✅ |
| **Header: absent** | ❌ (extension) | ✅ `present_match: false` | ❌ | ❌ |
| **Method matching** | ✅ (one per entry; OR via multiple entries) | ✅ (via `:method` pseudo-header) | ✅ (list = OR) | ✅ `Method()` (one per call; OR via `\|\|`) |
| **AND semantics** | predicates within one `matches` entry | all conditions | all `conditions` entries | `&&` operator |
| **OR semantics** | multiple `matches` entries | regex alternation or multiple routes | multiple values in one condition | `\|\|` operator |
| **Schema-validatable** | ✅ (CRD/JSON Schema) | ✅ (protobuf) | ✅ (CloudFormation schema) | ❌ (embedded string) |
| **Human-writable** | ✅ | ⚠️ verbose | ✅ | ✅ |
| **Generatable** | ✅ | ✅ | ✅ | ⚠️ (string concat) |
---
## Design choices worth adopting
### 1. Match object as a structured peer to `host`
Gateway API's separation of concerns maps well onto bot-bottle's existing
schema. Instead of a flat `path_allowlist`, a `match` block nests all
predicates:
```yaml
egress:
  routes:
    - host: api.github.com
      match:
        paths:
          - type: prefix        # exact | prefix | glob | regex
            value: /repos/myorg/
        headers:
          - name: Content-Type
            value: application/json
        methods: [GET, POST]
      auth:
        scheme: Bearer
        token_ref: EGRESS_TOKEN_0
```
All predicates within `match` are ANDed. A list of `paths` entries is
ORed (first match wins — same as the current `path_allowlist` semantics).
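One way to read these semantics is the Python sketch below, which evaluates a parsed `match` block against a request. Everything here is hypothetical: `route_match` and `path_entry_matches` are illustrative names, only the `exact` and `prefix` path types are handled, and `prefix` is assumed to be a plain string prefix as in today's `path_allowlist`.

```python
from typing import Dict


def path_entry_matches(entry: dict, path: str) -> bool:
    """Evaluate one `paths` entry (only exact and prefix in this sketch)."""
    if entry["type"] == "exact":
        return path == entry["value"]
    if entry["type"] == "prefix":
        return path.startswith(entry["value"])  # assumed string-prefix semantics
    raise ValueError(f"path type not covered by this sketch: {entry['type']!r}")


def route_match(match: dict, method: str, path: str, headers: Dict[str, str]) -> bool:
    """AND across paths/methods/headers; OR within the `paths` list."""
    paths = match.get("paths", [])
    if paths and not any(path_entry_matches(entry, path) for entry in paths):
        return False
    methods = match.get("methods", [])
    if methods and method not in methods:
        return False
    lowered = {name.lower(): value for name, value in headers.items()}
    return all(lowered.get(header["name"].lower()) == header["value"]
               for header in match.get("headers", []))


match = {
    "paths": [{"type": "prefix", "value": "/repos/myorg/"}],
    "headers": [{"name": "Content-Type", "value": "application/json"}],
    "methods": ["GET", "POST"],
}
json_headers = {"content-type": "application/json"}
print(route_match(match, "GET", "/repos/myorg/widgets", json_headers))     # True
print(route_match(match, "DELETE", "/repos/myorg/widgets", json_headers))  # False
```

An absent predicate imposes no restriction in this sketch, so an empty `match` block accepts every request to the host.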
### 2. Path type enum (`exact` | `prefix` | `glob` | `regex`)
Use four named types rather than inferring from the value's syntax. This
avoids the ambiguity that plagues `.gitignore` and `nginx location` patterns
where the same string can mean different things depending on leading characters.
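- `exact`: the full path must equal `value`.
- `prefix`: mirrors current `path_allowlist` semantics.
- `glob`: `*` (single segment) and `**` (multi-segment), covering
`/api/*/data` and `/api/**/data`. This borrows the idea of first-class globs
from ALB but not its `*` semantics (see below). Simpler for operators than
writing regex.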
- `regex`: RE2 for advanced cases. Reject at load time if the pattern fails
to compile.
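Load-time validation of a path entry could look roughly like the sketch below. It is an assumption-laden illustration: `validate_path_entry` and `PATH_TYPES` are hypothetical names, the leading-slash check is an extra rule invented for the example, and Python's `re.compile` stands in for an RE2 compile step (the two dialects are not identical).

```python
import re
from typing import List

PATH_TYPES = {"exact", "prefix", "glob", "regex"}


def validate_path_entry(entry: dict) -> List[str]:
    """Return a list of problems; an empty list means the entry is accepted."""
    errors = []
    kind = entry.get("type")
    value = entry.get("value")
    if kind not in PATH_TYPES:
        errors.append(f"unknown path type: {kind!r}")
    if not isinstance(value, str) or not value:
        errors.append("value must be a non-empty string")
    elif kind == "regex":
        try:
            re.compile(value)  # stand-in for compiling with RE2
        except re.error as exc:
            errors.append(f"invalid regex: {exc}")
    elif kind in PATH_TYPES and not value.startswith("/"):
        errors.append("value must start with '/'")
    return errors


print(validate_path_entry({"type": "prefix", "value": "/repos/myorg/"}))  # []
print(validate_path_entry({"type": "wildcard", "value": "/x"}))
```

Collecting every problem instead of raising on the first one lets a manifest loader report all invalid routes in a single pass.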
**Glob semantics decision needed:** ALB's `*` matches across `/`; most
shell-glob conventions treat `*` as intra-segment and `**` as cross-segment.
The shell convention (`*` = no slash, `**` = slash-crossing) is less
surprising to operators and avoids accidental over-matching.
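A minimal sketch of the shell-style convention, assuming `*` means "one or more characters within a segment" and `**` means "anything, including `/`". The function name `glob_to_regex` is hypothetical, and whether `*` may match an empty segment is a design detail this sketch decides arbitrarily.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a shell-style path glob into a compiled regular expression."""
    parts = []
    index = 0
    while index < len(pattern):
        if pattern.startswith("**", index):
            parts.append(".*")     # cross-segment wildcard
            index += 2
        elif pattern[index] == "*":
            parts.append("[^/]+")  # intra-segment wildcard, never crosses `/`
            index += 1
        else:
            parts.append(re.escape(pattern[index]))
            index += 1
    return re.compile("".join(parts))


single = glob_to_regex("/api/*/data")
multi = glob_to_regex("/api/**/data")
print(single.fullmatch("/api/users/data") is not None)  # True
print(single.fullmatch("/api/a/b/data") is not None)    # False
print(multi.fullmatch("/api/a/b/data") is not None)     # True
```

Compiling the glob once at load time and matching with `fullmatch` keeps request-time evaluation to a single regex call.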
### 3. Header matching as a list of `{name, value, type}` objects
Mirrors Gateway API's header match structure, with lowercase type names.
ALL headers must match (AND). `type` defaults to `exact`; `regex` is
available. No header-absent for now (adds complexity, low immediate need).
```yaml
headers:
  - name: Content-Type
    value: application/json   # type: exact (default)
  - name: X-Internal-Key
    value: "dev-[0-9]+"
    type: regex
```
### 4. Method list as a flat enum list
Adopts ALB's conciseness. An empty or absent `methods` list means all
methods are permitted. Values are uppercased HTTP method names.
```yaml
methods: [GET, HEAD]
```
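Parsing the `methods` list might be sketched as follows. This assumes the "normalize to uppercase and reject unknown names" option, which is only one of the choices still open; `normalize_methods` and the `KNOWN_METHODS` set are hypothetical and would need to be agreed on.

```python
from typing import List, Optional

# Assumed allowlist of method names; extend as needed.
KNOWN_METHODS = frozenset(
    {"GET", "HEAD", "POST", "PUT", "PATCH", "DELETE", "OPTIONS", "CONNECT", "TRACE"}
)


def normalize_methods(raw: Optional[list]) -> Optional[List[str]]:
    """Return uppercased, de-duplicated methods, or None for "no restriction"."""
    if not raw:
        return None  # empty or absent list: all methods permitted
    normalized: List[str] = []
    for item in raw:
        method = str(item).upper()
        if method not in KNOWN_METHODS:
            raise ValueError(f"unrecognised HTTP method: {item!r}")
        if method not in normalized:
            normalized.append(method)
    return normalized


print(normalize_methods(["get", "HEAD", "Get"]))  # ['GET', 'HEAD']
print(normalize_methods([]))                      # None
```

Returning `None` rather than an empty list keeps "no restriction" distinct from "no method allowed" in downstream code.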
### 5. Multiple `match` entries per route: OR semantics at the route level
If a route needs GET on one path and POST on a different path, use a
`matches` (plural) list where entries are ORed:
```yaml
routes:
  - host: api.example.com
    matches:
      - paths: [{type: prefix, value: /read}]
        methods: [GET, HEAD]
      - paths: [{type: exact, value: /write}]
        methods: [POST, PUT]
```
This mirrors Gateway API's top-level OR; each entry is an AND of its
predicates.
---
## Open questions
1. **Backward compatibility:** `path_allowlist` is the current field. If
adopting a `match`/`matches` structure, keep `path_allowlist` as a
deprecated alias? Or treat this as a breaking manifest version bump?
2. **Glob segment semantics:** adopt shell convention (`*` = intra-segment,
`**` = cross-segment) or ALB convention (`*` = anything including `/`)?
The shell convention is safer; ALB's is simpler.
3. **Header value OR:** Gateway API requires a separate match entry to OR
header values. ALB allows multiple values in one condition. Which is
less surprising for bot-bottle operators? The ALB approach is more
concise for the common case (e.g., `Content-Type: [application/json,
application/x-www-form-urlencoded]`).
4. **Case sensitivity on method names:** normalize to uppercase at parse
time (failing on unrecognised values), or compare case-insensitively at
match time?
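For the backward-compatibility question, one possible approach is to upgrade legacy routes in memory at load time. The sketch below is hypothetical throughout: `normalize_route` is an invented name, and it assumes `path_allowlist` entries map one-to-one onto `prefix` path entries and that mixing old and new fields should be rejected.

```python
MATCH_FIELDS = ("path_allowlist", "match", "matches")


def normalize_route(route: dict) -> dict:
    """Return a copy of `route` whose matching rules live under `matches`."""
    present = [key for key in MATCH_FIELDS if key in route]
    if len(present) > 1:
        raise ValueError(f"route mixes matching fields: {present}")
    normalized = {key: value for key, value in route.items() if key not in MATCH_FIELDS}
    if "matches" in route:
        normalized["matches"] = list(route["matches"])
    elif "match" in route:
        normalized["matches"] = [route["match"]]
    elif "path_allowlist" in route:
        # Legacy form: each allowlisted prefix becomes one ORed path entry.
        normalized["matches"] = [
            {"paths": [{"type": "prefix", "value": prefix}
                       for prefix in route["path_allowlist"]]}
        ]
    return normalized  # host-only routes are left without a `matches` key


legacy = {"host": "api.github.com", "path_allowlist": ["/repos/myorg/"]}
print(normalize_route(legacy))
```

With this shape the matcher only ever sees `matches`, so `path_allowlist` could be kept as a deprecated alias without a second code path.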