From 035ed430ba851ca1bffb19e8d0d12bcf8a53dc21 Mon Sep 17 00:00:00 2001
From: claude
Date: Fri, 5 Jun 2026 00:41:19 +0000
Subject: [PATCH] docs: research on YAML route matching formats (paths, headers, methods)

---
 docs/research/yaml-route-matching-formats.md | 486 +++++++++++++++++++
 1 file changed, 486 insertions(+)
 create mode 100644 docs/research/yaml-route-matching-formats.md

diff --git a/docs/research/yaml-route-matching-formats.md b/docs/research/yaml-route-matching-formats.md
new file mode 100644
index 0000000..2af2bb7
--- /dev/null
+++ b/docs/research/yaml-route-matching-formats.md
@@ -0,0 +1,486 @@
+# YAML route matching formats: paths, headers, and methods
+
+## Question
+
+Bot-bottle's egress manifest currently supports exact-host matching and
+a flat list of path prefixes (`path_allowlist`). As the DLP work (PRD 0053)
+and future route hardening evolve, we may want more expressive matching:
+glob-style path patterns (`/api/*/data`), header predicates (Content-Type,
+Accept), and per-method rules (GET allowed, POST blocked). What established
+YAML-based formats exist for declaring this kind of route matching, and
+which design choices should bot-bottle adopt?
+
+## Summary
+
+Four formats stand out as well-designed, widely deployed references:
+**Kubernetes Gateway API `HTTPRoute`**, **Envoy `RouteConfiguration`**,
+**AWS ALB listener rules**, and **Traefik dynamic routing**. A fifth,
+Istio `VirtualService`, is worth noting but is largely superseded by
+Gateway API for new designs.
+
+**Recommendation for bot-bottle:** adopt the Gateway API `HTTPRoute`
+match vocabulary as a direct model. It is the most carefully designed of
+the four, has a published spec, handles all three requirements cleanly, and
+its match object nests naturally into a YAML route block alongside
+bot-bottle's existing `host`, `path_allowlist`, and `auth` fields.
+Envoy's format is more powerful but far more verbose and harder to
+validate by hand; ALB rules use a flat predicate list that does not
+compose well; Traefik uses string expressions rather than structured YAML.
+
+## Current bot-bottle route schema
+
+```yaml
+egress:
+  routes:
+    - host: api.github.com
+      path_allowlist:
+        - /repos/myorg/
+      auth:
+        scheme: Bearer
+        token_ref: EGRESS_TOKEN_0
+```
+
+Matching today: exact host + path-prefix list. No method or header
+awareness.
+
+---
+
+## Format 1: Kubernetes Gateway API `HTTPRoute`
+
+**Spec:** [gateway.networking.k8s.io/v1](https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.HTTPRouteMatch)
+**Maturity:** GA (v1.0+, 2023). Backed by SIG Network; shipping in GKE,
+EKS, AKS, Istio, Envoy Gateway, Cilium, Traefik v3.
+
+### Match object
+
+```yaml
+rules:
+  - matches:
+      - path:
+          type: Exact          # Exact | PathPrefix | RegularExpression
+          value: /api/v1/data
+        headers:
+          - name: Content-Type
+            type: Exact        # Exact | RegularExpression
+            value: application/json
+        queryParams:
+          - name: version
+            type: Exact
+            value: "2"
+        method: GET            # GET | POST | PUT | DELETE | PATCH | …
+```
+
+A `matches` entry is a logical AND across all predicates within it. Multiple
+entries in the `matches` list are ORed: the rule fires if any entry matches.
+
+### Path matching
+
+| `type` | Semantics |
+|--------|-----------|
+| `Exact` | Full path must equal `value` (no trailing-slash equivalence) |
+| `PathPrefix` | Path must start with `value`, compared segment by segment; `/api` matches `/api/v1` but not `/apiv1` |
+| `RegularExpression` | Regex match; the dialect and anchoring are implementation-specific (RE2 is common in Envoy-based implementations) |
+
+**Glob-style paths (`/api/*/data`):** Gateway API does not define a glob
+type. The intent is to use `RegularExpression` for that case:
+`/api/[^/]+/data` replaces `/api/*/data`. The regex form is unambiguous,
+although the supported dialect depends on the implementation.
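The glob-to-regex rewrite described above can be sketched in Python. This is only an illustration: `single_segment_glob_to_regex` is a hypothetical helper, not part of Gateway API, and Python's `re` module stands in for whatever regex dialect a given implementation uses. Anchoring the whole path with `fullmatch` is also an assumption, since anchoring varies between implementations.

```python
import re


def single_segment_glob_to_regex(pattern: str) -> str:
    """Translate a glob in which `*` stands for one non-empty path segment."""
    # Escape the literal pieces, then join them with "one or more non-slash characters".
    return "[^/]+".join(re.escape(piece) for piece in pattern.split("*"))


regex = single_segment_glob_to_regex("/api/*/data")
compiled = re.compile(regex)

# fullmatch anchors both ends of the path (an assumption of this sketch).
matches_one_segment = compiled.fullmatch("/api/users/data") is not None
matches_two_segments = compiled.fullmatch("/api/a/b/data") is not None
```

With this translation a single `*` never crosses a `/`, which is the behaviour the `/api/[^/]+/data` replacement encodes.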
+
+### Header matching
+
+```yaml
+headers:
+  - name: Content-Type
+    type: Exact
+    value: application/json
+  - name: X-Request-Id
+    type: RegularExpression
+    value: "[0-9a-f]{8}-.*"
+```
+
+All `headers` entries must match (AND semantics). Missing a header is a
+non-match (no "header absent" type in v1; implementations add it as an
+extension).
+
+### Method matching
+
+```yaml
+method: GET
+```
+
+Single method per match entry. To allow GET and POST, use two match
+entries (OR semantics at the matches level):
+
+```yaml
+matches:
+  - path:
+      type: PathPrefix
+      value: /api/v1
+    method: GET
+  - path:
+      type: PathPrefix
+      value: /api/v1
+    method: POST
+```
+
+### Strengths / weaknesses
+
+**Strengths:** spec-backed, implementation-tested, composable AND/OR
+semantics, explicit about what is not supported (no glob, no header-absent),
+good field naming (`type` + `value` pattern is consistent throughout).
+
+**Weaknesses:** verbosity when expressing OR across methods; regex is
+the only path wildcard mechanism; no body matching.
+
+---
+
+## Format 2: Envoy `RouteConfiguration`
+
+**Spec:** [envoy.config.route.v3.RouteMatch](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#config-route-v3-routematch)
+**Maturity:** Widely deployed (Istio data plane, AWS App Mesh, solo.io
+Gloo). Defined in protobuf; YAML is the human-readable rendering.
+
+### Match object
+
+```yaml
+match:
+  path: /exact/path               # exact match
+  # OR
+  prefix: /api/                   # prefix match
+  # OR
+  safe_regex:
+    google_re2: {}
+    regex: "/api/v[0-9]+/.*"
+  # OR
+  path_separated_prefix: /api/v1  # prefix with segment boundary enforcement
+
+  headers:
+    - name: content-type
+      string_match:
+        exact: application/json
+        # OR
+        prefix: text/
+        # OR
+        safe_regex:
+          google_re2: {}
+          regex: "application/(json|xml)"
+      invert_match: false         # set to true to negate the predicate
+
+    - name: x-custom-header
+      present_match: true         # just check presence
+
+  query_parameters:
+    - name: version
+      string_match:
+        exact: "2"
+```
+
+The `# OR` comments mark mutually exclusive alternatives: a real `match`
+sets exactly one path specifier, and each `string_match` sets one pattern.
+
+Method is matched via a pseudo-header:
+
+```yaml
+headers:
+  - name: ":method"
+    string_match:
+      exact: GET
+```
+
+All entries in `headers` are ANDed, and the header matcher has no OR
+combinator of its own. Matching several methods in one route therefore
+uses regex alternation (or one route per method):
+
+```yaml
+headers:
+  - name: ":method"
+    string_match:
+      safe_regex:
+        google_re2: {}
+        regex: "GET|POST"
+```
+
+### Path matching
+
+| Field | Semantics |
+|-------|-----------|
+| `prefix` | Path starts with value (any suffix allowed) |
+| `path` | Exact match |
+| `safe_regex` | RE2 regex (Google RE2 safety guarantees) |
+| `path_separated_prefix` | Like `prefix` but only matches at segment boundaries (`/api/v1` won't match `/api/v10`) |
+| `connect_matcher` | CONNECT method only |
+
+Glob (`/api/*/data`): use `safe_regex`: `/api/[^/]+/data`.
+
+### Strengths / weaknesses
+
+**Strengths:** most expressive format surveyed; `invert_match`, `present_match`,
+segment-aware prefixes, pseudo-header method matching; covers nearly every
+edge case.
+
+**Weaknesses:** very verbose; protobuf-origin field names are not
+self-evident; OR across header values needs regex alternation; hard to
+validate in a lightweight schema check; not appropriate as a user-facing
+YAML format without a wrapping DSL.
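The difference between Envoy's `prefix` and `path_separated_prefix` rows in the table above can be sketched as follows. The two function names are hypothetical and the sketch ignores details a real proxy handles (query strings, case-sensitivity settings); it only illustrates the segment-boundary rule.

```python
def prefix_match(path: str, prefix: str) -> bool:
    # Plain string prefix: any suffix is accepted.
    return path.startswith(prefix)


def path_separated_prefix_match(path: str, prefix: str) -> bool:
    # Accept the prefix itself, or the prefix followed by a "/" boundary.
    return path == prefix or path.startswith(prefix + "/")


plain = prefix_match("/api/v10", "/api/v1")
separated = path_separated_prefix_match("/api/v10", "/api/v1")
```

Here `plain` is true while `separated` is false, which is the over-matching that a segment-aware prefix avoids.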
+
+---
+
+## Format 3: AWS ALB Listener Rules
+
+**Spec:** [AWS Elastic Load Balancing API: Conditions](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html#rule-condition-types)
+**Maturity:** GA, widely used in AWS infrastructure-as-code (CloudFormation,
+Terraform `aws_lb_listener_rule`).
+
+### Match object (Terraform / CloudFormation rendering)
+
+```yaml
+conditions:
+  - field: path-pattern
+    path_pattern_config:
+      values:
+        - /api/*
+        - /health
+  - field: http-header
+    http_header_config:
+      http_header_name: Content-Type
+      values:
+        - application/json
+        - application/x-www-form-urlencoded
+  - field: http-request-method
+    http_request_method_config:
+      values:
+        - GET
+        - POST
+  - field: host-header
+    host_header_config:
+      values:
+        - "*.example.com"
+        - api.example.com
+  - field: query-string
+    query_string_config:
+      values:
+        - key: version
+          value: "2"
+```
+
+All conditions in a rule are ANDed. Multiple values within a single
+condition are ORed. ALB also caps the number of condition values per rule
+(five, counted across all conditions), so the block above illustrates the
+vocabulary rather than one deployable rule.
+
+### Path matching
+
+ALB natively supports glob patterns in `path-pattern`:
+- `*` matches any sequence of characters (including `/`).
+- `?` matches any single character.
+
+This is the only surveyed format with first-class glob support. `/api/*/data`
+is valid and unambiguous. No regex support.
+
+### Header matching
+
+Header conditions match against the header value. Multiple values are ORed.
+The header name is fixed per condition block; to AND two header predicates,
+add two separate `http-header` conditions. Case-insensitive matching on
+values.
+
+### Method matching
+
+```yaml
+- field: http-request-method
+  http_request_method_config:
+    values:
+      - GET
+      - POST
+```
+
+Multiple values are ORed (GET or POST). Each method counts toward the
+per-rule limit on condition values.
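The ALB evaluation model described above (conditions ANDed, values within a condition ORed, `*` crossing `/`) can be sketched in Python. The function names and the dictionary-shaped request are assumptions made for this illustration; they are not AWS APIs, and the sketch covers only path and method conditions.

```python
import re


def alb_wildcard_to_regex(pattern: str) -> re.Pattern:
    # `*` matches any run of characters (including "/"), `?` matches exactly one.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def condition_matches(field: str, values: list, request: dict) -> bool:
    actual = request.get(field)
    if actual is None:
        return False
    if field == "path-pattern":
        # Values inside one condition are ORed.
        return any(alb_wildcard_to_regex(v).fullmatch(actual) for v in values)
    # Other fields use plain equality in this sketch.
    return actual in values


def rule_matches(conditions: list, request: dict) -> bool:
    # All conditions of a rule are ANDed.
    return all(condition_matches(f, vs, request) for f, vs in conditions)


conditions = [
    ("path-pattern", ["/api/*", "/health"]),
    ("http-request-method", ["GET", "POST"]),
]
allowed = rule_matches(
    conditions, {"path-pattern": "/api/v1/users", "http-request-method": "GET"}
)
```

Because `*` is translated to `.*`, the pattern `/api/*` also accepts deeper paths such as `/api/v1/users`, which is the cross-segment behaviour that distinguishes ALB globs from shell-style globs.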
+
+### Strengths / weaknesses
+
+**Strengths:** first-class glob path matching (the only format surveyed
+with `*` and `?`); multi-value OR within a condition block is concise for
+the common case; method matching is a flat list, easy to write.
+
+**Weaknesses:** tight per-rule limit on condition values; no regex; no
+header-absent predicate; no request-body matching; the `field` +
+`*_config` naming is awkward (the field name is a string enum that
+determines which sibling key is relevant, a schema-validation
+anti-pattern); tied to AWS semantics (target groups, priority integers).
+
+---
+
+## Format 4: Traefik Dynamic Routing
+
+**Spec:** [Traefik Router Rule syntax](https://doc.traefik.io/traefik/routing/routers/#rule)
+**Maturity:** GA, widely deployed in Kubernetes (IngressRoute CRD) and
+Docker-Compose setups. Traefik v3 aligns with Gateway API for Kubernetes
+routes but keeps its own expression syntax for the `rule` field.
+
+### Match expression (string, embedded in YAML)
+
+```yaml
+http:
+  routers:
+    my-router:
+      rule: >
+        Host(`api.example.com`) &&
+        PathPrefix(`/api/v1`) &&
+        (Method(`GET`) || Method(`POST`)) &&
+        Header(`Content-Type`, `application/json`)
+      service: my-service
+```
+
+`&&` = AND, `||` = OR, `!` = NOT. Parentheses for grouping. In the v3 rule
+syntax each matcher takes a single value (key/value matchers such as
+`Header` and `Query` take two arguments), so alternatives are combined
+with `||`; the v2 syntax accepted several values per matcher.
+
+Available matchers:
+
+| Matcher | Example |
+|---------|---------|
+| `Host` | `Host("api.example.com")` |
+| `HostRegexp` | `HostRegexp(".*\.example\.com")` |
+| `Path` | `Path("/exact/path")` |
+| `PathPrefix` | `PathPrefix("/api/v1")` |
+| `PathRegexp` | `PathRegexp("/api/v[0-9]+/.*")` |
+| `Method` | `Method("GET")` |
+| `Header` | `Header("Content-Type", "application/json")` |
+| `HeaderRegexp` | `HeaderRegexp("Accept", "application/.*")` |
+| `Query` | `Query("version", "2")` |
+| `QueryRegexp` | `QueryRegexp("id", "[0-9]+")` |
+| `ClientIP` | `ClientIP("10.0.0.0/8")` |
+
+Glob paths: not supported directly. Use `PathRegexp` instead.
+
+### Strengths / weaknesses
+
+**Strengths:** the most expressive and concise format for complex boolean
+combinations (AND/OR/NOT in a single line); multi-method rules stay
+compact via `||`; regex variants exist for host, path, header, and query
+matchers; Traefik v3 supports this inside Kubernetes CRDs.
+
+**Weaknesses:** the rule is a *string* embedded in YAML, not a structured
+object, so it cannot be validated with JSON Schema and is harder to generate
+programmatically; no structured round-trip; no glob, only regex.
+
+---
+
+## Comparison table
+
+| | Gateway API | Envoy | AWS ALB | Traefik |
+|---|---|---|---|---|
+| **Path: exact** | ✅ `Exact` | ✅ `path` | ✅ exact value | ✅ `Path()` |
+| **Path: prefix** | ✅ `PathPrefix` | ✅ `prefix` / `path_separated_prefix` | ✅ (via glob `/*`) | ✅ `PathPrefix()` |
+| **Path: glob** (`/a/*/b`) | ❌ (use regex) | ❌ (use regex) | ✅ native | ❌ (use regex) |
+| **Path: regex** | ✅ `RegularExpression` | ✅ `safe_regex` | ❌ | ✅ `PathRegexp()` |
+| **Header: exact** | ✅ | ✅ | ✅ | ✅ |
+| **Header: regex** | ✅ | ✅ | ❌ | ✅ |
+| **Header: absent** | ❌ (extension) | ✅ `present_match: false` | ❌ | ❌ |
+| **Method matching** | ✅ (one per entry; OR via multiple entries) | ✅ (via `:method` pseudo-header) | ✅ (list = OR) | ✅ `Method()` (OR via `\|\|`) |
+| **AND semantics** | predicates within one `matches` entry | all fields of one `match` | all `conditions` entries | `&&` operator |
+| **OR semantics** | multiple `matches` entries | regex alternation or separate routes | multiple values in one condition | `\|\|` operator |
+| **Schema-validatable** | ✅ (CRD/JSON Schema) | ✅ (protobuf) | ✅ (CloudFormation schema) | ❌ (embedded string) |
+| **Human-writable** | ✅ | ⚠️ verbose | ✅ | ✅ |
+| **Generatable** | ✅ | ✅ | ✅ | ⚠️ (string concat) |
+
+---
+
+## Design choices worth adopting
+
+### 1. Match object as a structured peer to `host`
+
+Gateway API's separation of concerns maps well onto bot-bottle's existing
+schema.
+Instead of a flat `path_allowlist`, a `match` block nests all
+predicates:
+
+```yaml
+egress:
+  routes:
+    - host: api.github.com
+      match:
+        paths:
+          - type: prefix    # exact | prefix | glob | regex
+            value: /repos/myorg/
+        headers:
+          - name: Content-Type
+            value: application/json
+        methods: [GET, POST]
+      auth:
+        scheme: Bearer
+        token_ref: EGRESS_TOKEN_0
+```
+
+All predicates within `match` are ANDed. A list of `paths` entries is
+ORed (first match wins, the same as the current `path_allowlist` semantics).
+
+### 2. Path type enum (`exact` | `prefix` | `glob` | `regex`)
+
+Use four named types rather than inferring from the value's syntax. This
+avoids the ambiguity that plagues `.gitignore` and `nginx location` patterns
+where the same string can mean different things depending on leading characters.
+
+- `exact`: the full path must equal `value`.
+- `prefix`: mirrors current `path_allowlist` semantics.
+- `glob`: first-class globs as in ALB, but with `*` limited to a single
+  segment and `**` spanning segments (covering `/api/*/data` and
+  `/api/**/data`). Simpler for operators than writing regex.
+- `regex`: RE2 for advanced cases. Reject at load time if the pattern fails
+  to compile.
+
+**Glob semantics decision needed:** ALB's `*` matches across `/`; most
+shell-glob conventions treat `*` as intra-segment and `**` as cross-segment.
+The shell convention (`*` = no slash, `**` = slash-crossing) is less
+surprising to operators and avoids accidental over-matching.
+
+### 3. Header matching as a list of `{name, value, type}` objects
+
+Mirrors the shape of Gateway API's header matches. All headers must match
+(AND). `type` defaults to `exact`; `regex` is available. No header-absent
+for now (adds complexity, low immediate need).
+
+```yaml
+headers:
+  - name: Content-Type
+    value: application/json    # type: exact (default)
+  - name: X-Internal-Key
+    value: "dev-[0-9]+"
+    type: regex
+```
+
+### 4. Method list as a flat enum list
+
+Adopts ALB's conciseness. An empty or absent `methods` list means all
+methods are permitted. Values are uppercased HTTP method names.
+
+```yaml
+methods: [GET, HEAD]
+```
+
+### 5. Multiple `match` entries per route: OR semantics at the route level
+
+If a route needs GET on one path and POST on a different path, use a
+`matches` (plural) list where entries are ORed:
+
+```yaml
+routes:
+  - host: api.example.com
+    matches:
+      - paths: [{type: prefix, value: /read}]
+        methods: [GET, HEAD]
+      - paths: [{type: exact, value: /write}]
+        methods: [POST, PUT]
+```
+
+This mirrors Gateway API's top-level OR; each entry is an AND of its
+predicates.
+
+---
+
+## Open questions
+
+1. **Backward compatibility:** `path_allowlist` is the current field. If
+   adopting a `match`/`matches` structure, keep `path_allowlist` as a
+   deprecated alias? Or treat this as a breaking manifest version bump?
+2. **Glob segment semantics:** adopt shell convention (`*` = intra-segment,
+   `**` = cross-segment) or ALB convention (`*` = anything including `/`)?
+   The shell convention is safer; ALB's is simpler.
+3. **Header value OR:** Gateway API requires a separate match entry to OR
+   header values. ALB allows multiple values in one condition. Which is
+   less surprising for bot-bottle operators? The ALB approach is more
+   concise for the common case (e.g., `Content-Type: [application/json,
+   application/x-www-form-urlencoded]`).
+4. **Case sensitivity on method names:** normalize to uppercase at parse
+   time (failing on unrecognised values), or keep values as written and
+   compare case-insensitively at match time?
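As a way to reason about the open questions, the proposed `matches` semantics (entries ORed, predicates inside an entry ANDed, empty `methods` meaning all methods) can be sketched in Python. Everything here is hypothetical: `MatchEntry`, `route_allows`, and the helper names are invented for illustration, only the `exact` and `prefix` path types are covered, and uppercase normalization of the request method is one possible answer to question 4, not a decision.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MatchEntry:
    # Each path is a (type, value) pair; only "exact" and "prefix" are sketched.
    paths: List[Tuple[str, str]] = field(default_factory=list)
    # An empty list means every method is permitted.
    methods: List[str] = field(default_factory=list)


def path_matches(entry: MatchEntry, path: str) -> bool:
    if not entry.paths:  # assumption: no path predicates means any path
        return True
    for kind, value in entry.paths:
        if kind == "exact" and path == value:
            return True
        if kind == "prefix" and path.startswith(value):  # plain string prefix
            return True
    return False


def method_matches(entry: MatchEntry, method: str) -> bool:
    # Assumption: the request method is uppercased before comparison.
    return not entry.methods or method.upper() in entry.methods


def route_allows(matches: List[MatchEntry], method: str, path: str) -> bool:
    # Entries are ORed; predicates inside one entry are ANDed.
    return any(path_matches(e, path) and method_matches(e, method) for e in matches)


matches = [
    MatchEntry(paths=[("prefix", "/read")], methods=["GET", "HEAD"]),
    MatchEntry(paths=[("exact", "/write")], methods=["POST", "PUT"]),
]
read_ok = route_allows(matches, "GET", "/read/items")
write_to_read_path = route_allows(matches, "POST", "/read/items")
```

The two entries mirror the `/read` and `/write` example from design choice 5: a POST to `/read/items` is rejected because no single entry satisfies both its path and its method predicate.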