bot-bottle/docs/research/yaml-route-matching-formats.md
didericis-claude 8c0a9c5bc6 docs: rename PRD 0053 to PRD 0052
Renames docs/prds/0053-egress-dlp-addon.md to 0052-egress-dlp-addon.md
and updates all references in the documentation.
2026-06-06 16:27:04 +00:00

YAML route matching formats: paths, headers, and methods

Question

Bot-bottle's egress manifest currently supports exact-host matching and a flat list of path prefixes (path_allowlist). As the DLP work (PRD 0052) and future route hardening evolve, we may want more expressive matching: glob-style path patterns (/api/*/data), header predicates (Content-Type, Accept), and per-method rules (GET allowed, POST blocked). What established YAML-based formats exist for declaring this kind of route matching, and which design choices should bot-bottle adopt?

Summary

Four formats stand out as well-designed, widely deployed references: Kubernetes Gateway API HTTPRoute, Envoy RouteConfiguration, AWS ALB listener rules, and Traefik dynamic routing. A fifth, Istio VirtualService, is worth noting but is largely superseded by Gateway API for new designs.

Recommendation for bot-bottle: adopt the Gateway API HTTPRoute match vocabulary as a direct model. It is the most carefully designed of the four, has a published spec, handles all three requirements cleanly, and its match object nests naturally into a YAML route block alongside bot-bottle's existing host, path_allowlist, and auth fields. Envoy's format is more powerful but far more verbose and harder to validate by hand; ALB rules use a flat predicate list that does not compose well; Traefik uses string expressions rather than structured YAML.

Current bot-bottle route schema

egress:
  routes:
    - host: api.github.com
      path_allowlist:
        - /repos/myorg/
      auth:
        scheme: Bearer
        token_ref: EGRESS_TOKEN_0

Matching today: exact host + path-prefix list. No method or header awareness.
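
The matching rule stated above can be sketched in a few lines of Python. This is an illustration of the described semantics, not bot-bottle's actual code: the function name is_allowed and the dict-shaped route are assumptions, as is the choice to treat a route with no path_allowlist as matching nothing.

```python
def is_allowed(route: dict, host: str, path: str) -> bool:
    """Exact host comparison, then a flat list of path prefixes."""
    if host != route["host"]:
        return False
    # Assumption: a missing or empty path_allowlist matches no path.
    prefixes = route.get("path_allowlist", [])
    return any(path.startswith(prefix) for prefix in prefixes)


route = {"host": "api.github.com", "path_allowlist": ["/repos/myorg/"]}
print(is_allowed(route, "api.github.com", "/repos/myorg/bot-bottle"))  # True
print(is_allowed(route, "api.github.com", "/repos/other/project"))     # False
```

The request method and headers never enter the decision, which is the gap the rest of this document looks at.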


Format 1: Kubernetes Gateway API HTTPRoute

Spec: gateway.networking.k8s.io/v1
Maturity: GA (v1.0+, 2023). Backed by SIG Network; shipping in GKE, EKS, AKS, Istio, Envoy Gateway, Cilium, Traefik v3.

Match object

rules:
  - matches:
      - path:
          type: Exact          # Exact | PathPrefix | RegularExpression
          value: /api/v1/data
        headers:
          - name: Content-Type
            type: Exact        # Exact | RegularExpression
            value: application/json
        queryParams:
          - name: version
            type: Exact
            value: "2"
        method: GET            # GET | POST | PUT | DELETE | PATCH | …

A matches entry is a logical AND across all predicates within it. Multiple entries in the matches list are ORed: the rule fires if any entry matches.

Path matching

| type | Semantics |
| --- | --- |
| Exact | Full path must equal value (no trailing-slash equivalence) |
| PathPrefix | Path must start with value, compared per path segment; /api matches /api/v1 but not /apiv1 |
| RegularExpression | Regex; the dialect and the anchoring are implementation-specific (the spec does not mandate RE2) |
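
A rough Python sketch of the three path types follows. The function name path_matches is hypothetical, Python's re module stands in for whatever regex engine a given implementation uses, and anchoring the regex with fullmatch is one possible choice rather than something the spec requires.

```python
import re


def path_matches(match_type: str, value: str, path: str) -> bool:
    """Evaluate one Gateway-API-style path predicate against a request path."""
    if match_type == "Exact":
        return path == value
    if match_type == "PathPrefix":
        # Compare whole segments: drop a trailing slash from the configured
        # value, then require equality or a "/" right after the prefix.
        prefix = value.rstrip("/")
        return path == prefix or path.startswith(prefix + "/")
    if match_type == "RegularExpression":
        # Assumption: the pattern must cover the whole path.
        return re.fullmatch(value, path) is not None
    raise ValueError(f"unknown path match type: {match_type!r}")


print(path_matches("PathPrefix", "/api", "/api/v1"))             # True
print(path_matches("PathPrefix", "/api", "/apiv1"))              # False
print(path_matches("Exact", "/api/v1/data", "/api/v1/data/"))    # False
```

The segment-aware prefix check is what separates PathPrefix from a plain string startswith test.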

Glob-style paths (/api/*/data): Gateway API does not define a glob type. The intent is to use RegularExpression for that case: /api/[^/]+/data replaces /api/*/data. This is unambiguous and widely understood.
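
To make the glob-to-regex rewrite concrete, here is a small sketch that turns a single-segment glob into the equivalent regex. The helper name glob_to_regex is an assumption for illustration, it treats * as "exactly one non-empty path segment", and Python's re module is used as a stand-in regex engine.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Rewrite each '*' as one non-empty path segment; escape everything else."""
    literal_parts = pattern.split("*")
    return "[^/]+".join(re.escape(part) for part in literal_parts)


segment_glob = re.compile(glob_to_regex("/api/*/data"))
print(bool(segment_glob.fullmatch("/api/users/data")))   # True
print(bool(segment_glob.fullmatch("/api/a/b/data")))     # False
print(bool(segment_glob.fullmatch("/api//data")))        # False
```

Using fullmatch sidesteps the anchoring differences between implementations noted above.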

Header matching

headers:
  - name: Content-Type
    type: Exact
    value: application/json
  - name: X-Request-Id
    type: RegularExpression
    value: "[0-9a-f]{8}-.*"

All headers entries must match (AND semantics). A request that lacks a named header does not match; v1 defines no "header absent" type, although some implementations add one as an extension.
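
The AND semantics and the missing-header behaviour can be sketched as follows. The function name headers_match and the dict-based request headers are assumptions for illustration; regex values are anchored with fullmatch here, which is a choice of this sketch.

```python
import re


def headers_match(predicates: list, request_headers: dict) -> bool:
    """Return True only if every predicate matches (AND semantics)."""
    # HTTP header names are case-insensitive, so compare them lowercased.
    actual = {name.lower(): value for name, value in request_headers.items()}
    for pred in predicates:
        value = actual.get(pred["name"].lower())
        if value is None:
            return False  # a missing header is a non-match
        if pred.get("type", "Exact") == "Exact":
            if value != pred["value"]:
                return False
        elif re.fullmatch(pred["value"], value) is None:
            return False
    return True


predicates = [
    {"name": "Content-Type", "type": "Exact", "value": "application/json"},
    {"name": "X-Request-Id", "type": "RegularExpression", "value": "[0-9a-f]{8}-.*"},
]
request = {"content-type": "application/json", "x-request-id": "deadbeef-42"}
print(headers_match(predicates, request))                                # True
print(headers_match(predicates, {"content-type": "application/json"}))   # False
```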

Method matching

method: GET

Single method per match entry. To allow GET and POST, use two match entries (OR semantics at the matches level):

matches:
  - path:
      type: PathPrefix
      value: /api/v1
    method: GET
  - path:
      type: PathPrefix
      value: /api/v1
    method: POST

Strengths / weaknesses

Strengths: spec-backed, implementation-tested, composable AND/OR semantics, explicit about what is not supported (no glob, no header-absent), good field naming (type + value pattern is consistent throughout).

Weaknesses: verbosity when expressing OR across methods; regex is the only path wildcard mechanism; no body matching.


Format 2: Envoy RouteConfiguration

Spec: envoy.config.route.v3.RouteMatch
Maturity: Widely deployed (Istio data plane, AWS App Mesh, solo.io Gloo). Defined in protobuf; YAML is the human-readable rendering.

Match object

match:
  path: /exact/path              # exact match
  # OR
  prefix: /api/                  # prefix match
  # OR
  safe_regex:
    google_re2: {}
    regex: "/api/v[0-9]+/.*"
  # OR
  path_separated_prefix: /api/v1 # prefix with segment boundary enforcement

  headers:
    - name: content-type
      string_match:
        exact: application/json
      # OR
        prefix: text/
      # OR
        safe_regex:
          google_re2: {}
          regex: "application/(json|xml)"
      invert_match: false        # set to true to negate the predicate

    - name: x-custom-header
      present_match: true        # just check presence

  query_parameters:
    - name: version
      string_match:
        exact: "2"

Method is matched via a pseudo-header:

headers:
  - name: :method
    string_match:
      exact: GET

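HeaderMatcher has no OR combinator across values (or_match belongs to ValueMatcher, which is used for metadata matching rather than headers). To accept several methods, use a regex alternation on the pseudo-header, or define one route per method:

headers:
  - name: :method
    string_match:
      safe_regex:
        google_re2: {}
        regex: "GET|POST"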

Path matching

| Field | Semantics |
| --- | --- |
| prefix | Path starts with value (any suffix allowed) |
| path | Exact match |
| safe_regex | RE2 regex (Google RE2 safety guarantees) |
| path_separated_prefix | Like prefix but only matches at segment boundaries (/api/v1 won't match /api/v10) |
| connect_matcher | CONNECT method only |

Glob (/api/*/data): use safe_regex: /api/[^/]+/data.

Strengths / weaknesses

Strengths: most expressive format surveyed; invert_match, present_match, regex on any header, pseudo-header method matching; covers nearly every matching edge case.

Weaknesses: very verbose; protobuf-origin field names are not self-evident; no OR across header values short of a regex alternation; hard to validate in a lightweight schema check; not appropriate as a user-facing YAML format without a wrapping DSL.


Format 3: AWS ALB Listener Rules

Spec: AWS Elastic Load Balancing API (Conditions)
Maturity: GA, widely used in AWS infrastructure-as-code (CloudFormation, Terraform aws_lb_listener_rule).

Match object (Terraform / CloudFormation rendering)

conditions:
  - field: path-pattern
    path_pattern_config:
      values:
        - /api/*
        - /health
  - field: http-header
    http_header_config:
      http_header_name: Content-Type
      values:
        - application/json
        - application/x-www-form-urlencoded
  - field: http-request-method
    http_request_method_config:
      values:
        - GET
        - POST
  - field: host-header
    host_header_config:
      values:
        - "*.example.com"
        - api.example.com
  - field: query-string
    query_string_config:
      values:
        - key: version
          value: "2"

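All conditions in a rule are ANDed. Multiple values within a single condition are ORed. A rule may carry at most 5 condition values in total, counted across all of its conditions, so the block above illustrates the available fields rather than a single deployable rule.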

Path matching

ALB natively supports glob patterns in path-pattern:

  • * matches any sequence of characters (including /).
  • ? matches any single character.

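This is the only surveyed format with first-class glob support. /api/*/data is valid, but because * also matches /, it accepts /api/a/b/data as well as /api/a/data, so it is not equivalent to the regex /api/[^/]+/data. No regex support.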
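
The wildcard rules listed above can be modelled with a short translation to a regex. This is a sketch of the documented semantics (* matches any run of characters including /, ? matches exactly one character), not AWS code; the helper name alb_pattern_to_regex is hypothetical.

```python
import re


def alb_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate an ALB-style path pattern into a compiled regex."""
    pieces = []
    for char in pattern:
        if char == "*":
            pieces.append(".*")   # any sequence of characters, '/' included
        elif char == "?":
            pieces.append(".")    # exactly one character
        else:
            pieces.append(re.escape(char))
    return re.compile("".join(pieces))


rule = alb_pattern_to_regex("/api/*/data")
print(bool(rule.fullmatch("/api/users/data")))   # True
print(bool(rule.fullmatch("/api/a/b/data")))     # True: '*' spans segments
print(bool(rule.fullmatch("/api/data")))         # False
```

The second result is the practical difference from a per-segment wildcard.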

Header matching

Header conditions match against the header value. Multiple values are ORed. The header name is fixed per condition block; to AND two header predicates, add two separate http-header conditions. Case-insensitive matching on values.

Method matching

- field: http-request-method
  http_request_method_config:
    values:
      - GET
      - POST

Multiple values are ORed (GET or POST). Method names are compared exactly and case-sensitively, may be up to 40 characters long, and do not support wildcards.

Strengths / weaknesses

Strengths: first-class glob path matching (the only format surveyed with * and ?); multi-value OR within a condition block is concise for the common case; method matching is a flat list, easy to write.

Weaknesses: maximum 5 condition values per rule; no regex; no header-absent predicate; no request-body matching; the field + *_config naming is awkward (the field name is a string enum that determines which sibling key is relevant, a schema-validation anti-pattern); tied to AWS semantics (target groups, priority integers).


Format 4: Traefik Dynamic Routing

Spec: Traefik Router Rule syntax
Maturity: GA, widely deployed in Kubernetes (IngressRoute CRD) and Docker-Compose setups. Traefik v3 aligns with Gateway API for Kubernetes routes but keeps its own expression syntax for the rule field.

Match expression (string, embedded in YAML)

http:
  routers:
    my-router:
      rule: >
        Host(`api.example.com`) &&
        PathPrefix(`/api/v1`) &&
        (Method(`GET`) || Method(`POST`)) &&
        Header(`Content-Type`, `application/json`)
      service: my-service

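&& = AND, || = OR, ! = NOT. Parentheses for grouping. In Traefik v3 each matcher takes a single value, so alternatives are written with ||; the multi-value form Method(`GET`, `POST`) is v2 syntax.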

Available matchers:

| Matcher | Example |
| --- | --- |
| Host | Host("api.example.com") |
| HostRegexp | HostRegexp(".*\.example\.com") |
| Path | Path("/exact/path") |
| PathPrefix | PathPrefix("/api/v1") |
| PathRegexp | PathRegexp("/api/v[0-9]+/.*") |
| Method | Method("GET") |
| Header | Header("Content-Type", "application/json") |
| HeaderRegexp | HeaderRegexp("Accept", "application/.*") |
| Query | Query("version", "2") |
| QueryRegexp | QueryRegexp("id", "[0-9]+") |
| ClientIP | ClientIP("10.0.0.0/8") |

Glob paths: not supported directly. Use PathRegexp instead.

Strengths / weaknesses

Strengths: the most expressive and concise format for complex boolean combinations (AND/OR/NOT in a single line); the v2 multi-value form Method("GET", "POST") is the cleanest multi-method syntax surveyed, although v3 requires Method("GET") || Method("POST"); full regex support on every field; Traefik v3 supports this inside Kubernetes CRDs.

Weaknesses: the rule is a string embedded in YAML, not a structured object — it cannot be validated with JSON Schema and is harder to generate programmatically; no structured round-trip; no glob, only regex.


Comparison table

| | Gateway API | Envoy | AWS ALB | Traefik |
| --- | --- | --- | --- | --- |
| Path: exact | Exact | path | exact value | Path() |
| Path: prefix | PathPrefix | prefix / path_separated_prefix | (via glob /*) | PathPrefix() |
| Path: glob (/a/*/b) | (use regex) | (use regex) | native | (use regex) |
| Path: regex | RegularExpression | safe_regex | no | PathRegexp() |
| Header: exact | type: Exact | string_match: exact | http-header values | Header() |
| Header: regex | type: RegularExpression | string_match: safe_regex | no | HeaderRegexp() |
| Header: absent | (extension) | present_match: false | no | |
| Method matching | (one per entry; OR via multiple entries) | (via :method pseudo-header) | (list = OR) | Method(), one value each |
| AND semantics | predicates within one matches entry | all conditions | all conditions in a rule | && operator |
| OR semantics | multiple matches entries | regex alternation or separate routes | multiple values in one condition | \|\| operator |
| Schema-validatable | (CRD/JSON Schema) | (protobuf) | (CloudFormation schema) | no (embedded string) |
| Human-writable | | ⚠️ verbose | | |
| Generatable | | | | ⚠️ (string concat) |

Design choices worth adopting

1. Match object as a structured peer to host

Gateway API's separation of concerns maps well onto bot-bottle's existing schema. Instead of a flat path_allowlist, a match block nests all predicates:

egress:
  routes:
    - host: api.github.com
      match:
        paths:
          - type: prefix       # exact | prefix | regex
            value: /repos/myorg/
        headers:
          - name: Content-Type
            value: application/json
        methods: [GET, POST]
      auth:
        scheme: Bearer
        token_ref: EGRESS_TOKEN_0

All predicates within match are ANDed. A list of paths entries is ORed (first match wins — same as the current path_allowlist semantics).

2. Path type enum (exact | prefix | regex)

Use three named types rather than inferring from the value's syntax. This avoids the ambiguity that plagues .gitignore and nginx location patterns where the same string can mean different things depending on leading characters.

  • exact: the full path must equal value.
  • prefix: mirrors current path_allowlist semantics.
  • regex: RE2 for wildcard and advanced cases. Reject at load time if the pattern fails to compile. Covers every case glob would handle: /api/[^/]+/data is the /api/*/data equivalent.

Glob-style syntax is not included: it adds a third path-matching language on top of prefix and regex without meaningful operator benefit, since regex is already required for any non-trivial wildcard.
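
The "reject at load time" rule for regex paths might look like the sketch below. The names ManifestError and compile_regex_paths are hypothetical, and Python's re module is only a stand-in: it is a backtracking engine, so a real implementation targeting RE2 would need an RE2 binding to get the same syntax and safety guarantees.

```python
import re


class ManifestError(ValueError):
    """Raised when a manifest contains an invalid path rule."""


def compile_regex_paths(paths: list) -> list:
    """Compile every regex-typed path rule up front, failing fast on bad ones."""
    compiled = []
    for index, rule in enumerate(paths):
        if rule["type"] != "regex":
            continue
        try:
            compiled.append(re.compile(rule["value"]))
        except re.error as exc:
            # Report which entry is broken so the manifest author can fix it.
            raise ManifestError(
                f"paths[{index}]: invalid regex {rule['value']!r}: {exc}"
            ) from exc
    return compiled


good = [{"type": "prefix", "value": "/repos/"}, {"type": "regex", "value": "/api/[^/]+/data"}]
print(len(compile_regex_paths(good)))  # 1
```

Failing while the manifest is parsed keeps a malformed pattern from surfacing later as a silently unmatched request.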

3. Header matching as a list of {name, value, type} objects

Follows Gateway API's header match shape, with lowercase type names. All headers must match (AND). type defaults to exact; regex is available. No header-absent for now (adds complexity, low immediate need).

headers:
  - name: Content-Type
    value: application/json         # type: exact (default)
  - name: X-Internal-Key
    value: "dev-[0-9]+"
    type: regex

4. Method list as a flat enum list

Adopts ALB's conciseness. An empty or absent methods list means all methods are permitted. Values are uppercased HTTP method names.

methods: [GET, HEAD]
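
A minimal sketch of the method-list handling, assuming the list arrives as parsed YAML (a Python list or None). The function names are hypothetical; the behaviour follows the rule above that an empty or absent list permits every method.

```python
def normalise_methods(raw) -> frozenset:
    """Uppercase the configured method names; None or [] yields an empty set."""
    return frozenset(method.upper() for method in (raw or []))


def method_allowed(allowed: frozenset, request_method: str) -> bool:
    """An empty set means every method is permitted."""
    return not allowed or request_method.upper() in allowed


allowed = normalise_methods(["GET", "HEAD"])
print(method_allowed(allowed, "GET"))                        # True
print(method_allowed(allowed, "POST"))                       # False
print(method_allowed(normalise_methods(None), "DELETE"))     # True
```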

5. Multiple match entries per route: OR semantics at the route level

If a route needs GET on one path and POST on a different path, use a matches (plural) list where entries are ORed:

routes:
  - host: api.example.com
    matches:
      - paths: [{type: prefix, value: /read}]
        methods: [GET, HEAD]
      - paths: [{type: exact, value: /write}]
        methods: [POST, PUT]

This mirrors Gateway API's top-level OR; each entry is an AND of its predicates.
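
Putting the pieces together, the evaluation of a matches list could be sketched as below. This is an illustration under stated assumptions, not the planned implementation: the function names are hypothetical, prefix is treated as a plain string prefix, an absent predicate is treated as "no constraint", a route with no matches entries matches nothing, and headers are left out to keep the example short.

```python
import re


def path_ok(rule: dict, path: str) -> bool:
    if rule["type"] == "exact":
        return path == rule["value"]
    if rule["type"] == "prefix":
        return path.startswith(rule["value"])
    if rule["type"] == "regex":
        return re.fullmatch(rule["value"], path) is not None
    raise ValueError(f"unknown path type: {rule['type']!r}")


def entry_matches(entry: dict, method: str, path: str) -> bool:
    """AND of the predicates in one entry; an absent predicate is skipped."""
    paths = entry.get("paths", [])
    if paths and not any(path_ok(rule, path) for rule in paths):
        return False
    methods = {m.upper() for m in entry.get("methods", [])}
    if methods and method.upper() not in methods:
        return False
    return True


def route_matches(route: dict, host: str, method: str, path: str) -> bool:
    """OR across the entries of the route's matches list."""
    if host != route["host"]:
        return False
    return any(entry_matches(e, method, path) for e in route.get("matches", []))


route = {
    "host": "api.example.com",
    "matches": [
        {"paths": [{"type": "prefix", "value": "/read"}], "methods": ["GET", "HEAD"]},
        {"paths": [{"type": "exact", "value": "/write"}], "methods": ["POST", "PUT"]},
    ],
}
print(route_matches(route, "api.example.com", "GET", "/read/items"))   # True
print(route_matches(route, "api.example.com", "POST", "/read/items"))  # False
print(route_matches(route, "api.example.com", "POST", "/write"))       # True
```

Because each entry is evaluated as a unit, POST on /read is rejected even though both POST and /read appear somewhere in the route.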


Decisions

The open questions raised during research were resolved in PR #196 review:

  1. Backward compatibility: Hard cutover. The new matches structure replaces path_allowlist entirely with no compatibility shim and no fallback parsing for the old format. Manifests using path_allowlist must be migrated.

  2. Glob support: Dropped. Not strictly necessary — regex covers every case glob would handle. Fewer path-matching languages to document and validate.

  3. Header value OR: Stick with Gateway API. OR across header values requires a separate entry in the matches list, not multiple values inside one headers block.

  4. Method name case: Case-insensitive at parse time. get, GET, and Get are all accepted and normalised to uppercase internally.
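
Since decision 1 requires existing manifests to be migrated, a one-off conversion could be as small as the sketch below. It is a hypothetical helper, not part of bot-bottle: it assumes the manifest has already been parsed into dicts, maps each old prefix to a prefix-typed path rule in a single matches entry, and leaves routes without path_allowlist untouched because the source does not say what an absent list means.

```python
def migrate_route(old_route: dict) -> dict:
    """Convert one route from path_allowlist to the matches structure."""
    new_route = {k: v for k, v in old_route.items() if k != "path_allowlist"}
    prefixes = old_route.get("path_allowlist")
    if prefixes:
        # One entry whose paths are ORed, preserving the old list semantics.
        new_route["matches"] = [
            {"paths": [{"type": "prefix", "value": prefix} for prefix in prefixes]}
        ]
    return new_route


old = {
    "host": "api.github.com",
    "path_allowlist": ["/repos/myorg/"],
    "auth": {"scheme": "Bearer", "token_ref": "EGRESS_TOKEN_0"},
}
print(migrate_route(old)["matches"])
# [{'paths': [{'type': 'prefix', 'value': '/repos/myorg/'}]}]
```

The migrated entry carries no methods or headers, so it permits the same requests the old path_allowlist did.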