From 8e261563dcf786779126b3cacd247accc2ac8792 Mon Sep 17 00:00:00 2001
From: didericis
Date: Tue, 12 May 2026 11:41:34 -0400
Subject: [PATCH] docs(research): TLS interception topologies for pipelock content scanning

Survey of TLS-MITM tools (mitmproxy, Squid+ssl_bump, Go libraries) and
five candidate topologies for adding TLS termination to the egress path
so pipelock's DLP, subdomain-entropy, and MCP scanners can fire on
plaintext bodies. Recommends mitmproxy in front of pipelock for v1 with
a per-bottle ephemeral CA.

Co-Authored-By: Claude Opus 4.7
---
 docs/research/tls-mitm-for-pipelock.md | 508 +++++++++++++++++++++++++
 1 file changed, 508 insertions(+)
 create mode 100644 docs/research/tls-mitm-for-pipelock.md

diff --git a/docs/research/tls-mitm-for-pipelock.md b/docs/research/tls-mitm-for-pipelock.md
new file mode 100644
index 0000000..aa8c4d0
--- /dev/null
+++ b/docs/research/tls-mitm-for-pipelock.md
@@ -0,0 +1,508 @@
+# TLS interception for pipelock content scanning
+
+Research into adding TLS termination ("MITM") to the egress path so that
+pipelock's scanning pipeline can see plaintext HTTP request and response
+bodies, instead of only the `CONNECT` host and opaque ciphertext.
+
+## Summary
+
+- Pipelock today sees `CONNECT` hostnames and the encrypted bytes that follow.
+  Its DLP, subdomain-entropy, and MCP scanners cannot fire on TLS-encrypted
+  bodies, which is the gap explicitly named under "Scope gaps" in
+  `pipelock-assessment.md` ("Pipelock does not perform TLS inspection (no CA
+  trust injection)").
+- Closing that gap requires a TLS-terminating proxy that bumps `CONNECT`,
+  presents a leaf certificate for the target hostname signed by a CA the
+  bottle's trust store accepts, decrypts the inner HTTP, and re-establishes
+  TLS to the real upstream.
+- The mature open-source option is **mitmproxy**. Squid + `ssl_bump` is the
+  heavier production-grade alternative.
+  The Go ecosystem (`goproxy`,
+  `gomitmproxy`, `martian`) is suitable only if we want a custom binary
+  tightly coupled to pipelock.
+- Recommended v1 topology: **mitmproxy in front of pipelock** on the same
+  egress route. mitmproxy terminates client TLS, forwards plaintext to
+  pipelock as its upstream HTTP proxy, and re-encrypts to the real upstream.
+  Pipelock stays unchanged.
+- Per-bottle ephemeral CA, generated at bottle start and destroyed on
+  teardown. The CA private key lives only on the sidecar; the bottle's
+  trust store only ever sees the public cert.
+- Cert pinning is a known caveat but a small one given the narrow allowlist
+  in this project. Selective bumping is the mitigation if a future
+  allowlisted host turns out to pin.
+
+---
+
+## What pipelock cannot see today
+
+The current egress topology (per `pipelock-assessment.md`):
+
+```
+agent --HTTPS_PROXY--> pipelock --CONNECT host:443--> internet
+                          \____________________________
+                                opaque TLS bytes
+```
+
+The agent's client (Claude Code, `curl`, an MCP server, a Python SDK)
+sends `CONNECT api.anthropic.com:443`. Pipelock checks the hostname
+against its `api_allowlist`, replies `200 Connection Established`, and
+then blindly relays bytes between the two TCP halves. The TLS handshake
+and everything inside it happens end-to-end between the agent and the
+real upstream.
+
+What pipelock can scan in this mode:
+
+- `CONNECT` target hostname (SNI is not even needed).
+- TLS record framing and lengths (useful for budgets, useless for DLP).
+- Plain HTTP/1.1 to non-HTTPS destinations (irrelevant: there are none
+  in `DEFAULT_ALLOWLIST`).
+
+What pipelock cannot scan in this mode:
+
+- Request URL, method, headers, body.
+- Response status, headers, body.
+- MCP JSON-RPC payloads inside the TLS session.
+- WebSocket frames inside a TLS-wrapped upgrade.
+- Whether the inner SNI or HTTP `Host` / `:authority` matches the
+  outer `CONNECT` target (domain-fronting check).
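To make the visibility boundary concrete, here is a minimal sketch (not pipelock's actual implementation) of the two kinds of check listed above: the hostname gate that works on the `CONNECT` line alone, and the domain-fronting comparison that needs the decrypted `Host` header. The function names and the sample `ALLOWLIST` are hypothetical and exist only for illustration.

```python
# Illustrative only: the function names and ALLOWLIST below are assumptions,
# not pipelock's real API or configuration.
ALLOWLIST = {"api.anthropic.com", "claude.ai"}


def parse_connect_target(request_line: str) -> tuple[str, int]:
    """Split 'CONNECT host:port HTTP/1.1' into (host, port)."""
    method, target, _version = request_line.split()
    if method != "CONNECT":
        raise ValueError(f"not a CONNECT request: {request_line!r}")
    host, _, port = target.rpartition(":")
    return host.lower(), int(port)


def allowed_by_connect(request_line: str) -> bool:
    """The decision a non-terminating proxy can make: hostname vs. allowlist."""
    host, _port = parse_connect_target(request_line)
    return host in ALLOWLIST


def is_fronted(connect_host: str, inner_host_header: str) -> bool:
    """Domain-fronting check. It needs the plaintext Host header, so it only
    becomes possible once TLS is terminated."""
    inner = inner_host_header.split(":")[0].strip().lower()
    return inner != connect_host.lower()
```

In tunnel mode only `allowed_by_connect` can run; `is_fronted` has no input to work with until some component on the path decrypts the inner request.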
+
+The 48-pattern DLP layer, the subdomain-entropy check (insofar as it
+inspects URLs rather than DNS-resolver queries), the request-redaction
+feature added in v2.3.0, and bidirectional MCP scanning all require
+plaintext to operate on. Without TLS termination, those layers are
+inert against any HTTPS destination, which is every destination in
+the current allowlist.
+
+---
+
+## How TLS interception works
+
+The mechanics of `CONNECT` bumping, end to end:
+
+1. **Agent issues `CONNECT`.** The HTTP client sees `HTTPS_PROXY` set,
+   so it opens a TCP connection to the proxy and sends
+   `CONNECT api.anthropic.com:443 HTTP/1.1`.
+2. **Proxy answers `200`.** Standard tunnel-established response.
+3. **Proxy starts TLS as the server.** Instead of relaying bytes, the
+   proxy itself performs a TLS handshake with the agent. It needs a
+   server certificate for `api.anthropic.com`, so on first contact for
+   that hostname, the proxy generates a leaf certificate with
+   `CN=api.anthropic.com` and a SAN for the same, signs it with its
+   own CA private key, and presents that cert. Subsequent connections
+   to the same hostname reuse the cached leaf.
+4. **Agent verifies the cert.** The agent's TLS library walks the chain
+   to a trusted root. Because the bottle's trust store contains the
+   proxy's CA cert, validation succeeds. The agent has no way to tell
+   it isn't talking to the real `api.anthropic.com`.
+5. **Proxy opens its own TLS to the real upstream.** As a client this
+   time, using the system root store, talking to the real
+   `api.anthropic.com`. Real SNI, real cert chain validated normally.
+6. **Proxy bridges the two TLS sessions.** Decrypts on the server side,
+   re-encrypts on the client side, and scans the plaintext in between.
+
+This is what every TLS-terminating egress proxy does. The trade-offs
+live in three places:
+
+- **CA trust injection.** Step 4 only works if the bottle's trust
+  store contains the proxy's CA.
+  Mechanics covered under "CA lifecycle"
+  below.
+- **Cert generation cost.** Generating an RSA-2048 leaf cert takes
+  ~50 ms; ECDSA P-256 is ~5 ms. Cache leaves per (hostname, SAN list)
+  to keep this off the steady-state hot path.
+- **Protocol coverage.** The proxy needs to speak HTTP/1.1, HTTP/2 (ALPN
+  `h2`), and ideally WebSocket. HTTP/3 / QUIC is UDP and requires a
+  separate code path; for v1, blocking UDP/443 at the iptables layer
+  forces clients to fall back to HTTP/2, which we can inspect.
+
+---
+
+## Tools
+
+### mitmproxy
+
+- **What it is.** Python (with Rust crypto bits) interactive HTTPS proxy.
+  Reference open-source implementation of the bump pattern. Ships as
+  `mitmproxy` (TUI), `mitmweb` (browser UI), and `mitmdump` (headless).
+- **Cert handling.** Generates a CA on first run under `~/.mitmproxy/`.
+  Per-host leaves are generated on demand and cached in memory. Cert
+  cache keyed by (hostname, SAN extensions inferred from upstream cert).
+- **Protocols.** HTTP/1.1, HTTP/2, WebSocket fully supported. HTTP/3
+  exists as experimental. Raw TCP / non-HTTP TLS supported via
+  `--mode reverse:` but not in CONNECT-bump mode.
+- **Extensibility.** Python addon API. An addon module can inspect or
+  modify any `request` / `response` / `tcp_message` flow. The pipelock
+  integration in Topology D below uses this.
+- **Selective bumping.** `ignore_hosts` regex; matching CONNECTs are
+  tunneled blindly instead of bumped. Critical for the cert-pinning
+  mitigation.
+- **Docker image.** `mitmproxy/mitmproxy` on Docker Hub. Single binary
+  for the CLI, ~80 MB image. Configurable via flags or `~/.mitmproxy/config.yaml`.
+- **Project URL.** , .
+
+Most mature, best-documented, lowest-effort integration. Default choice
+for v1.
+
+### Squid + ssl_bump
+
+- **What it is.** Squid is a long-established C++ caching proxy.
+  `ssl_bump` is its TLS-interception feature, controlled by per-CONNECT
+  actions: `splice` (tunnel blindly), `bump` (decrypt and re-encrypt),
+  `peek` (look at TLS hello then decide), `stare` (look at server cert
+  then decide), `terminate` (abort the connection).
+- **Cert handling.** Configured via `sslcrtd_program`, a helper that
+  generates and caches per-host certs. CA cert and key referenced by
+  PEM paths in `squid.conf`.
+- **Protocols.** HTTP/1.1 fully; HTTP/2 toward clients is
+  version-dependent and needs verifying; no scripted addons.
+- **Extensibility.** ICAP (Internet Content Adaptation Protocol) for
+  external scanners. Squid hands each request/response to an ICAP
+  service (via ICAP's `REQMOD` / `RESPMOD` methods) that can modify or
+  reject it. This is the formal version of Topology D below.
+- **Production track record.** Used at corporate-proxy scale (large
+  enterprises, ISPs). Heavyweight for a single-bottle sidecar.
+- **Project URL.** .
+
+Right tool if pipelock grows an ICAP server endpoint. Otherwise, more
+config surface than this project needs.
+
+### Go libraries: goproxy, gomitmproxy, martian
+
+- **`goproxy`** (elazarl): long-lived Go library, basic CONNECT-bumping
+  proxy with a handler API. Sparse on HTTP/2.
+
+- **`gomitmproxy`** (AdGuard): newer, cleaner API; built for AdGuard
+  Home / DNS-filtering products. HTTP/2 support is partial.
+
+- **`martian`** (Google): request/response modifier framework with a
+  JSON-configurable rule engine. Used internally at Google; public
+  ecosystem thin.
+
+
+These are relevant only if we decide to write a custom TLS-terminating
+binary that links pipelock's scanning packages directly (Topology C
+below). They are not faster to integrate than mitmproxy for the v1
+sidecar shape; they are smaller and more direct, at the cost of writing
+more Go.
+
+### Disqualified
+
+- **Caddy, Envoy, HAProxy.** All can terminate TLS at a reverse-proxy
+  vhost. None ship a "bump on CONNECT and forward plaintext to a
+  downstream proxy" mode out of the box.
+  Adapting any of them to this
+  shape is more work than starting from mitmproxy.
+- **Cloudflare Gateway, Zscaler, Netskope, Forcepoint.** Managed cloud
+  egress with TLS inspection. Wrong topology: they live outside the
+  host, not as a per-bottle sidecar, and they require trusting a vendor
+  with full plaintext.
+- **Charles Proxy, Burp Suite.** Closed-source GUI tools for developer
+  capture and security testing. Not appropriate as headless sidecars.
+- **`mitmdump` standalone vs. embedding mitmproxy as a library.** Both
+  are mitmproxy. Called out only to note that the project ships both a
+  CLI and a Python API; addons can be loaded either way.
+
+---
+
+## Topologies
+
+Five candidate topologies, ordered roughly from least to most coupled
+between the two components.
+
+### A: mitmproxy in front of pipelock (recommended)
+
+```
+agent --HTTPS_PROXY--> mitmproxy --HTTP_PROXY--> pipelock --> internet
+                       (bump TLS)                (scan plain)  (real TLS)
+```
+
+mitmproxy terminates the agent's TLS connection, decrypts, and then
+forwards the inner HTTP request to pipelock by treating pipelock as
+its own upstream HTTP forward proxy. Pipelock receives plaintext HTTP
+exactly as if the agent had used HTTP, applies its full scanning
+pipeline, and forwards to mitmproxy's upstream client half, which
+re-establishes TLS to the real destination.
+
+Concretely, the agent's `HTTPS_PROXY` points at mitmproxy; mitmproxy's
+`upstream_proxy` config points at pipelock; pipelock's network reach
+includes the real internet.
+
+- **Wins.** Pipelock unchanged. mitmproxy unchanged from default
+  configuration. Each component has one job. Failure modes are clear
+  per layer.
+- **Costs.** Two sidecars per bottle instead of one. One extra
+  decrypt / re-encrypt hop, ~5–15 ms per request in steady state.
+- **Open question.** How exactly mitmproxy forwards to pipelock matters
+  for whether pipelock sees TLS again or only HTTP.
+  mitmproxy's `upstream` mode wraps the decrypted request in another
+  CONNECT if the destination is HTTPS, which would re-encrypt before
+  pipelock sees it, defeating the point. The correct mode is `upstream`
+  with TLS re-origination disabled, or `regular` mode with a chained
+  proxy. mitmproxy has reworked its proxy modes across major releases,
+  so this needs verification against the current docs at integration
+  time.
+
+### B: pipelock in front of mitmproxy (ruled out)
+
+```
+agent --HTTPS_PROXY--> pipelock --CONNECT?--> mitmproxy --> internet
+                       (sees CONNECT only)    (bump TLS)
+```
+
+Pipelock would receive a `CONNECT` and decide to allow or deny based
+on hostname, then tunnel to mitmproxy. mitmproxy would terminate TLS
+and see plaintext, but pipelock would never see the plaintext, which
+is the whole point of the exercise. The scanning still happens (in
+mitmproxy), but it isn't pipelock doing it, so we'd need an entirely
+different rule engine. Ruled out.
+
+### C: Extend pipelock itself to terminate TLS
+
+Two sub-variants:
+
+**C.1: Upstream a `tls_terminate` mode.** Submit a feature to
+pipelock that adds CONNECT bumping and per-host cert generation in Go,
+using `crypto/tls` and the existing scanning packages. Pipelock becomes
+a self-contained MITM proxy. The license question matters here: the
+Apache 2.0 core can grow new features in-tree, but if upstream insists
+this belongs in `enterprise/` (ELv2), we either accept ELv2 or fork.
+
+**C.2: Wrap pipelock in a thin Go binary in the same container.** A
+small Go program does the TLS half (`CONNECT` parsing, cert generation,
+TLS handshake) and pipes plaintext to pipelock over a Unix domain
+socket (UDS) or loopback. The wrapper is ours; pipelock is unmodified.
+No license question.
+
+- **Wins.** Single component on the egress path. Pipelock owns the
+  scanning end-to-end, including domain-fronting checks (SNI vs.
+  `Host` vs. `CONNECT`).
+- **Costs.** Real Go engineering effort.
+  CA generation, cert caching,
+  TLS handshake, HTTP/2 ALPN negotiation, WebSocket upgrade: all
+  things mitmproxy already solves.
+- **When.** Right shape for v2 or v3 once the v1 mitmproxy-in-front
+  topology has proven the integration works and the scanning rules are
+  stable.
+
+### D: mitmproxy as the proxy, pipelock as a content-scan subroutine
+
+```
+agent --HTTPS_PROXY--> mitmproxy --> internet
+                       (bump TLS)
+                           |
+                           v
+                 POST /scan to pipelock
+                 <- allow / block / redact
+```
+
+A Python addon in mitmproxy sends each decrypted request (and response)
+to a pipelock HTTP `/scan` endpoint and gates the flow on the verdict.
+mitmproxy handles all networking; pipelock is the rule engine only.
+
+- **Wins.** Clean separation of concerns. Pipelock doesn't have to
+  speak TLS at all. The addon is small, ~100 lines of Python.
+- **Costs.** Requires pipelock to expose a scan API. The current Apache
+  2.0 core does not document one. If `/scan` lives in `enterprise/`,
+  ELv2 applies. If it doesn't exist, we'd be asking pipelock for a new
+  surface.
+- **Variant.** Squid's ICAP path is the formalized version of the same
+  pattern.
+
+### E: Single container, two processes
+
+mitmproxy and pipelock share a container, started by `supervisord` or
+`s6-overlay`. Networking simplifies to localhost. Lifecycle complicates:
+container restart now means restarting both; failure of one process is
+not visible at the Docker layer; logs interleave.
+
+- **Wins.** Slightly less Docker plumbing in `cli.py`.
+- **Costs.** Operational complexity not worth the savings. The two
+  containers are independent processes with independent failure modes;
+  Docker is the right tool for that.
+
+Net: not recommended.
+
+---
+
+## CA lifecycle
+
+The CA private key is the asset to defend. With it, anyone can issue
+certs that the bottle's trust store will accept for any hostname.
+So:
+
+**Per-bottle ephemeral CA.** At bottle start, generate a fresh
+RSA-2048 or ECDSA-P256 CA inside the mitmproxy sidecar. Export only
+the public cert (PEM) into the bottle's trust store at one of:
+
+- `/usr/local/share/ca-certificates/claude-bottle-mitm.crt` followed by
+  `update-ca-certificates` (Debian/Ubuntu base images).
+- `/etc/pki/ca-trust/source/anchors/` with `update-ca-trust`
+  (Red-Hat-family).
+- `$NODE_EXTRA_CA_CERTS` for Node-based agents (Claude Code).
+- `$SSL_CERT_FILE` / `$REQUESTS_CA_BUNDLE` for Python SDKs.
+
+The private key never leaves the sidecar's filesystem. The CA's public
+certificate is the only artifact that crosses into the bottle.
+
+On bottle teardown, the sidecar container is destroyed; the CA dies
+with it. The next bottle gets a fresh CA. No long-lived MITM CA on
+disk.
+
+**Why not a shared per-host CA.** A persistent CA across bottles is
+faster (no generation at start) but is a real liability: if its private
+key ever leaks (from a sidecar, a backup, or the host filesystem), an
+attacker on the host network could in principle impersonate any host to
+every bottle that trusts it, for as long as the CA exists. The public
+cert, which every bottle can read from its trust store by design, gives
+an attacker nothing on its own. With a per-bottle CA, a leaked key is
+useful only against that one bottle, and only until the bottle is torn
+down.
+
+**Generation cost.** RSA-2048 CA generation is ~200 ms; ECDSA-P256 is
+~5 ms. Either is irrelevant against the per-bottle Docker pull and
+network setup cost.
+
+**Where the CA lives in the bottle's trust store.** Both: a
+distribution-standard path with `update-ca-certificates`, and the
+env-var path. Belt and suspenders, because some Node and Python
+libraries honor the env vars only, and some load only `/etc/ssl/certs/`
+directly.
+
+---
+
+## Cert pinning (brief)
+
+A client that pins ignores the trust store and refuses any cert whose
+public key isn't on a hardcoded list.
+Three observations for this
+project:
+
+- The current `DEFAULT_ALLOWLIST` (`api.anthropic.com`,
+  `statsig.anthropic.com`, `sentry.io`, `claude.ai`,
+  `platform.claude.com`, `downloads.claude.ai`,
+  `raw.githubusercontent.com`) does not appear to include any host
+  whose server-side clients pin certificates. Server-side SDKs (Node,
+  Python) almost universally honor system trust and
+  `NODE_EXTRA_CA_CERTS` / `SSL_CERT_FILE`. Mobile SDKs and Chromium
+  pin; we don't run those.
+- If a future allowlisted host turns out to pin, the mitigation is
+  selective bumping via mitmproxy `ignore_hosts`: that specific
+  hostname tunnels blindly and pipelock loses DLP coverage for it.
+  Coverage on every other host is unaffected.
+- The cost of finding out is a single 5-minute test before adding a
+  host: point mitmproxy at the host and observe whether the client
+  succeeds.
+
+Not a v1 blocker. Document the failure mode and the mitigation.
+
+---
+
+## Comparison table
+
+| | A: mitmproxy → pipelock | B: pipelock → mitmproxy | C: TLS in pipelock | D: mitmproxy + scan API | E: one container |
+|---|---|---|---|---|---|
+| Pipelock sees plaintext | yes | no | yes | yes (via /scan) | yes |
+| Code change to pipelock | none | none | substantial | adds /scan endpoint | none |
+| Sidecar count | 2 | 2 | 1 | 2 | 1 |
+| Cert generation owner | mitmproxy | mitmproxy | pipelock | mitmproxy | mitmproxy |
+| Selective bumping | mitmproxy `ignore_hosts` | mitmproxy `ignore_hosts` | pipelock config | mitmproxy `ignore_hosts` | mitmproxy `ignore_hosts` |
+| Failure isolation per process | yes | yes | n/a (one process) | yes | no (shared container) |
+| License question | none | none | ELv2 risk | ELv2 risk | none |
+| v1 effort | low | low (but pointless) | high | medium | low |
+| Long-term shape | interim | n/a | best | possible | not recommended |
+
+---
+
+## Recommendation
+
+**Adopt Topology A for v1.** Add a mitmproxy sidecar to the egress
+topology, in front of pipelock on the same per-bottle
+internal network.
+The agent's `HTTPS_PROXY` points at mitmproxy; mitmproxy's upstream is
+pipelock; pipelock's upstream is the real internet.
+
+Concretely:
+
+1. Add a `MitmproxyProxy` class alongside `PipelockProxy`, with the
+   same `prepare` / `start` / `stop` lifecycle. The class generates
+   a per-bottle CA in `stage_dir`, exports the public cert into a
+   second file, and writes a mitmproxy config that:
+   - bumps every CONNECT by default
+   - uses `upstream_proxy = http://pipelock-:`
+   - listens on a known port inside the per-bottle internal network
+2. Extend the bottle launch step to copy the CA public cert into the
+   agent container under
+   `/usr/local/share/ca-certificates/claude-bottle-mitm.crt`, run
+   `update-ca-certificates`, and set `NODE_EXTRA_CA_CERTS` /
+   `SSL_CERT_FILE` / `REQUESTS_CA_BUNDLE` accordingly.
+3. Repoint the agent's `HTTPS_PROXY` and `HTTP_PROXY` from the pipelock
+   container to the mitmproxy container.
+4. Verify mitmproxy's upstream-proxy mode forwards plaintext (not a
+   re-wrapped CONNECT) to pipelock; if not, use `regular` mode with a
+   chained proxy directive.
+5. Test that pipelock's DLP, subdomain-entropy, and MCP scanners now
+   fire on real request bodies for `api.anthropic.com` traffic.
+
+**Defer Topologies C and D.** Topology C (extending pipelock to
+terminate TLS) is the cleanest long-term shape but is a substantial
+build and runs into the Apache 2.0 vs. ELv2 question. Topology D
+(mitmproxy with pipelock as a scan API) is attractive but requires a
+pipelock surface that doesn't exist today. Both are valid v2 targets;
+neither is the right starting point.
+
+The `network-egress-guard.md` v1 iptables + dnsmasq layer remains
+necessary alongside this: TLS interception covers HTTP/HTTPS only;
+raw TCP, UDP/443 (QUIC), UDP/53 (DNS), and ICMP still need the
+IP-level default-deny.
+
+---
+
+## Open questions
+
+1.
+   **mitmproxy upstream-proxy mode mechanics.** Does mitmproxy in
+   `upstream_proxy` mode forward decrypted HTTP plaintext to the
+   upstream, or does it wrap it in a new CONNECT? The documented
+   behavior changed between mitmproxy 8 and 10. Needs verification
+   against the version we pin.
+2. **Pipelock's behavior when receiving plain HTTP.** Pipelock's
+   `forward_proxy.enabled: true` accepts both `GET http://...` (plain
+   HTTP) and `CONNECT host:443` (HTTPS). After Topology A is wired up,
+   pipelock will see only plain HTTP. Does its DLP / MCP scanning
+   pipeline run the full set of layers, or are some gated on the
+   CONNECT path? Confirm by reading
+   `github.com/luckyPipewrench/pipelock/blob/main/docs/configuration.md`.
+3. **CA installation in the Anthropic-provided Claude Code Docker image.**
+   The base image's distribution determines whether `update-ca-certificates`
+   (Debian/Ubuntu) or `update-ca-trust` (Red Hat) is the right command.
+   The current `Dockerfile` should be inspected before assuming Debian.
+4. **HTTP/2 over the agent → mitmproxy hop.** Node's HTTP client
+   negotiates `h2` via ALPN. mitmproxy speaks `h2` to clients in recent
+   versions. Confirm the version we pin supports `h2` end-to-end and
+   doesn't downgrade to `http/1.1` (which would be a silent
+   performance regression).
+5. **Selective-bump policy surface.** Where does the
+   "tunnel this hostname blindly" decision live? Options: a field on
+   `bottle.egress` in the manifest, a fixed list of known-pinning
+   hosts baked into the mitmproxy config, or pipelock-side opt-out.
+   Manifest field is most consistent with the existing
+   `bottle.egress.allowlist` shape.
+6. **Image pin for mitmproxy.** The `pipelock-assessment.md`
+   recommendation is to pin by digest. The mitmproxy Docker Hub image
+   should be pinned the same way. Which release line? `mitmproxy/mitmproxy`
+   ships rolling and tagged versions; the tagged `:11.x` line is the
+   right baseline.
+7. **CA generation in Python (mitmproxy) vs.
+   as a separate step.**
+   mitmproxy generates a CA on first launch if none is provided. For
+   per-bottle ephemerality, we want the CA to be ours, not whatever
+   mitmproxy chooses, so generate the CA in the host-side prepare
+   step and inject it via `--certs *=...`. Mechanics need confirming:
+   in the mitmproxy docs, `--certs` appears to configure the server
+   certificate presented for matching domains, while a custom CA is
+   supplied as `mitmproxy-ca.pem` in the `confdir`, so the latter may
+   be the mechanism we actually want.
+8. **Domain fronting verification.** Once pipelock sees plaintext, it
+   has access to the inner `Host` / `:authority`. A new rule that
+   compares it against the outer `CONNECT` target catches domain
+   fronting. Worth a follow-up note on whether pipelock has such a
+   rule or whether we add it.
+
+---
+
+## References
+
+- mitmproxy: ,
+- mitmproxy `upstream_proxy` mode:
+- mitmproxy CA cert installation:
+- Squid `ssl_bump`:
+- Squid ICAP:
+- `goproxy`:
+- `gomitmproxy`:
+- `martian`:
+- Node TLS / `NODE_EXTRA_CA_CERTS`:
+- Python `SSL_CERT_FILE` and `REQUESTS_CA_BUNDLE`:
+- Prior research (pipelock assessment): `docs/research/pipelock-assessment.md`
+- Prior research (network egress guard): `docs/research/network-egress-guard.md`
+- Prior research (secret exfil tripwire encodings): `docs/research/secret-exfil-tripwire-encodings.md`
+
+Research conducted 2026-05-12.