# TLS interception for pipelock content scanning

Research into adding TLS termination ("MITM") to the egress path so that
pipelock's scanning pipeline can see plaintext HTTP request and response
bodies, instead of only the `CONNECT` host and opaque ciphertext.

## Summary

- Pipelock today sees `CONNECT` hostnames and the encrypted bytes that follow.
  Its DLP, subdomain-entropy, and MCP scanners cannot fire on TLS-encrypted
  bodies, which is the gap explicitly named under "Scope gaps" in
  `pipelock-assessment.md` ("Pipelock does not perform TLS inspection (no CA
  trust injection)").
- Closing that gap requires a TLS-terminating proxy that bumps `CONNECT`,
  presents a leaf certificate for the target hostname signed by a CA the
  bottle's trust store accepts, decrypts the inner HTTP, and re-establishes
  TLS to the real upstream.
- The mature open-source option is **mitmproxy**. Squid + `ssl_bump` is the
  heavier production-grade alternative. The Go ecosystem (`goproxy`,
  `gomitmproxy`, `martian`) is suitable only if we want a custom binary
  tightly coupled to pipelock.
- Recommended v1 topology: **mitmproxy in front of pipelock** on the same
  egress route. mitmproxy terminates client TLS, forwards plaintext to
  pipelock as its upstream HTTP proxy, and re-encrypts to the real upstream.
  Pipelock stays unchanged.
- Per-bottle ephemeral CA, generated at bottle start and destroyed on
  teardown. The CA private key lives only on the sidecar; the bottle's
  trust store only ever sees the public cert.
- Cert pinning is a known caveat but a small one given the narrow allowlist
  in this project. Selective bumping is the mitigation if a future
  allowlisted host turns out to pin.

---

## What pipelock cannot see today

The current egress topology (per `pipelock-assessment.md`):

```
agent --HTTPS_PROXY--> pipelock --CONNECT host:443--> internet
                                  \____________________________
                                       opaque TLS bytes
```

The agent's client (Claude Code, `curl`, an MCP server, a Python SDK)
sends `CONNECT api.anthropic.com:443`. Pipelock checks the hostname
against its `api_allowlist`, replies `200 Connection Established`, and
then blindly relays bytes between the two TCP halves. The TLS handshake
and everything inside it happens end-to-end between the agent and the
real upstream.
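
The hostname-only decision described above can be sketched in a few lines. This is an illustration, not pipelock's code: `ALLOWLIST`, `parse_connect`, and `tunnel_reply` are hypothetical names, and the real allowlist lives in pipelock's `api_allowlist` config.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for pipelock's `api_allowlist`.
ALLOWLIST = {"api.anthropic.com", "statsig.anthropic.com"}


def parse_connect(request_line: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) for a CONNECT request line, or None if malformed."""
    parts = request_line.strip().split()
    if len(parts) != 3 or parts[0].upper() != "CONNECT":
        return None
    host, sep, port = parts[1].rpartition(":")
    if not sep or not host or not (port.isascii() and port.isdigit()):
        return None
    return host.lower(), int(port)


def tunnel_reply(request_line: str, allowlist=ALLOWLIST) -> str:
    """Decide on the hostname alone; nothing inside the tunnel is visible."""
    target = parse_connect(request_line)
    if target is None:
        return "HTTP/1.1 400 Bad Request\r\n\r\n"
    host, _port = target
    if host not in allowlist:
        return "HTTP/1.1 403 Forbidden\r\n\r\n"
    return "HTTP/1.1 200 Connection Established\r\n\r\n"
```

The only input to the decision is the request line; once `200` is sent, every later byte is ciphertext to the proxy.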

What pipelock can scan in this mode:

- `CONNECT` target hostname (SNI is not even needed).
- TLS record framing and lengths (useful for budgets, useless for DLP).
- Plain HTTP/1.1 to non-HTTPS destinations (irrelevant — there are none
  in `DEFAULT_ALLOWLIST`).

What pipelock cannot scan in this mode:

- Request URL, method, headers, body.
- Response status, headers, body.
- MCP JSON-RPC payloads inside the TLS session.
- WebSocket frames inside a TLS-wrapped upgrade.
- Whether the inner SNI or HTTP `Host` / `:authority` matches the
  outer `CONNECT` target (domain-fronting check).

The 48-pattern DLP layer, the subdomain-entropy check (insofar as it
inspects URLs rather than DNS-resolver queries), the request-redaction
feature added in v2.3.0, and bidirectional MCP scanning all require
plaintext to operate on. Without TLS termination, those layers are
inert against any HTTPS destination — which is every destination in
the current allowlist.

---

## How TLS interception works

The mechanics of `CONNECT` bumping, end to end:

1. **Agent issues `CONNECT`.** The HTTP client sees `HTTPS_PROXY` set,
   so it opens a TCP connection to the proxy and sends
   `CONNECT api.anthropic.com:443 HTTP/1.1`.
2. **Proxy answers `200`.** Standard tunnel-established response.
3. **Proxy starts TLS as the server.** Instead of relaying bytes, the
   proxy itself performs a TLS handshake with the agent. It needs a
   server certificate for `api.anthropic.com` — so on first contact for
   that hostname, the proxy generates a leaf certificate with
   `CN=api.anthropic.com` and a SAN for the same, signs it with its
   own CA private key, and presents that cert. Subsequent connections
   to the same hostname reuse the cached leaf.
4. **Agent verifies the cert.** The agent's TLS library walks the chain
   to a trusted root. Because the bottle's trust store contains the
   proxy's CA cert, validation succeeds. The agent has no way to tell
   it isn't talking to the real `api.anthropic.com`.
5. **Proxy opens its own TLS to the real upstream.** As a client this
   time, using the system root store, talking to the real
   `api.anthropic.com`. Real SNI, real cert chain validated normally.
6. **Proxy bridges the two TLS sessions.** Decrypts on the server side,
   re-encrypts on the client side, and scans the plaintext in between.

This is what every TLS-terminating egress proxy does. The trade-offs
live in three places:

- **CA trust injection.** Step 4 only works if the bottle's trust
  store contains the proxy's CA. Mechanics covered under "CA lifecycle"
  below.
- **Cert generation cost.** Generating an RSA-2048 leaf cert takes
  ~50 ms; ECDSA P-256 is ~5 ms. Cache leaves per (hostname, SAN list)
  to keep this off the steady-state hot path.
- **Protocol coverage.** The proxy needs to speak HTTP/1.1, HTTP/2 (ALPN
  `h2`), and ideally WebSocket. HTTP/3 / QUIC is UDP and requires a
  separate code path; for v1, blocking UDP/443 at the iptables layer
  forces clients to fall back to HTTP/2, which we can inspect.
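
The leaf-caching point can be illustrated with a small sketch. It assumes a signer callback (`issue`, hypothetical) that returns PEM bytes for a hostname and SAN set. The real signing would use the CA key and is deliberately left out.

```python
import threading
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

CacheKey = Tuple[str, FrozenSet[str]]


class LeafCache:
    """Cache issued leaves by (hostname, SAN set) so signing happens once per key."""

    def __init__(self, issue: Callable[[str, FrozenSet[str]], bytes]):
        self._issue = issue  # hypothetical signer: returns PEM bytes
        self._lock = threading.Lock()
        self._leaves: Dict[CacheKey, bytes] = {}

    def get(self, hostname: str, sans: Iterable[str] = ()) -> bytes:
        host = hostname.lower()
        # The hostname itself is always part of the SAN set.
        key = (host, frozenset(s.lower() for s in sans) | {host})
        # Holding the lock while signing serializes issuance; simple, and
        # acceptable when the allowlist is only a handful of hosts.
        with self._lock:
            if key not in self._leaves:
                self._leaves[key] = self._issue(*key)
            return self._leaves[key]
```

With a narrow allowlist the cache stays tiny, so no eviction policy is sketched here.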

---

## Tools

### mitmproxy

- **What it is.** Python (with some Rust components) interactive HTTPS proxy.
  Reference open-source implementation of the bump pattern. Ships as
  `mitmproxy` (TUI), `mitmweb` (browser UI), and `mitmdump` (headless).
- **Cert handling.** Generates a CA on first run under `~/.mitmproxy/`.
  Per-host leaves are generated on demand and cached in memory. Cert
  cache keyed by (hostname, SAN extensions inferred from upstream cert).
- **Protocols.** HTTP/1.1, HTTP/2, WebSocket fully supported. HTTP/3
  exists as experimental. Raw TCP / non-HTTP TLS supported via
  `--mode reverse:` but not in CONNECT-bump mode.
- **Extensibility.** Python addon API. An addon module can inspect or
  modify any `request` / `response` / `tcp_message` flow. The pipelock
  integration in Topology D below uses this.
- **Selective bumping.** `ignore_hosts` regex; matching CONNECTs are
  tunneled blindly instead of bumped. Critical for the cert-pinning
  mitigation.
- **Docker image.** `mitmproxy/mitmproxy` on Docker Hub. Single binary
  for the CLI, ~80 MB image. Configurable via flags or `~/.mitmproxy/config.yaml`.
- **Project URL.** <https://mitmproxy.org>, <https://github.com/mitmproxy/mitmproxy>.

Most mature, best-documented, lowest-effort integration. Default choice
for v1.

### Squid + ssl_bump

- **What it is.** Squid is a long-running C++ caching proxy.
  `ssl_bump` is its TLS-interception feature, controlled by per-CONNECT
  actions: `splice` (tunnel blindly), `bump` (decrypt and re-encrypt),
  `peek` (inspect the TLS hello or server cert while keeping the option
  to splice), `stare` (the same inspection while keeping the option to
  bump), `terminate` (abort the connection).
- **Cert handling.** Configured via `sslcrtd_program` — a helper that
  generates and caches per-host certs. CA cert and key referenced by
  PEM paths in `squid.conf`.
- **Protocols.** HTTP/1.1 only; Squid does not implement HTTP/2, so
  `h2`-capable clients fall back to HTTP/1.1 through it. There is no
  scripted addon API.
- **Extensibility.** ICAP (Internet Content Adaptation Protocol) for
  external scanners: Squid sends each request (`REQMOD`) and response
  (`RESPMOD`) to an ICAP service that can modify or reject it. This is
  the formal version of Topology D below.
- **Production track record.** Used at corporate-proxy scale (large
  enterprises, ISPs). Heavyweight for a single-bottle sidecar.
- **Project URL.** <https://wiki.squid-cache.org/Features/SslPeekAndSplice>.

Right tool if pipelock grows an ICAP server endpoint. Otherwise, more
config surface than this project needs.

### Go libraries: goproxy, gomitmproxy, martian

- **`goproxy`** (elazarl) — long-lived Go library, basic CONNECT-bumping
  proxy with a handler API. Sparse on HTTP/2.
  <https://github.com/elazarl/goproxy>
- **`gomitmproxy`** (AdGuard) — newer, cleaner API; built for AdGuard
  Home / DNS-filtering products. HTTP/2 support is partial.
  <https://github.com/AdguardTeam/gomitmproxy>
- **`martian`** (Google) — request/response modifier framework with a
  JSON-configurable rule engine. Used internally at Google; public
  ecosystem thin.
  <https://github.com/google/martian>

These are relevant only if we decide to write a custom TLS-terminating
binary that links pipelock's scanning packages directly — Topology C
below. They are not a faster route to the v1 sidecar than mitmproxy;
they are smaller and more direct, at the cost of writing more Go.

### Disqualified

- **Caddy, Envoy, HAProxy.** All can terminate TLS at a reverse-proxy
  vhost. None ship a "bump on CONNECT and forward plaintext to a
  downstream proxy" mode out of the box. Adapting any of them to this
  shape is more work than starting from mitmproxy.
- **Cloudflare Gateway, Zscaler, NetSkope, Forcepoint.** Managed cloud
  egress with TLS inspection. Wrong topology — they live outside the
  host, not as a per-bottle sidecar, and they require trusting a vendor
  with full plaintext.
- **Charles Proxy, Burp Suite.** Closed-source GUI tools for developer
  capture and security testing. Not appropriate as headless sidecars.
- **`mitmdump` standalone vs. embedding mitmproxy as a library.** Not a
  separate option: both are mitmproxy. Listed only to note that the
  project ships both a CLI and a Python API, and addons can be loaded
  either way.

---

## Topologies

Five candidate topologies, ordered roughly from least to most coupled
between the two components.

### A — mitmproxy in front of pipelock (recommended)

```
agent --HTTPS_PROXY--> mitmproxy --HTTP_PROXY--> pipelock --> internet
                       (bump TLS)               (scan plain)  (real TLS)
```

mitmproxy terminates the agent's TLS connection, decrypts, and then
forwards the inner HTTP request to pipelock by treating pipelock as
its own upstream HTTP forward proxy. Pipelock receives plaintext HTTP
exactly as if the agent had used HTTP and applies its full scanning
pipeline. The request still has to be re-encrypted toward the real
destination after scanning; which component originates that TLS
session depends on the forwarding mode discussed under "Open question"
below.

Concretely, the agent's `HTTPS_PROXY` points at mitmproxy; mitmproxy
runs in upstream mode (`--mode upstream:<pipelock URL>`) pointing at
pipelock; pipelock's network reach includes the real internet.

- **Wins.** Pipelock unchanged. mitmproxy unchanged from default
  configuration. Each component has one job. Failure modes are clear
  per layer.
- **Costs.** Two sidecars per bottle instead of one. One extra
  decrypt / re-encrypt hop, ~5–15 ms per request in steady state.
- **Open question.** How exactly mitmproxy forwards to pipelock matters
  for whether pipelock sees TLS again or only HTTP. For an HTTPS
  destination, mitmproxy's `upstream` mode wraps the request in another
  `CONNECT` to the upstream proxy and re-encrypts inside it, so pipelock
  would see ciphertext again, defeating the point. Candidate fixes are
  `upstream` mode with TLS re-origination disabled, or `regular` mode
  with a chained proxy; neither is confirmed. mitmproxy's mode handling
  has changed across major releases, so this needs verification against
  the current docs at integration time.

### B — pipelock in front of mitmproxy (ruled out)

```
agent --HTTPS_PROXY--> pipelock --CONNECT?--> mitmproxy --> internet
                       (sees CONNECT only)   (bump TLS)
```

Pipelock would receive a `CONNECT` and decide to allow or deny based
on hostname, then tunnel to mitmproxy. mitmproxy would terminate TLS
and see plaintext — but pipelock would never see the plaintext, which
is the whole point of the exercise. The scanning still happens (in
mitmproxy), but it isn't pipelock doing it, so we'd need an entirely
different rule engine. Ruled out.

### C — Extend pipelock itself to terminate TLS

Two sub-variants:

**C.1 — Upstream a `tls_terminate` mode.** Submit a feature to
pipelock that adds CONNECT bumping and per-host cert generation in Go,
using `crypto/tls` and the existing scanning packages. Pipelock becomes
a self-contained MITM proxy. License question matters here: the Apache
2.0 core can grow new features in-tree, but if upstream insists this
belongs in `enterprise/` (ELv2), we either accept ELv2 or fork.

**C.2 — Wrap pipelock in a thin Go binary in the same container.** A
small Go program does the TLS half (`CONNECT` parsing, cert generation,
TLS handshake) and pipes plaintext to pipelock over UDS or loopback.
The wrapper is ours; pipelock is unmodified. No license question.

- **Wins.** Single component on the egress path. Pipelock owns the
  scanning end-to-end, including domain-fronting checks (SNI vs.
  `Host` vs. `CONNECT`).
- **Costs.** Real Go engineering effort. CA generation, cert caching,
  TLS handshake, HTTP/2 ALPN negotiation, WebSocket upgrade — all
  things mitmproxy already solves.
- **When.** Right shape for v2 or v3 once the v1 mitmproxy-in-front
  topology has proven the integration works and the scanning rules are
  stable.

### D — mitmproxy as the proxy, pipelock as a content-scan subroutine

```
agent --HTTPS_PROXY--> mitmproxy --> internet
                       (bump TLS)
                          |
                          v
                       POST /scan to pipelock
                       <- allow / block / redact
```

A Python addon in mitmproxy sends each decrypted request (and response)
to a pipelock HTTP `/scan` endpoint and gates the flow on the verdict.
mitmproxy handles all networking; pipelock is the rule engine only.

- **Wins.** Clean separation of concerns. Pipelock doesn't have to
  speak TLS at all. The addon is small, ~100 lines of Python.
- **Costs.** Requires pipelock to expose a scan API. The current Apache
  2.0 core does not document one. If `/scan` lives in `enterprise/`,
  ELv2 applies. If it doesn't exist, we'd be asking pipelock for a new
  surface.
- **Variant.** Squid's ICAP path is the formalized version of the same
  pattern.
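
A minimal sketch of what the Topology D addon could look like, under loud assumptions: the `/scan` URL, its JSON request shape, and the `{"action": ...}` verdict format are all hypothetical, since pipelock documents no such endpoint. Only the request hook is shown; a response hook and redaction handling would follow the same pattern.

```python
import json
import urllib.request

try:  # mitmproxy is only importable inside the sidecar
    from mitmproxy import http
except ImportError:
    http = None

# Hypothetical endpoint; pipelock's Apache 2.0 core documents no scan API.
SCAN_URL = "http://pipelock:8888/scan"


def ask_pipelock(payload: dict, url: str = SCAN_URL, timeout: float = 2.0) -> dict:
    """POST a JSON description of the request to the hypothetical /scan API."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def is_blocked(verdict: dict) -> bool:
    """Fail closed: anything other than an explicit allow is a block."""
    return verdict.get("action") != "allow"


class PipelockGate:
    def __init__(self, scan=ask_pipelock):
        self.scan = scan

    def request(self, flow):  # mitmproxy calls this hook per client request
        payload = {
            "method": flow.request.method,
            "url": flow.request.pretty_url,
            "body": flow.request.get_text(strict=False) or "",
        }
        try:
            verdict = self.scan(payload)
        except Exception:
            verdict = {}  # scanner unreachable: fail closed
        if is_blocked(verdict):
            # Setting a response here short-circuits the upstream request.
            flow.response = http.Response.make(403, b"blocked by pipelock")


addons = [PipelockGate()]
```

Loaded with `mitmdump -s <file>`, the gate blocks whenever the scanner is unreachable. Failing closed seems the safer default for an egress guard, though it couples agent availability to pipelock's.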

### E — Single container, two processes

mitmproxy and pipelock share a container, started by `supervisord` or
`s6-overlay`. Networking simplifies to localhost. Lifecycle complicates:
container restart now means restarting both; failure of one process is
not visible at the Docker layer; logs interleave.

- **Wins.** Slightly less Docker plumbing in `cli.py`.
- **Costs.** Operational complexity not worth the savings. mitmproxy
  and pipelock are independent processes with independent failure
  modes; separate Docker containers are the right tool for that.

Net: not recommended.

---

## CA lifecycle

The CA private key is the asset to defend. With it, anyone can issue
certs that the bottle's trust store will accept for any hostname. So:

**Per-bottle ephemeral CA.** At bottle start, generate a fresh
RSA-2048 or ECDSA-P256 CA inside the mitmproxy sidecar. Export only
the public cert (PEM) into the bottle's trust store at one of:

- `/usr/local/share/ca-certificates/claude-bottle-mitm.crt` followed by
  `update-ca-certificates` (Debian/Ubuntu base images).
- `/etc/pki/ca-trust/source/anchors/` with `update-ca-trust`
  (Red-Hat-family).
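- `$NODE_EXTRA_CA_CERTS` for Node-based agents (Claude Code). This is
  additive to Node's bundled roots.
- `$SSL_CERT_FILE` / `$REQUESTS_CA_BUNDLE` for Python SDKs. These
  replace the default bundle, so they must point at a bundle that
  contains the system roots plus the bottle CA.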

The private key never leaves the sidecar's filesystem. The CA's public
certificate is the only artifact that crosses into the bottle.

On bottle teardown, the sidecar container is destroyed; the CA dies
with it. The next bottle gets a fresh CA. No long-lived MITM CA on
disk.

**Why not a shared per-host CA.** A persistent CA across bottles is
faster (no generation at start) but is a real liability. The public
cert is harmless on its own (every bottle holds it in its trust store
by design), but a single leak of the shared private key would let an
attacker on the host network impersonate any host to every bottle that
trusts that CA, for as long as the CA lives. With a per-bottle CA, a
leaked key works against one bottle only and dies with it in minutes.

**Generation cost.** RSA-2048 CA generation is ~200 ms; ECDSA-P256 is
~5 ms. Either is irrelevant against the per-bottle Docker pull and
network setup cost.

**Where the CA lives in the bottle's trust store.** Both: a
distribution-standard path with `update-ca-certificates`, and the
env-var path. Belt and suspenders, because some Node and Python
libraries honor the env vars only, and some load only `/etc/ssl/certs/`
directly.
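
A sketch of the env-var half of that belt-and-suspenders setup, assuming a Debian/Ubuntu base image. `trust_env` and `docker_env_args` are hypothetical helper names, and `/etc/ssl/certs/ca-certificates.crt` is the bundle that `update-ca-certificates` rebuilds on that family.

```python
from typing import Dict, List

# Assumed Debian/Ubuntu locations; adjust for the actual base image.
CA_DROP_PATH = "/usr/local/share/ca-certificates/claude-bottle-mitm.crt"
SYSTEM_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def trust_env(ca_path: str = CA_DROP_PATH, bundle: str = SYSTEM_BUNDLE) -> Dict[str, str]:
    """Environment for the agent container so Node and Python trust the bottle CA."""
    return {
        # Additive: Node appends this file to its built-in roots.
        "NODE_EXTRA_CA_CERTS": ca_path,
        # Replacing: these must name a bundle that already contains the
        # system roots plus the bottle CA, i.e. the regenerated system bundle.
        "SSL_CERT_FILE": bundle,
        "REQUESTS_CA_BUNDLE": bundle,
    }


def docker_env_args(env: Dict[str, str]) -> List[str]:
    """Flatten to `docker run` style arguments: -e KEY=VALUE ..."""
    args: List[str] = []
    for key in sorted(env):
        args += ["-e", f"{key}={env[key]}"]
    return args
```

Pointing `SSL_CERT_FILE` at the single CA file instead of the full bundle would make the bottle CA the only trusted root for those clients. That happens to work while every connection is bumped, but it breaks any host that is tunneled blindly.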

---

## Cert pinning (brief)

A client that pins ignores the trust store and refuses any cert whose
public key isn't on a hardcoded list. Three observations for this
project:

- The current `DEFAULT_ALLOWLIST` (`api.anthropic.com`,
  `statsig.anthropic.com`, `sentry.io`, `claude.ai`,
  `platform.claude.com`, `downloads.claude.ai`,
  `raw.githubusercontent.com`) does not appear to include any host
  whose server-side clients pin. Server-side SDKs (Node, Python) almost
  universally honor system trust and `NODE_EXTRA_CA_CERTS` /
  `SSL_CERT_FILE`. Pinning is mostly a mobile-SDK and browser practice;
  we don't run those.
- If a future allowlisted host turns out to pin, the mitigation is
  selective bumping via mitmproxy `ignore_hosts`: that specific
  hostname tunnels blindly and pipelock loses DLP coverage for it.
  Coverage on every other host is unaffected.
- The cost of finding out: a single 5-minute test before adding a host
  — point mitmproxy at the host, observe whether the client succeeds.

Not a v1 blocker. Document the failure mode and the mitigation.
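
If selective bumping is ever needed, the `ignore_hosts` value has to be an anchored regex, or it will also exempt look-alike hostnames. A small sketch of building one, assuming mitmproxy matches the pattern against `host:port` strings (worth confirming for the pinned version); `ignore_hosts_pattern` is a hypothetical helper.

```python
import re
from typing import Iterable


def ignore_hosts_pattern(hosts: Iterable[str], port: int = 443) -> str:
    """Build one anchored regex that matches exactly the listed host:port pairs."""
    escaped = sorted(re.escape(h.lower()) for h in hosts)
    if not escaped:
        raise ValueError("at least one hostname is required")
    return r"^(?:%s):%d$" % ("|".join(escaped), port)
```

For example, `ignore_hosts_pattern(["pinned.example.com"])` yields a pattern that matches `pinned.example.com:443` but not `pinned.example.com.evil.test:443`.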

---

## Comparison table

| | A: mitmproxy → pipelock | B: pipelock → mitmproxy | C: TLS in pipelock | D: mitmproxy + scan API | E: one container |
|---|---|---|---|---|---|
| Pipelock sees plaintext | yes | no | yes | yes (via /scan) | yes |
| Code change to pipelock | none | none | substantial (C.1); none (C.2) | adds /scan endpoint | none |
| Sidecar count | 2 | 2 | 1 | 2 | 1 |
| Cert generation owner | mitmproxy | mitmproxy | pipelock | mitmproxy | mitmproxy |
| Selective bumping | mitmproxy `ignore_hosts` | mitmproxy `ignore_hosts` | pipelock config | mitmproxy `ignore_hosts` | mitmproxy `ignore_hosts` |
| Failure isolation per process | yes | yes | n/a (one process) | yes | no (shared container) |
| License question | none | none | ELv2 risk (C.1 only) | ELv2 risk | none |
| v1 effort | low | low (but pointless) | high | medium | low |
| Long-term shape | interim | n/a | best | possible | not recommended |

---

## Recommendation

**Adopt Topology A for v1.** Add a mitmproxy sidecar to the egress
topology, in front of pipelock on the same per-bottle internal network.
The agent's `HTTPS_PROXY` points at mitmproxy; mitmproxy's upstream is
pipelock; pipelock's upstream is the real internet.

Concretely:

1. Add a `MitmproxyProxy` class alongside `PipelockProxy`, with the
   same `prepare` / `start` / `stop` lifecycle. The class generates
   a per-bottle CA in `stage_dir`, exports the public cert into a
   second file, and writes a mitmproxy config that:
   - bumps every CONNECT by default
   - sets the mode to `upstream:http://pipelock-<slug>:<port>`
   - listens on a known port inside the per-bottle internal network
2. Extend the bottle launch step to copy the CA public cert into the
   agent container under
   `/usr/local/share/ca-certificates/claude-bottle-mitm.crt`, run
   `update-ca-certificates`, and set `NODE_EXTRA_CA_CERTS` /
   `SSL_CERT_FILE` / `REQUESTS_CA_BUNDLE` accordingly.
3. Repoint the agent's `HTTPS_PROXY` and `HTTP_PROXY` from the pipelock
   container to the mitmproxy container.
4. Verify mitmproxy's upstream-proxy mode forwards plaintext (not a
   re-wrapped CONNECT) to pipelock; if not, use `regular` mode with a
   chained proxy directive.
5. Test that pipelock's DLP, subdomain-entropy, and MCP scanners now
   fire on real request bodies for `api.anthropic.com` traffic.
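
As a rough sketch of what step 1 might emit, the sidecar's command line could be assembled like this. The function name, default ports, and `confdir` path are assumptions. The flags used (`--mode`, `--listen-port`, `--set confdir=`, `--ignore-hosts`) are existing `mitmdump` options, but whether `upstream` mode delivers plaintext to pipelock is exactly the point step 4 has to verify.

```python
from typing import List, Sequence


def mitmdump_args(
    slug: str,
    pipelock_port: int,
    listen_port: int = 8080,
    confdir: str = "/home/mitmproxy/.mitmproxy",  # assumed image default
    ignore_hosts: Sequence[str] = (),
) -> List[str]:
    """Command line for the per-bottle mitmproxy sidecar (values are assumptions)."""
    args = [
        "mitmdump",
        # Chain to pipelock as the upstream HTTP proxy.
        "--mode", f"upstream:http://pipelock-{slug}:{pipelock_port}",
        "--listen-port", str(listen_port),
        # mitmproxy reads (or creates) its CA files under confdir.
        "--set", f"confdir={confdir}",
    ]
    # Each pattern exempts matching CONNECTs from bumping.
    for pattern in ignore_hosts:
        args += ["--ignore-hosts", pattern]
    return args
```

Building the argument list in one place keeps the pinned mitmproxy version's flag names out of the rest of the launcher.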

**Defer Topologies C and D.** Topology C (extending pipelock to
terminate TLS) is the cleanest long-term shape but is a substantial
build and runs into the Apache 2.0 vs. ELv2 question. Topology D
(mitmproxy with pipelock as a scan API) is attractive but requires a
pipelock surface that doesn't exist today. Both are valid v2 targets;
neither is the right starting point.

The `network-egress-guard.md` v1 iptables + dnsmasq layer remains
necessary alongside this — TLS interception covers HTTP/HTTPS only;
raw TCP, UDP/443 (QUIC), UDP/53 (DNS), and ICMP still need the
IP-level default-deny.

---

## Open questions

1. **mitmproxy upstream-proxy mode mechanics.** Does mitmproxy in
   upstream mode (`--mode upstream:<proxy URL>`) forward decrypted HTTP
   plaintext to the upstream, or does it wrap it in a new CONNECT? The
   documented behavior changed between mitmproxy 8 and 10. Needs
   verification against the version we pin.
2. **Pipelock's behavior when receiving plain HTTP.** Pipelock's
   `forward_proxy.enabled: true` accepts both `GET http://...` (plain
   HTTP) and `CONNECT host:443` (HTTPS). After Topology A is wired up,
   pipelock will see only plain HTTP — does its DLP / MCP scanning
   pipeline run the full set of layers, or are some gated on the
   CONNECT path? Confirm by reading
   `github.com/luckyPipewrench/pipelock/blob/main/docs/configuration.md`.
3. **CA installation in the Anthropic-provided Claude Code Docker image.**
   The base image's distribution determines whether `update-ca-certificates`
   (Debian/Ubuntu) or `update-ca-trust` (Red Hat) is the right command.
   The current `Dockerfile` should be inspected before assuming Debian.
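4. **HTTP/2 over the agent → mitmproxy hop.** Node clients negotiate
   `h2` via ALPN only when they use the `http2` module or an
   HTTP/2-capable library; the built-in `https` module speaks HTTP/1.1.
   mitmproxy speaks `h2` to clients in recent versions. Confirm which
   protocol the agent actually negotiates, and that the version we pin
   supports `h2` end-to-end and doesn't silently downgrade to
   `http/1.1` (which would be a performance regression).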
5. **Selective-bump policy surface.** Where does the
   "tunnel this hostname blindly" decision live? Options: a field on
   `bottle.egress` in the manifest, a fixed list of known-pinning
   hosts baked into the mitmproxy config, or pipelock-side opt-out.
   Manifest field is most consistent with the existing
   `bottle.egress.allowlist` shape.
6. **Image pin for mitmproxy.** The `pipelock-assessment.md`
   recommendation is to pin by digest. The mitmproxy Docker Hub image
   should be pinned the same way. Which release line? `mitmproxy/mitmproxy`
   ships rolling and tagged versions; the tagged `:11.x` line is the
   right baseline.
7. **CA generation in Python (mitmproxy) vs. as a separate step.**
   mitmproxy generates a CA on first launch if none is provided. For
   per-bottle ephemerality, we want the CA to be ours, not whatever
   mitmproxy chooses, so generate the CA in the host-side prepare step
   and hand it to mitmproxy as `mitmproxy-ca.pem` (key plus cert) in a
   directory selected with `--set confdir=...`. Note that `--certs`
   supplies custom leaf certificates, not the signing CA. Mechanics
   need confirming.
8. **Domain fronting verification.** Once pipelock sees plaintext, it
   has access to the inner `Host` / `:authority`. A new rule that
   compares it against the outer `CONNECT` target catches domain
   fronting. Worth a follow-up note on whether pipelock has such a
   rule or whether we add it.
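
The domain-fronting comparison in question 8 is simple enough to sketch. This assumes the outer `CONNECT` target is still available wherever the rule runs (in Topology A, mitmproxy would have to pass it along, which is not confirmed). The helper names are hypothetical and IPv6 literals are not handled.

```python
from typing import Tuple


def split_host_port(value: str, default_port: int = 443) -> Tuple[str, int]:
    """Normalize 'host' or 'host:port' to (lowercase host, port)."""
    value = value.strip().lower()
    name, sep, maybe_port = value.rpartition(":")
    if sep and maybe_port.isascii() and maybe_port.isdigit():
        return name.rstrip("."), int(maybe_port)
    return value.rstrip("."), default_port


def is_fronted(connect_target: str, inner_authority: str) -> bool:
    """True when the inner Host / :authority names a different origin than CONNECT."""
    return split_host_port(connect_target) != split_host_port(inner_authority)
```

A strict equality check like this would flag any mismatch at all. Whether some mismatches are legitimate for allowlisted hosts is a policy question the sketch does not answer.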

---

## References

- mitmproxy: <https://mitmproxy.org>, <https://github.com/mitmproxy/mitmproxy>
- mitmproxy upstream proxy mode: <https://docs.mitmproxy.org/stable/concepts/modes/#upstream-proxy>
- mitmproxy CA cert installation: <https://docs.mitmproxy.org/stable/concepts/certificates/>
- Squid `ssl_bump`: <https://wiki.squid-cache.org/Features/SslPeekAndSplice>
- Squid ICAP: <https://wiki.squid-cache.org/Features/ICAP>
- `goproxy`: <https://github.com/elazarl/goproxy>
- `gomitmproxy`: <https://github.com/AdguardTeam/gomitmproxy>
- `martian`: <https://github.com/google/martian>
- Node TLS / `NODE_EXTRA_CA_CERTS`: <https://nodejs.org/api/cli.html#node_extra_ca_certsfile>
- Python `SSL_CERT_FILE` and `REQUESTS_CA_BUNDLE`: <https://docs.python.org/3/library/ssl.html#ssl.SSLContext.load_verify_locations>
- Prior research — pipelock assessment: `docs/research/pipelock-assessment.md`
- Prior research — network egress guard: `docs/research/network-egress-guard.md`
- Prior research — secret exfil tripwire encodings: `docs/research/secret-exfil-tripwire-encodings.md`

Research conducted 2026-05-12.