Rethink ingress prompt-injection detection (redundant with egress; default off + hidden-channel detectors) #313

Open
opened 2026-06-26 23:29:17 -04:00 by didericis-claude · 0 comments
Collaborator

Context

The inbound naive_injection_detection detector (dlp_detectors.scan_naive_injection) is regex phrase-matching — disclosure phrases near jailbreak phrases on response bodies. It catches dumb/automated attempts and is trivially bypassed by paraphrase, translation, encoding, or indirection. It's also on by default (Route.inbound_detectors defaults to None → all enabled).

More fundamentally: ingress injection detection is redundant with the egress boundary. Whether an injection succeeds or not, the consequence (exfil, reaching a disallowed host) shows up at the outbound allowlist + DLP we already enforce. So this is defense-in-depth / telemetry at best, not prevention — and the value ceiling is low.

This issue tracks rethinking what (if anything) we invest here.

Options considered

  • Regex/heuristics (current). Free, inline, near-zero recall vs adaptive adversaries. Tripwire only.
  • Fine-tuned encoder classifier (e.g. DeBERTa-class prompt-injection model). Best accuracy-per-cost of the model-based options: one forward pass, no generation, no corpus to curate. But imports model weights + a runtime (torch/onnx) into a sidecar that is currently stdlib-only / zero-runtime-deps / flat-bundled — large architectural tax.
  • Embedding + vector search. Cheaper than an LLM judge, but it's a signature DB: lags novel/paraphrased/encoded attacks (poor recall) and flags legitimate text discussing injection (poor precision). Prefer a classifier over this.
  • LLM judge. Highest raw accuracy but expensive on the proxy hot path, and the judge is itself injectable. Non-starter inline.

Proposed direction (priority order)

  1. Downgrade ingress detection to opt-in telemetry. Keep the regex but flip naive_injection_detection off by default; redirect attention to the egress/exfil side, which contains the blast radius regardless of the prompt. (Also removes the last on-by-default consumer of the proximity scan.)
  2. Cheap, no-model upgrade for a more real threat class: deterministic, stdlib-only detectors for hidden instruction channels — zero-width/bidi unicode, base64/hex blobs that decode to imperative text, HTML comments, hidden markdown. These catch the indirect-injection vectors an agent actually hits when fetching web/tool content, and fit the zero-dep bundle design.
  3. Out of scope here, noted for completeness: the genuinely effective defenses (trust-zoning/provenance, spotlighting/datamarking untrusted spans) live in the harness, not an egress proxy.

Decision needed

Pick the default-state for naive_injection_detection (lean: off-by-default) and whether to pursue the hidden-channel structural detectors (item 2).

Background: came out of the quality-eval discussion alongside #312.

## Context The inbound `naive_injection_detection` detector (`dlp_detectors.scan_naive_injection`) is regex phrase-matching — disclosure phrases near jailbreak phrases on response bodies. It catches dumb/automated attempts and is trivially bypassed by paraphrase, translation, encoding, or indirection. It's also **on by default** (`Route.inbound_detectors` defaults to `None` → all enabled). More fundamentally: ingress injection detection is **redundant with the egress boundary**. Whether an injection succeeds or not, the *consequence* (exfil, reaching a disallowed host) shows up at the outbound allowlist + DLP we already enforce. So this is defense-in-depth / telemetry at best, not prevention — and the value ceiling is low. This issue tracks rethinking what (if anything) we invest here. ## Options considered - **Regex/heuristics (current).** Free, inline, near-zero recall vs adaptive adversaries. Tripwire only. - **Fine-tuned encoder classifier** (e.g. DeBERTa-class prompt-injection model). Best accuracy-per-cost of the model-based options: one forward pass, no generation, no corpus to curate. But imports model weights + a runtime (torch/onnx) into a sidecar that is currently stdlib-only / zero-runtime-deps / flat-bundled — large architectural tax. - **Embedding + vector search.** Cheaper than an LLM judge, but it's a signature DB: lags novel/paraphrased/encoded attacks (poor recall) and flags legitimate text *discussing* injection (poor precision). Prefer a classifier over this. - **LLM judge.** Highest raw accuracy but expensive on the proxy hot path, and the judge is itself injectable. Non-starter inline. ## Proposed direction (priority order) 1. **Downgrade ingress detection to opt-in telemetry.** Keep the regex but flip `naive_injection_detection` **off by default**; redirect attention to the egress/exfil side, which contains the blast radius regardless of the prompt. (Also removes the last on-by-default consumer of the proximity scan.) 2. **Cheap, no-model upgrade for a more real threat class:** deterministic, stdlib-only detectors for *hidden instruction channels* — zero-width/bidi unicode, base64/hex blobs that decode to imperative text, HTML comments, hidden markdown. These catch the indirect-injection vectors an agent actually hits when fetching web/tool content, and fit the zero-dep bundle design. 3. **Out of scope here, noted for completeness:** the genuinely effective defenses (trust-zoning/provenance, spotlighting/datamarking untrusted spans) live in the harness, not an egress proxy. ## Decision needed Pick the default-state for `naive_injection_detection` (lean: off-by-default) and whether to pursue the hidden-channel structural detectors (item 2). Background: came out of the quality-eval discussion alongside #312.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: didericis/bot-bottle#313