Rethink ingress prompt-injection detection (redundant with egress; default off + hidden-channel detectors) #313
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Context
The inbound
naive_injection_detectiondetector (dlp_detectors.scan_naive_injection) is regex phrase-matching — disclosure phrases near jailbreak phrases on response bodies. It catches dumb/automated attempts and is trivially bypassed by paraphrase, translation, encoding, or indirection. It's also on by default (Route.inbound_detectorsdefaults toNone→ all enabled).More fundamentally: ingress injection detection is redundant with the egress boundary. Whether an injection succeeds or not, the consequence (exfil, reaching a disallowed host) shows up at the outbound allowlist + DLP we already enforce. So this is defense-in-depth / telemetry at best, not prevention — and the value ceiling is low.
This issue tracks rethinking what (if anything) we invest here.
Options considered
Proposed direction (priority order)
naive_injection_detectionoff by default; redirect attention to the egress/exfil side, which contains the blast radius regardless of the prompt. (Also removes the last on-by-default consumer of the proximity scan.)Decision needed
Pick the default-state for
naive_injection_detection(lean: off-by-default) and whether to pursue the hidden-channel structural detectors (item 2).Background: came out of the quality-eval discussion alongside #312.