perf(dlp): memoize encoded variants and linearize partial-window scan

Two per-request hot-path costs in the egress DLP scanner:

- `_encoded_variants` derived the full variant set (gzip + nine
  encodings) for every provisioned secret on every redaction and
  known-secret scan — once per host, path, header, and body. Cache it
  per distinct secret; callers still get a fresh list so they can't
  corrupt the shared cached tuple.
- `_find_partial_window` searched the text once per secret n-gram,
  giving O(len(secret) * len(text)). Build the secret's n-gram set once
  and sweep the text a single time: O(len(text)), no coverage loss.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
This commit is contained in:
2026-06-26 22:53:27 -04:00
parent ebbcae663c
commit 0bb47bd754
2 changed files with 49 additions and 10 deletions
+11
View File
@@ -281,6 +281,17 @@ class TestEncodedVariants(unittest.TestCase):
v = self._variants()
self.assertEqual(len(v), len(set(v)))
def test_repeated_calls_equal(self):
# Memoization must not change observable output.
self.assertEqual(self._variants(), self._variants())
def test_returns_fresh_list_each_call(self):
# Callers mutate/iterate the result; the cached set must not be
# exposed by reference, or one caller could corrupt another's view.
first = self._variants()
first.append("MUTATED")
self.assertNotIn("MUTATED", self._variants())
class TestUnicodeNormalization(unittest.TestCase):
def test_fullwidth_chars_normalized(self):