perf(dlp): memoize encoded variants and linearize partial-window scan

Two per-request hot-path costs in the egress DLP scanner:

- `_encoded_variants` derived the full variant set (gzip + nine encodings) for every provisioned secret on every redaction and known-secret scan: once per host, path, header, and body. Cache it per distinct secret; callers still get a fresh list so they can't corrupt the shared cached tuple.
- `_find_partial_window` searched the text once per secret n-gram, giving O(len(secret) * len(text)). Build the secret's n-gram set once and sweep the text a single time: O(len(text)), no coverage loss.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
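
For readers without the scanner source at hand, here is a minimal sketch of the two techniques the message describes. It is not the actual implementation: the function names `encoded_variants` and `find_partial_window`, the three stand-in encodings, the window size `n=8`, and the "return first match index or -1" contract are all assumptions made for illustration.

```python
import base64
import functools
from typing import List, Tuple


@functools.lru_cache(maxsize=None)
def _variants_cached(secret: str) -> Tuple[str, ...]:
    """Derive the variants once per distinct secret (hypothetical encodings)."""
    raw = secret.encode("utf-8")
    candidates = [
        secret,
        base64.b64encode(raw).decode("ascii"),
        raw.hex(),
    ]
    # Deduplicate while preserving order; store as an immutable tuple.
    return tuple(dict.fromkeys(candidates))


def encoded_variants(secret: str) -> List[str]:
    """Return a fresh list so callers cannot mutate the shared cached tuple."""
    return list(_variants_cached(secret))


def find_partial_window(text: str, secret: str, n: int = 8) -> int:
    """Return the index of the first n-character window of `text` that is
    also an n-gram of `secret`, or -1 if there is none.

    The n-gram set is built once, then the text is swept a single time.
    Each window costs one slice plus one set lookup, so for a fixed n the
    sweep is linear in len(text) instead of one text search per n-gram.
    """
    if n <= 0 or len(secret) < n or len(text) < n:
        return -1
    grams = {secret[i:i + n] for i in range(len(secret) - n + 1)}
    for start in range(len(text) - n + 1):
        if text[start:start + n] in grams:
            return start
    return -1
```

Caching an immutable tuple and copying it into a list on each call keeps the memoized value safe from callers that append to or otherwise modify the result, which is the property the new `test_returns_fresh_list_each_call` test checks.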
2026-06-26 22:53:27 -04:00
parent ebbcae663c
commit 0bb47bd754
2 changed files with 49 additions and 10 deletions
@@ -281,6 +281,17 @@ class TestEncodedVariants(unittest.TestCase):
        v = self._variants()
        self.assertEqual(len(v), len(set(v)))

+    def test_repeated_calls_equal(self):
+        # Memoization must not change observable output.
+        self.assertEqual(self._variants(), self._variants())
+
+    def test_returns_fresh_list_each_call(self):
+        # Callers mutate/iterate the result; the cached set must not be
+        # exposed by reference, or one caller could corrupt another's view.
+        first = self._variants()
+        first.append("MUTATED")
+        self.assertNotIn("MUTATED", self._variants())
+

 class TestUnicodeNormalization(unittest.TestCase):
    def test_fullwidth_chars_normalized(self):