DLP hot-path perf + manifest load_for_agent split #310

Merged
didericis-claude merged 2 commits from dlp-perf-manifest-cleanup into main 2026-06-26 23:03:41 -04:00

Summary

Addresses the high- and medium-priority findings from a quality eval of the egress DLP scanner and the manifest loader.

  • perf(dlp): memoize encoded variants. _encoded_variants derived the full variant set (gzip + nine encodings) for every provisioned secret on every redaction and known-secret scan — once per host, path, header, and body. It is now cached per distinct secret; callers still get a fresh list so they can't corrupt the shared cached tuple.
  • perf(dlp): linearize the partial-window scan. _find_partial_window searched the text once per secret n-gram (O(len(secret) * len(text))). It now builds the secret's n-gram set once and sweeps the text a single time (O(len(text))), with no detection-coverage loss.
  • refactor(manifest): split load_for_agent. The ~100-line dual-mode method is split into _load_for_agent_eager / _load_for_agent_lazy behind a small dispatcher, with the duplicated git-user merge tail extracted into _manifest_with_merged_git_user. No behavior change.

Deferred (intentionally not bundled here)

Two playbook items from the eval are better handled as standalone PRs:

  • Decomposing the 600–800 line modules (egress_addon_core, cli/tui, supervise_server) is an architectural change that deserves its own reviewable diff.
  • Backfilling ~146 return annotations is a large mechanical sweep with low marginal value (pyright is already at 0 errors via inference).

Verification

  • Full unit suite: 1482 passed (2 new DLP cache regression tests added)
  • pyright: 0 errors
  • pylint: 9.93/10 (unchanged baseline)
didericis-claude added 2 commits 2026-06-26 22:54:14 -04:00
Two per-request hot-path costs in the egress DLP scanner:

- `_encoded_variants` derived the full variant set (gzip + nine
  encodings) for every provisioned secret on every redaction and
  known-secret scan — once per host, path, header, and body. Cache it
  per distinct secret; callers still get a fresh list so they can't
  corrupt the shared cached tuple.
- `_find_partial_window` searched the text once per secret n-gram,
  giving O(len(secret) * len(text)). Build the secret's n-gram set once
  and sweep the text a single time: O(len(text)), no coverage loss.
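
The single-sweep idea can be sketched as follows. This is an assumption-laden illustration rather than the actual `_find_partial_window`: the function name, the `n` parameter, its default of 8, and the `(start, end)` return shape are all hypothetical. The linear bound holds when the window size `n` is treated as a fixed constant, since each set lookup hashes an `n`-character slice.

```python
from typing import Optional


def find_partial_window(
    text: str, secret: str, n: int = 8
) -> Optional[tuple[int, int]]:
    """Return (start, end) of the first n-char window of text that is also
    a substring of secret, or None if there is no such window."""
    if n <= 0 or len(secret) < n or len(text) < n:
        return None
    # Build the secret's n-gram set once: O(len(secret)) slices.
    grams = {secret[i:i + n] for i in range(len(secret) - n + 1)}
    # Sweep the text a single time; each step is one set-membership check
    # instead of one full text search per n-gram.
    for start in range(len(text) - n + 1):
        if text[start:start + n] in grams:
            return start, start + n
    return None
```

Because the set holds every n-gram the per-n-gram search would have looked for, any window the old approach could find is still found; only the iteration order changes.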

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
refactor(manifest): split load_for_agent into eager/lazy methods
lint / lint (push) Successful in 2m18s
test / unit (pull_request) Successful in 1m1s
test / integration (pull_request) Successful in 28s
test / coverage (pull_request) Successful in 1m17s
2a67a85835
`ManifestIndex.load_for_agent` was a ~100-line method branching across
the eager (from_json_obj) and lazy (from disk) resolution modes, with
the git-user merge tail duplicated in both branches. Split into
`_load_for_agent_eager` / `_load_for_agent_lazy` behind a small
dispatcher and extract the shared tail into
`_manifest_with_merged_git_user`. No behavior change.
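
The shape of the split might look like the sketch below. Only the method names come from the commit message; the `Manifest` dataclass, the constructor parameters (`preloaded`, `read_from_disk`, `default_git_user`), and the merge rule (manifest values override defaults) are assumptions made up for illustration and do not reflect the real signatures.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Manifest:  # hypothetical stand-in for the real manifest type
    agent: str
    git_user: dict = field(default_factory=dict)


class ManifestIndex:
    def __init__(
        self,
        preloaded: Optional[dict[str, Manifest]] = None,
        read_from_disk: Optional[Callable[[str], Manifest]] = None,
        default_git_user: Optional[dict[str, str]] = None,
    ) -> None:
        self._preloaded = preloaded
        self._read_from_disk = read_from_disk
        self._default_git_user = default_git_user or {}

    def load_for_agent(self, agent: str) -> Manifest:
        """Small dispatcher: pick the resolution mode, nothing else."""
        if self._preloaded is not None:
            return self._load_for_agent_eager(agent)
        return self._load_for_agent_lazy(agent)

    def _load_for_agent_eager(self, agent: str) -> Manifest:
        # Eager mode: the manifest was already parsed up front.
        return self._manifest_with_merged_git_user(self._preloaded[agent])

    def _load_for_agent_lazy(self, agent: str) -> Manifest:
        # Lazy mode: resolve on demand (here via an injected callable).
        if self._read_from_disk is None:
            raise LookupError(f"no loader configured for agent {agent!r}")
        return self._manifest_with_merged_git_user(self._read_from_disk(agent))

    def _manifest_with_merged_git_user(self, manifest: Manifest) -> Manifest:
        # Shared tail that was previously duplicated in both branches.
        merged = {**self._default_git_user, **manifest.git_user}
        return replace(manifest, git_user=merged)
```

With the tail in one place, both modes produce the same merged result for the same input, which is what makes a "no behavior change" claim straightforward to test.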

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9
didericis-claude merged commit 121dc84b9f into main 2026-06-26 23:03:41 -04:00
didericis-claude deleted branch dlp-perf-manifest-cleanup 2026-06-26 23:03:49 -04:00