DLP hot-path perf + manifest load_for_agent split #310

2026-06-26T22:54:11-04:00

didericis-claude commented

Summary

Addresses the high- and medium-priority findings from a quality eval of the egress DLP scanner and the manifest loader.

- perf(dlp): memoize encoded variants. `_encoded_variants` derived the full variant set (gzip + nine encodings) for every provisioned secret on every redaction and known-secret scan, which runs once per host, path, header, and body. The result is now cached per distinct secret; callers still get a fresh list so they can't corrupt the shared cached tuple.
- perf(dlp): linearize the partial-window scan. `_find_partial_window` searched the text once per secret n-gram (O(len(secret) * len(text))). It now builds the secret's n-gram set once and sweeps the text a single time (O(len(text))), with no detection-coverage loss.
- refactor(manifest): split `load_for_agent`. The ~100-line dual-mode method is split into `_load_for_agent_eager` / `_load_for_agent_lazy` behind a small dispatcher, with the duplicated git-user merge tail extracted into `_manifest_with_merged_git_user`. No behavior change.

Deferred (intentionally not bundled here)

Two playbook items from the eval are better as standalone PRs:

- Decomposing the 600–800 line modules (`egress_addon_core`, `cli/tui`, `supervise_server`) is an architectural change that deserves its own reviewable diff.
- Backfilling ~146 return annotations is a large mechanical sweep with low marginal value (pyright is already at 0 errors via inference).

Verification

- Full unit suite: 1482 passed (2 new DLP cache regression tests added)
- pyright: 0 errors
- pylint: 9.93/10 (unchanged baseline)

didericis-claude added 2 commits 2026-06-26 22:54:14 -04:00

perf(dlp): memoize encoded variants and linearize partial-window scan 0bb47bd754

Two per-request hot-path costs in the egress DLP scanner:

- `_encoded_variants` derived the full variant set (gzip + nine
  encodings) for every provisioned secret on every redaction and
  known-secret scan — once per host, path, header, and body. Cache it
  per distinct secret; callers still get a fresh list so they can't
  corrupt the shared cached tuple.
- `_find_partial_window` searched the text once per secret n-gram,
  giving O(len(secret) * len(text)). Build the secret's n-gram set once
  and sweep the text a single time: O(len(text)), no coverage loss.
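
A rough sketch of the two scan strategies, for comparison. The signatures, the default window size `n=8`, and the "index of first hit or None" return value are assumptions for illustration; the actual `_find_partial_window` may differ. Note that the single sweep is linear in the text length only when the window size is a fixed constant, since each slice costs O(n).

```python
from __future__ import annotations


def find_partial_window_naive(text: str, secret: str, n: int = 8) -> int | None:
    """Old shape: one full text search per secret n-gram."""
    if n <= 0 or len(secret) < n or len(text) < n:
        return None
    hits = [
        pos
        for i in range(len(secret) - n + 1)
        if (pos := text.find(secret[i:i + n])) != -1
    ]
    return min(hits) if hits else None


def find_partial_window(text: str, secret: str, n: int = 8) -> int | None:
    """New shape: build the n-gram set once, then sweep the text a single time."""
    if n <= 0 or len(secret) < n or len(text) < n:
        return None
    grams = {secret[i:i + n] for i in range(len(secret) - n + 1)}
    for i in range(len(text) - n + 1):
        if text[i:i + n] in grams:  # average O(1) set lookup per window
            return i
    return None
```

Both functions report the earliest position in the text where any length-`n` slice of the secret appears, which is why the rewrite loses no coverage: the set holds exactly the n-grams the old loop searched for one at a time.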

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9

refactor(manifest): split load_for_agent into eager/lazy methods 2a67a85835

lint / lint (push) Successful in 2m18s
test / unit (pull_request) Successful in 1m1s
test / integration (pull_request) Successful in 28s
test / coverage (pull_request) Successful in 1m17s

`ManifestIndex.load_for_agent` was a ~100-line method branching across
the eager (from_json_obj) and lazy (from disk) resolution modes, with
the git-user merge tail duplicated in both branches. Split into
`_load_for_agent_eager` / `_load_for_agent_lazy` behind a small
dispatcher and extract the shared tail into
`_manifest_with_merged_git_user`. No behavior change.
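
The resulting structure looks roughly like the sketch below. Only the four method names come from this commit; the `Manifest` fields, the constructor arguments, the on-disk `<agent>.json` layout, and the merge rule (per-agent values override defaults) are hypothetical stand-ins chosen to keep the example runnable.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class Manifest:  # hypothetical, simplified
    agent: str
    git_user: dict[str, str] = field(default_factory=dict)


@dataclass
class ManifestIndex:
    default_git_user: dict[str, str]
    eager: dict[str, dict[str, Any]] | None = None  # set when built from a JSON object
    root: Path | None = None  # set when resolving from disk

    def load_for_agent(self, agent: str) -> Manifest:
        """Small dispatcher: pick the resolution mode, nothing else."""
        if self.eager is not None:
            return self._load_for_agent_eager(agent)
        return self._load_for_agent_lazy(agent)

    def _load_for_agent_eager(self, agent: str) -> Manifest:
        assert self.eager is not None
        return self._manifest_with_merged_git_user(agent, self.eager[agent])

    def _load_for_agent_lazy(self, agent: str) -> Manifest:
        if self.root is None:
            raise ValueError("lazy mode needs a root directory")
        path = self.root / f"{agent}.json"  # assumed file layout
        raw = json.loads(path.read_text(encoding="utf-8"))
        return self._manifest_with_merged_git_user(agent, raw)

    def _manifest_with_merged_git_user(self, agent: str, raw: dict[str, Any]) -> Manifest:
        """Shared tail that used to be duplicated in both branches."""
        merged = {**self.default_git_user, **raw.get("git_user", {})}
        return Manifest(agent=agent, git_user=merged)
```

With the tail in one place, both modes yield the same `Manifest` for the same input, which is the property the "no behavior change" claim rests on.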

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkwFXLFff9PYPy4wgVBJp9

didericis-claude merged commit 121dc84b9f into main

2026-06-26 23:03:41 -04:00

didericis-claude deleted branch dlp-perf-manifest-cleanup

2026-06-26 23:03:49 -04:00

Reference: didericis/bot-bottle#310