Out-of-band egress enforcement & cost-control plane (forced cutoff + remote dashboard) #251

Open
opened 2026-06-23 20:45:45 -04:00 by didericis · 9 comments
Owner

Summary

Add an out-of-band egress enforcement & observability plane: a way to meter an agent's API usage, forcibly cut its egress when a limit/threshold is reached, and manage running agents from a remote dashboard to prevent cost overruns. This control must act on the agent without the agent's cooperation.

Motivation

I want two things bot-bottle can't currently do:

  1. Forced egress shutdown on limit. When an agent crosses an API/cost threshold, kill its egress automatically — no human in the loop required.
  2. Remote management. Drive agents from a dashboard (see usage, cut egress, stop bottles) to prevent cost overruns.

Why the supervise sidecar does not solve this

The existing supervise sidecar (PRD 0013) is entirely agent-initiated. Per bot_bottle/supervise.py:

egress-block / allow — the agent proposes a new routes.yaml … capability-block — the agent proposes a new Dockerfile. Each tool call, the agent passes the proposed file + justification; the operator approves; the sidecar applies.

Every action starts with the agent voluntarily calling an MCP tool. A runaway or expensive agent — exactly the cost-overrun case — will never call egress-block on itself. So supervision is a collaborative recovery mechanism, not an enforcement mechanism. Making it mandatory (see #249) would not deliver forced cost-cutoff.

Two distinct planes

This requirement forces a distinction the current design blurs:

  • Plane A — enforcement / observability (this issue). Operator/system → infrastructure. Meter usage, cut egress on threshold or remote command, account for cost. Out-of-band; does not depend on the agent. Should be unconditional — an enforcement plane you can opt out of isn't enforcement.
  • Plane B — agent-facing recovery (the current supervise sidecar). Agent → operator, approval-gated. Useful interactively; meaningless for a headless agent with no operator watching its queue. Can remain optional.

Proposed design

Build Plane A on the egress sidecar, which is already always-on and is the MITM proxy every agent's traffic flows through — so it is the natural place to both observe and enforce:

  • Metering. The egress proxy already sees all calls to API hosts (e.g. api.anthropic.com). Count requests / parse usage so per-bottle consumption is known without agent involvement.
  • Forced cutoff. A host-side operation on the egress plane: drop the bottle's routes.yaml to empty (+ reload), or isolate the bottle from the egress network. Triggered either automatically by a configured threshold, or on demand from the dashboard.
  • Control endpoint. An out-of-band channel the dashboard calls to read usage and command a cutoff/stop. Likely a single host-level controller managing all bottles (cf. cli/supervise.py's cross-bottle view and the existing passive dashboard surface) rather than a per-bottle daemon.

Design constraint: the auto-cutoff must NOT be implemented as a proposal on the supervise queue. The trigger (usage threshold) and the action (route-drop) both live in the egress plane and execute without the agent in the loop.
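
A minimal sketch of the trigger/action pairing described above, assuming a hypothetical `EgressMeter` class that lives in the egress plane. The class name, the `cutoff` callback, and the token-based limit are illustrative assumptions, not existing bot-bottle code. The point it shows is that the threshold check and the cutoff both run without any call from the agent:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class EgressMeter:
    """Hypothetical per-bottle meter; not an existing bot-bottle class."""

    limits: Dict[str, int]                 # bottle id -> token limit
    cutoff: Callable[[str], None]          # host-side action, e.g. empty routes + reload
    used: Dict[str, int] = field(default_factory=dict)
    cut: Set[str] = field(default_factory=set)

    def record(self, bottle: str, tokens: int) -> None:
        """Add observed usage and fire the cutoff once the limit is reached."""
        self.used[bottle] = self.used.get(bottle, 0) + tokens
        limit = self.limits.get(bottle)
        if limit is not None and self.used[bottle] >= limit:
            self.force_cutoff(bottle)

    def force_cutoff(self, bottle: str) -> None:
        """On-demand cutoff, as a dashboard command would request; fires at most once."""
        if bottle not in self.cut:
            self.cut.add(bottle)
            self.cutoff(bottle)


dropped = []  # stands in for the real route-drop in this sketch
meter = EgressMeter(limits={"bottle-a": 1000}, cutoff=dropped.append)
meter.record("bottle-a", 600)
meter.record("bottle-a", 500)  # crosses the limit, cutoff fires
meter.record("bottle-a", 50)   # already cut, no second call
```

Both the automatic path (`record`) and the manual path (`force_cutoff`) end in the same host-side action, which keeps the supervise queue out of the loop entirely.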

Relationship to #249

#249 proposes making the supervise sidecar (Plane B) mandatory. This issue argues the property worth making unconditional is Plane A, not Plane B. Two coherent paths:

  1. Keep supervise optional (default-on for ergonomics) and add Plane A as a separate egress-plane feature.
  2. Reframe "always-on control" as the enforcement plane: make Plane A unconditional, leave the agent-facing proposal tools (Plane B) opt-in. Under this framing #249 becomes "the egress control plane is always present" — a more defensible invariant than "every agent runs the agent-facing supervisor."

Unsupervised (headless/CI/ephemeral) agents remain first-class either way: they are still subject to the mandatory meter + kill switch; they simply lack the agent-facing proposal tools they couldn't use anyway.

Open questions

  • Where does the usage signal come from — request counts at the proxy, response-body token parsing, or a provider usage API? What's accurate enough to gate on?
  • Per-bottle thresholds vs. a global budget across all bottles?
  • Host-level controller vs. per-bottle control daemon for the dashboard channel.
  • On cutoff: drop egress only, or also pause/stop the bottle?
  • Does the dashboard need authn for remote control, and over what transport?
didericis added the Kind/Feature label 2026-06-23 21:45:18 -04:00
Author
Owner

Where does the usage signal come from — request counts at the proxy, response-body token parsing, or a provider usage API? What's accurate enough to gate on?

Agent providers should have an abstract “count_tokens” method that takes a request and returns the tokens it uses. By default it should use a good enough token estimation function. Ideally stdlib only, but it’s ok to use a python library we add to a set of python dependencies for the sidecar if needed for the fallback.

The built-in codex and claude providers should use the OpenAI and Anthropic endpoints for counting tokens, respectively.
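
One way this could look, as a rough sketch: a base class with an overridable `count_tokens` whose default is a stdlib-only estimate. The class names, the four-characters-per-token heuristic, and the request shape are assumptions for illustration. The provider-specific subclass only marks where a real token-counting call would go and does not make one:

```python
import json
from typing import Any, Mapping


def estimate_tokens(text: str) -> int:
    """Rough stdlib-only estimate: about four characters per token (a heuristic)."""
    return max(1, (len(text) + 3) // 4)


class AgentProvider:
    """Hypothetical provider base class; the name and shape are assumptions."""

    name = "generic"

    def count_tokens(self, request: Mapping[str, Any]) -> int:
        """Default fallback: estimate from the serialized request body."""
        return estimate_tokens(json.dumps(request, separators=(",", ":")))


class ClaudeProvider(AgentProvider):
    name = "claude"

    def count_tokens(self, request: Mapping[str, Any]) -> int:
        # A real implementation would call the provider's token-counting endpoint
        # here and fall back to the estimator on failure. That call is omitted.
        return super().count_tokens(request)
```

The estimate is deliberately crude. It is only meant to be good enough to gate on when no provider endpoint is available.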

Author
Owner

Per-bottle thresholds vs. a global budget across all bottles?

Probably makes sense to have a “global” budget… this is something we eventually want to add to a control plane that can operate across hosts. It also might make sense to introduce sqlite at this point… I think for an initial MVP, we want to avoid the need for an external API and use a local sqlite instance. We’ll then want to move some of the state, auditing, and tracking to the sqlite db. The sqlite db should be host level, and we should probably wrap operations on it in an API so we can easily swap to a cloud service in the near future.

Before executing on this, evaluate the pros and cons of introducing sqlite and let me know if it makes sense to introduce sqlite now.

EDIT: we want the global token budgets, and that should be higher priority, but we should also allow budgets per active agent and bottle… let’s keep the agent/bottle budgets so they override each other based on precedence (agent overrides bottle, which overrides parent bottle, etc.), and let’s also have a budget option when launching a new bottle.
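
The override order could be resolved with something like the sketch below. The function name and argument shapes are hypothetical. It only picks which scoped limit applies (agent, then bottle, then each parent bottle, then the global value). Whether the global budget additionally acts as a hard cap across all bottles is left open here:

```python
from typing import List, Optional


def effective_budget(
    agent: Optional[int],
    bottle_chain: List[Optional[int]],
    global_budget: Optional[int],
) -> Optional[int]:
    """Return the most specific budget that is set, or None if none is.

    bottle_chain is ordered from the bottle itself outward to its outermost parent.
    """
    for candidate in [agent, *bottle_chain, global_budget]:
        if candidate is not None:
            return candidate
    return None
```

For example, an agent with no budget of its own, running in a bottle that also has none but whose parent is limited to 5000 tokens, would resolve to 5000 before the global value is consulted.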

Author
Owner

Host-level controller vs. per-bottle control daemon for the dashboard channel.

Should be host level. Once this is in place, we should also probably move the supervisor UI to the same dashboard. To start I think we want one host level “dashboard” TUI, and we want any state needed to drive it to live in a host level sqlite db.

Author
Owner

On cutoff: drop egress only, or also pause/stop the bottle?

Good question: probably should add a customizable “cutoff policy” to bottles, with the following options:

  • “cutoff” (default, closes egress and keeps the agent/bottle running)
  • “freeze” (commits the state and then kills the agent/bottle)
  • “kill” (kills the agent/bottle without saving state)
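
As a sketch, the three options could be an enum plus a small dispatch table. `CutoffPolicy`, `parse_policy`, `apply_policy`, and the step names (`drop_egress`, `snapshot`, `stop`) are hypothetical. The real primitives would be whatever the backend already provides:

```python
from enum import Enum
from typing import Callable, Dict, List


class CutoffPolicy(Enum):
    CUTOFF = "cutoff"  # close egress, keep the agent/bottle running
    FREEZE = "freeze"  # commit state, then kill the agent/bottle
    KILL = "kill"      # kill the agent/bottle without saving state


def parse_policy(value: str = "cutoff") -> CutoffPolicy:
    """Parse a configured policy name; "cutoff" is the default."""
    return CutoffPolicy(value.strip().lower())


def apply_policy(
    policy: CutoffPolicy,
    bottle: str,
    actions: Dict[str, Callable[[str], None]],
) -> List[str]:
    """Run the steps for a policy; `actions` maps assumed step names to callables."""
    steps = {
        CutoffPolicy.CUTOFF: ["drop_egress"],
        CutoffPolicy.FREEZE: ["snapshot", "stop"],
        CutoffPolicy.KILL: ["stop"],
    }[policy]
    for step in steps:
        actions[step](bottle)
    return steps
```

An unknown policy name raises `ValueError` from the enum lookup, so a typo in configuration fails loudly instead of silently falling back to a default.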
Author
Owner

Does the dashboard need authn for remote control, and over what transport?

Eventually yes (for web and mobile remote control, which will also be cross host), but to start let’s stick with a host-only TUI dashboard without remote control.

Author
Owner

Also will want to introduce a “settings.yml” that lives in the root of the .bot-bottle folder and has a “budget” section for host level budgets. Budgets should be keyed by agent providers (same name as those available to the bottles) and use token counts for budgeting (not dollar values). Should also include the shutdown behavior.

Collaborator

Walked the codebase against these decisions. Responding to the threads, leading with the SQLite question you flagged as a gate.

SQLite — yes, introduce it now (scoped)

The dependency objection doesn't apply. sqlite3 is in the Python stdlib, so it does not break the AGENTS.md "stdlib-first / no runtime pip deps" stance — same category as the hand-rolled yaml_subset.py, except here the stdlib already ships the whole thing. So the usual reason to defer ("don't add a dependency yet") is absent.

The actual problem you're describing is one SQLite is good at. A global token budget decremented concurrently by N egress sidecars (today ~/.bot-bottle/ already has state/, audit/, queue/ written by multiple bottles in parallel) is a read-modify-write race. Doing that over JSON files means hand-rolled file locking; SQLite gives atomic transactions + WAL for free. And the per-agent/per-bottle precedence rollup (agent → bottle → parent) plus "sum across all bottles" is a GROUP BY, not an N-directory rescan-and-parse.

It's also the cleanest analogue for the eventual cloud swap. Your "wrap operations in an API so we can swap to a cloud service" maps directly onto a thin repository/DAO over SQLite → Postgres later. A JSON-file store is a worse rehearsal for a remote DB than SQL is.

Costs (real but bounded):

  • New paradigm in a flat-file repo → needs a schema-version + migration discipline (a schema_version table + idempotent migrations on startup).
  • SQLite serializes writers; needs WAL mode + busy_timeout. At host scale (handful of bottles) this is a non-issue.
  • Test fixtures need temp DBs. Minor.

Recommendation: add it now, but narrowly — only the new metering/budget/audit ledger goes in SQLite, behind a thin repo API at e.g. ~/.bot-bottle/bot-bottle.db. Do not migrate existing per-bottle state (resume metadata.json, transcripts, Dockerfile overrides) — those are per-identity blobs that files handle fine and that don't have the concurrency/aggregation problem. Migrating them now would be churn for no benefit.
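
To make the scope concrete, here is a minimal stdlib sketch of such a ledger. The table layout, the function names, and the 5000 ms timeout are assumptions for illustration, not a proposed schema. It shows WAL mode, `busy_timeout`, an idempotent `schema_version` migration, and the budget check done inside one write transaction so concurrent sidecars cannot race on read-modify-write:

```python
import sqlite3


def open_ledger(path: str) -> sqlite3.Connection:
    """Open the ledger with WAL, a busy timeout, and an idempotent v1 migration."""
    conn = sqlite3.connect(path, isolation_level=None)  # transactions are explicit
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    conn.execute("BEGIN IMMEDIATE")  # serialize concurrent first-time migrations
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    if conn.execute("SELECT COUNT(*) FROM schema_version").fetchone()[0] == 0:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS usage ("
            "bottle TEXT NOT NULL, provider TEXT NOT NULL, tokens INTEGER NOT NULL)"
        )
        conn.execute("INSERT INTO schema_version (version) VALUES (1)")
    conn.execute("COMMIT")
    return conn


def record_usage(
    conn: sqlite3.Connection, bottle: str, provider: str, tokens: int, budget: int
) -> bool:
    """Append usage and report whether the provider total is still under budget."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading the total
    try:
        conn.execute("INSERT INTO usage VALUES (?, ?, ?)", (bottle, provider, tokens))
        total = conn.execute(
            "SELECT COALESCE(SUM(tokens), 0) FROM usage WHERE provider = ?",
            (provider,),
        ).fetchone()[0]
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return total < budget
```

The per-bottle rollup is then a plain query such as `SELECT bottle, SUM(tokens) FROM usage GROUP BY bottle`, and a caller that gets `False` back from `record_usage` would hand off to the cutoff policy.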

count_tokens — split "gate" from "account"

Agent providers should have an abstract count_tokens method … built-in codex/claude should use openai/anthropic endpoints

One refinement worth nailing before building: there are two distinct needs, and the response body is strictly better for one of them.

  • Pre-flight gate (block before sending): you only have the request, so an estimator / provider count_tokens endpoint is the only option. Good fit for the abstract method.
  • Actual accounting (decrement the real budget): the API response already carries authoritative usage (Anthropic input_tokens/output_tokens, OpenAI usage). The egress addon already has a response(flow) hook — so we can read the real number for free, no extra network call. Calling count_tokens for accounting would both be less accurate and add a metered egress call per request.

So I'd suggest: count_tokens (estimator, stdlib fallback) for the gate; parse response usage for the ledger. Caveat: agent traffic is mostly streaming SSE, so the response hook needs to tail the stream for the final usage event — worth scoping explicitly.
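
A rough sketch of the tailing step, assuming the stream has already been split into complete lines and that usage appears either at the top level of an event or nested under `message`. Both shapes are assumptions about provider streaming formats and should be checked against the actual traffic. `tail_usage` is a hypothetical helper, not part of the existing egress addon:

```python
import json
from typing import Dict, Iterable


def tail_usage(sse_lines: Iterable[str]) -> Dict[str, int]:
    """Scan SSE `data:` lines and keep the latest integer usage counters seen."""
    usage: Dict[str, int] = {}
    for line in sse_lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # partial or non-JSON chunk; ignored in this sketch
        if not isinstance(event, dict):
            continue
        # Usage may sit at the top level or under "message" (assumed shapes).
        for holder in (event, event.get("message")):
            found = holder.get("usage") if isinstance(holder, dict) else None
            if isinstance(found, dict):
                usage.update({k: v for k, v in found.items() if isinstance(v, int)})
    return usage
```

Later events overwrite earlier counters, so a running output-token count ends at its final value. A production hook would also need to handle events split across network chunks, which this sketch does not attempt.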

settings.yml — which .bot-bottle, and the parser constraint

settings.yml … in the root of the .bot-bottle folder … budget section keyed by agent providers, token counts, shutdown behavior

Two notes:

  1. There are two .bot-bottle/ roots: the repo one (committed per-repo manifests) and the host one (~/.bot-bottle/, state/audit/queue). Host-level budgets belong in ~/.bot-bottle/settings.yml, not the repo dir — otherwise budgets get committed per-repo. Assuming you mean the host one.
  2. It'll be parsed by yaml_subset.py, which is deliberately a bounded subset (no anchors, no multi-line block scalars). A flat budget: mapping of provider: <int> plus a shutdown: scalar fits fine — just keep it within that shape.
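
For illustration, the shape described in note 2 could be checked after parsing with something like the sketch below. The example file contents in the comment, the numeric values, and the assumption that `shutdown` takes the same three values as the cutoff policy are all hypothetical. The function takes an already-parsed mapping and does not call `yaml_subset.py`:

```python
from collections.abc import Mapping
from typing import Any, Dict

ALLOWED_SHUTDOWN = {"cutoff", "freeze", "kill"}  # assumed to mirror the cutoff policy

# Hypothetical ~/.bot-bottle/settings.yml that this shape corresponds to:
#
#   budget:
#     claude: 2000000
#     codex: 1000000
#   shutdown: cutoff


def validate_settings(parsed: Mapping) -> Dict[str, Any]:
    """Check an already-parsed settings mapping and return a normalized copy."""
    budget = parsed.get("budget", {})
    if not isinstance(budget, Mapping):
        raise ValueError("budget must be a mapping of provider -> token count")
    clean: Dict[str, int] = {}
    for provider, tokens in budget.items():
        if isinstance(tokens, bool) or not isinstance(tokens, int) or tokens < 0:
            raise ValueError(f"budget.{provider} must be a non-negative integer")
        clean[str(provider)] = tokens
    shutdown = parsed.get("shutdown", "cutoff")
    if shutdown not in ALLOWED_SHUTDOWN:
        raise ValueError(f"shutdown must be one of {sorted(ALLOWED_SHUTDOWN)}")
    return {"budget": clean, "shutdown": shutdown}
```

Keeping the file to one nested mapping and one scalar stays inside the bounded subset, so no parser changes should be needed for this shape.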

cutoff policy — primitives already exist

cutoff (default) / freeze / kill

These map onto existing code: freeze = bot_bottle/backend/freeze.py (snapshot → resumable, already built), kill = backend teardown, cutoff = the route-drop on the egress plane. So the policy is mostly wiring an existing enum to existing primitives + the new trigger.


Per the AGENTS.md convention ("durable why lives in a PRD / research note / decision record, not the issue thread") I'll fold the SQLite verdict + the gate-vs-account split into a research note / PRD when we move to execute, rather than letting it live only here. Want me to draft that PRD next?

Author
Owner

@didericis-claude yes, draft the prd in a new pr

Collaborator

Drafted the PRD in a new PR: #285 (branch prd-egress-control-plane).

It folds in every decision from this thread, leading with the SQLite verdict you gated on:

  • SQLite — yes, now but narrow. Host-level ledger at ~/.bot-bottle/bot-bottle.db behind a thin repository API; only the new metering/budget/enforcement-audit state goes in SQL, existing per-bottle flat-file state stays on disk. schema_version + idempotent migrations, WAL + busy_timeout called out as required.
  • Gate vs. account split. Authoritative accounting from response usage via the existing egress response hook (with SSE final-usage tailing scoped as work); count_tokens estimator/endpoint reserved for an optional pre-flight gate.
  • Budgets keyed by provider, token counts, four-scope precedence agent → bottle → parent → global, plus a --budget launch flag.
  • ~/.bot-bottle/settings.yml at the host root (not per-repo), within the yaml_subset.py subset.
  • Cutoff policy cutoff / freeze / kill mapped to existing primitives (egress route-drop, freeze.py, backend teardown); the trigger lives in the egress plane, never the supervise queue.
  • Host-level controller + TUI dashboard, host-only; remote control / authn deferred.

Open questions I couldn't resolve from the thread are carried in the PRD (Open questions section) rather than left here — chiefly SSE usage-tailing robustness and whether mid-request budget crossing needs the pre-flight gate for v1. Review there.


Reference: didericis/bot-bottle#251