PRD 0029: Codex host credentials through egress #110
Closes #109
Summary
Notes
This starts with the PRD-only commit, per workflow.
f89ae45f29 to 8e5262b539

@@ -97,6 +97,7 @@ class SmolmachinesBottlePlan(BottlePlan):
    agent_prompt_mode: PromptMode = "append_file"
    agent_provider_template: str = "claude"
    agent_dockerfile_path: str = ""
    codex_auth_file: Path | None = None

This should be grouped with the agent_provider template in some way, probably scoped under the agent provider, and only respected when the template is set to codex.

Replying to #110 (comment)
The user-facing knob is now scoped under agent_provider as agent_provider.forward_host_credentials, and manifest validation rejects it unless agent_provider.template == "codex". The downstream codex_auth_file field is only a resolved backend plan artifact: prepare writes a non-secret dummy auth file when that Codex-only flag is enabled, and provision copies it if present.

I left the plan field explicit rather than nesting another object there because the plan is already backend-resolved state consumed by launch/provision, not the manifest shape; the provider/template scoping is enforced before either backend can populate it.
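A minimal sketch of the manifest-level gating described above. The AgentProvider dataclass, the ManifestError type, and the validate_agent_provider helper are simplified stand-ins invented for illustration; the real manifest parser is not shown in this thread.

```python
from dataclasses import dataclass


class ManifestError(ValueError):
    """Hypothetical error type for manifest validation failures."""


@dataclass(frozen=True)
class AgentProvider:
    # Field names follow the thread; the defaults are assumptions.
    template: str = "claude"
    forward_host_credentials: bool = False


def validate_agent_provider(provider: AgentProvider) -> AgentProvider:
    """Reject forward_host_credentials unless the template is codex."""
    if provider.forward_host_credentials and provider.template != "codex":
        raise ManifestError(
            "agent_provider.forward_host_credentials requires "
            'agent_provider.template == "codex" '
            f"(got {provider.template!r})"
        )
    return provider
```

Because the check runs at manifest validation time, neither backend needs to re-check the template before reading the resolved plan.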
Replying to #110 (comment)
In the future I'd like users to be able to specify their own agent provisioning logic, so we should group agent-related plans in their own self-contained object, with provisioning functions that could eventually use some sort of "plugins" folder. Take a stab at a redesign of the plan/provisioning steps that accommodates this, then respond here with your plan before implementing.
Replying to #110 (comment)
Agreed. I’d redesign this around a first-class provider provisioning plan instead of sprinkling provider-specific fields onto each backend plan. Proposed shape:
Add a shared AgentProvisionPlan dataclass owned by the provider layer, roughly:
- template: provider id (claude, codex, later plugin name)
- command / prompt_mode / image defaults resolved from the runtime template
- dockerfile_path or image-build input
- guest_env: env that must be present for every agent exec
- home_state: a small set of files/directories to install under the provider home, with mode/owner metadata
- verify: provider-specific post-copy checks, e.g. Codex login status

Move Codex dummy auth handling into provider-plan resolution:
- agent_provider.forward_host_credentials
- template == "codex"
- auth.json, config.toml, CODEX_HOME, and verify command inside AgentProvisionPlan
- agent_provision: AgentProvisionPlan, not codex_auth_file

Make backend provisioning generic:
- AgentProvisionPlan.home_state with docker cp / docker exec chown chmod
- smolvm machine cp / machine exec

Leave an obvious plugin seam without committing to a plugin system in this PR:
- plugins/<provider>/ resolver would implement AgentProvisionPlan plus runtime metadata; the backends do not need to know its custom fields

For this PR I’d keep the implementation narrow: introduce AgentProvisionPlan, migrate the existing Codex auth/config work into it, update Docker/smolmachines to consume it generically, and keep actual external plugin loading out of scope.

Implemented in
10c009c (refactor(agent): group provider provisioning into plan).

What changed:
- Added AgentProvisionPlan plus dir/file/command entries in agent_provider.py
- Replaced codex_auth_file fields with agent_provision on Docker and smolmachines plans

I could not run the Python unit suite in this bot-bottle sandbox because it has no python, python3, or docker binary available. I did run git diff --check, which passes; CI should cover the test suite on the pushed branch.

@@ -131,0 +137,4 @@
# egress MITM CA so Codex trusts the proxy and egress can inject
# the host bearer -- without this, codex bottles need
# pipelock tls_passthrough, which disables auth injection.
guest_env["CODEX_CA_CERTIFICATE"] = (

Env provisioning should be part of the agent provider plan; we shouldn't need to know anything about Codex here.
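One way to read the request above is that the plan carries both the env and the home files, and the backend only interprets them. The sketch below assumes a reduced AgentProvisionPlan; the HomeFile type, the helper names, the guest paths, and the owner/mode defaults are all hypothetical. It builds docker argv lists without executing them.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class HomeFile:
    """One file to install under the provider home (hypothetical shape)."""
    source: Path                 # file prepared on the host
    dest: str                    # absolute path inside the guest
    mode: str = "600"            # assumed default
    owner: str = "agent:agent"   # assumed guest user


@dataclass(frozen=True)
class AgentProvisionPlan:
    template: str
    env_vars: dict[str, str] = field(default_factory=dict)
    home_state: tuple[HomeFile, ...] = ()


def merge_guest_env(base: dict[str, str], plan: AgentProvisionPlan) -> dict[str, str]:
    """Backend-side merge: plan values win, with no provider branching."""
    return {**base, **plan.env_vars}


def docker_provision_commands(container: str, plan: AgentProvisionPlan) -> list[list[str]]:
    """Translate home_state into docker argv lists (not executed here)."""
    commands: list[list[str]] = []
    for item in plan.home_state:
        commands.append(["docker", "cp", str(item.source), f"{container}:{item.dest}"])
        commands.append(["docker", "exec", container, "chown", item.owner, item.dest])
        commands.append(["docker", "exec", container, "chmod", item.mode, item.dest])
    return commands


# Example plan; every path here is a placeholder, not taken from the PR.
example_plan = AgentProvisionPlan(
    template="codex",
    env_vars={"CODEX_CA_CERTIFICATE": "/etc/agent-ca.pem"},
    home_state=(HomeFile(Path("/tmp/auth.json"), "/home/agent/.codex/auth.json"),),
)
```

With this shape, the CODEX_CA_CERTIFICATE assignment would live in the provider's plan resolution, and the backend would only call merge_guest_env.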
Replying to #110 (comment)
I’d redesign this around a provider-owned provisioning plan so provider logic can grow its own surface without leaking into backend-specific plan fields. Concretely:
- agent_provider, including forward_host_credentials.
- AgentProvisionPlan that can carry files, env, commands, and optional verify hooks.
- codex_auth_file.
- plugins/<provider>/ resolvers return the same plan shape.

That keeps the current PR narrow while giving us a clean place to move provider-specific provisioning later.
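To make the plugin seam concrete, here is a rough sketch of resolvers that all return the same plan shape. The registry, the register_resolver decorator, the settings dict, and the CODEX_HOME value are assumptions made for illustration; no plugin loading from a plugins/ folder is implied.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class AgentProvisionPlan:
    template: str
    env_vars: dict[str, str] = field(default_factory=dict)


# A resolver turns manifest-level provider settings into a plan.
Resolver = Callable[[dict[str, object]], AgentProvisionPlan]

_RESOLVERS: dict[str, Resolver] = {}


def register_resolver(template: str) -> Callable[[Resolver], Resolver]:
    """Register a resolver under a template name (hypothetical seam)."""
    def decorator(fn: Resolver) -> Resolver:
        _RESOLVERS[template] = fn
        return fn
    return decorator


@register_resolver("codex")
def _codex(settings: dict[str, object]) -> AgentProvisionPlan:
    env: dict[str, str] = {}
    if settings.get("forward_host_credentials"):
        env["CODEX_HOME"] = "/home/agent/.codex"  # placeholder path
    return AgentProvisionPlan("codex", env)


def agent_provision_plan(template: str, settings: dict[str, object]) -> AgentProvisionPlan:
    """Look up the resolver by template; backends only see the plan."""
    try:
        resolver = _RESOLVERS[template]
    except KeyError:
        raise ValueError(f"no provision resolver for template {template!r}") from None
    return resolver(settings)
```

A future external provider would only need to register another resolver; the backends would keep consuming AgentProvisionPlan unchanged.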
Clarification: this is the plan I propose to implement next, not something I have already done. The current code still has the provider-specific plan fields; the point of the redesign is to replace that with a provider-owned AgentProvisionPlan and generic backend interpreters before adding any future plugin loading.

This should also be in the agent provisioner now, assuming we can evaluate has_provider_auth at that stage. If not, we'll need a more generic hook to call into here; there should not be any logic specific to a particular type of agent in here anymore.

Resolved in 1fceaae: the Claude telemetry defaults are now set in agent_provision_plan() when template == "claude" and has_provider_auth is true. Both backends only merge agent_provision.env_vars into the guest env during prepare, so the provisioning step owns the defaults.

@@ -114,0 +122,4 @@
# ("request header contains secret") and blocks. Pass them through so
# pipelock still enforces the host allowlist on CONNECT but does not
# decrypt + rescan egress-owned auth. The auto-added routes live in
# egress_routes_for_bottle, not bottle.egress.routes, so add the

Why can't we just provision the bottle egress routes based on an agent provision plan?
Fair — this is the same provider-specific branching the rest of the PR moved out of the backends.
The root cause is that EgressRoute has no tls_passthrough field, so pipelock_effective_tls_passthrough cannot discover that the synthesized Codex credential routes need passthrough by inspecting egress_routes_for_bottle(). The quickest clean fix:
- Add tls_passthrough: bool = False to EgressRoute.
- Set it to True in _ensure_codex_host_credential_route.
- Have pipelock_effective_tls_passthrough call egress_routes_for_bottle(bottle) and check route.tls_passthrough. The manifest routes already carry route.Pipelock.TlsPassthrough, so that flag would need to be lifted into EgressRoute too (or we fold the manifest passthrough flag into egress_manifest_routes).

That removes the provider check from pipelock entirely. Want me to rework it along those lines?
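A small sketch of the fix proposed above, under simplifying assumptions: EgressRoute is reduced to a host plus the new flag, and pipelock_effective_tls_passthrough takes the already-resolved routes rather than a bottle. The return type and the example hosts are placeholders, not the project's real signature or data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EgressRoute:
    host: str
    tls_passthrough: bool = False  # the proposed new field


def pipelock_effective_tls_passthrough(routes: Iterable[EgressRoute]) -> tuple[str, ...]:
    """Hosts pipelock should pass through, derived only from route flags."""
    return tuple(sorted({route.host for route in routes if route.tls_passthrough}))


# Placeholder hosts: one manifest route and one provider-synthesized route.
example_routes = [
    EgressRoute("registry.example.com"),
    EgressRoute("api.example.com", tls_passthrough=True),
]
```

Since the function only reads the flag, it no longer needs to know which provider (if any) produced a given route.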
RE: #110 (comment)
I want to remove _ensure_codex_host_credential_route, and have that happen via an agent provider abstraction that can be generalized to different providers. There should be some way we can add specific egress routes needed by a specific provider, and have that be a part of the preparation/provisioning flow.

RE: #110 (comment)
Agreed on the direction. Concrete shape I have in mind:
- Add egress_routes: tuple[EgressRoute, ...] to AgentProvisionPlan: the provider declares the routes it needs (with tls_passthrough on EgressRoute to carry that signal).
- Move the Codex credential-route logic out of egress_routes_for_bottle and into agent_provision_plan(), populating egress_routes when forward_host_credentials=True. _ensure_codex_host_credential_route goes away.
- Egress.prepare() accepts the plan’s egress_routes and merges them with the manifest routes (conflict detection stays, just moves here).
- pipelock_effective_tls_passthrough reads the merged routes and checks route.tls_passthrough, with no provider branching.

One sequencing wrinkle: both backends currently call Egress.prepare() before agent_provision_plan(). Fixing this means calling agent_provision_plan() first and threading the plan into Egress.prepare(). That’s a straightforward reorder but worth flagging. Should I go ahead and implement this?

Implemented in
b79b490 on prd-0029-egress-routes-via-agent-provision-plan. Summary of what changed:
- EgressRoute gains tls_passthrough: bool. egress_manifest_routes lifts the manifest pipelock.tls_passthrough flag here so the field is set for all routes regardless of origin.
- AgentProvisionPlan gains egress_routes: tuple[EgressRoute, ...]. agent_provision_plan populates it for Codex + forward_host_credentials with tls_passthrough=True.
- egress_routes_for_bottle drops the template == "codex" branch. It now takes provider_routes and merges them generically via _merge_provider_route (conflict detection preserved: different auth on the same host still dies).
- pipelock_effective_tls_passthrough drops the provider check. It reads route.tls_passthrough from egress_routes_for_bottle(bottle, provider_routes), covering both manifest and provider routes in one pass.
- Both backends call agent_provision_plan first, then pass plan.egress_routes to Egress.prepare and PipelockProxy.prepare. has_provider_auth is computed from egress_manifest_routes (provider routes carry no auth roles so the result is identical).

Debugging a live codex smolmachines bottle surfaced three independent failures past the sign-in screen; fix each so forward_host_credentials works end to end:
- codex_auth: dummy access/id tokens now inherit the *real* host token's exp instead of now+1h. Codex (0.135) refreshes when its local token's JWT exp lapses; with a placeholder refresh_token that refresh fails and drops to the sign-in screen. Aligning exp tracks the real token's life.
- prepare: set CODEX_CA_CERTIFICATE to the agent CA bundle for codex bottles. Codex is rustls and ignores the system store / NODE_EXTRA_CA_CERTS; it reads CODEX_CA_CERTIFICATE (fallback SSL_CERT_FILE) for custom roots across HTTPS + wss, so it must be pointed at the egress MITM CA or injection can't work without tls_passthrough.
- pipelock: auto tls_passthrough the Codex API hosts when forward_host_credentials is on. Egress injects the bearer before pipelock, whose header DLP then flags the JWT ("request header contains secret") and the retry storm trips its 429. passthrough host-gates the CONNECT but skips decrypt+rescan of egress-owned auth. The auto-added routes aren't in bottle.egress.routes, so the hosts are added explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Operators can now declare:

  agent_provider:
    template: claude
    auth_token: BOT_BOTTLE_CLAUDE_OAUTH_TOKEN

and the provisioner injects a provider-owned api.anthropic.com egress route (Bearer, tls_passthrough) rather than requiring a manually declared route with the former claude_code_oauth role.

Changes:
- Add auth_token field to AgentProvider; validate claude-only.
- Remove claude_code_oauth from EGRESS_ROLES / PROVIDER_EGRESS_ROLES. Manifests that declare the role now fail at parse time with "unknown role"; the provisioner owns the route.
- agent_provision_plan: replace manifest_egress_routes/has_provider_auth with auth_token; Claude branch injects the api.anthropic.com route, placeholder env, and nonessential-traffic flags when auth_token is set.
- Add hidden_env_names: frozenset[str] to AgentProvisionPlan; Claude branch populates it with CLAUDE_CODE_OAUTH_TOKEN.
- Remove auth_role from AgentProviderRuntime and placeholder_env_for().
- print_util.visible_agent_env_names: accept hidden_env_names from the plan instead of dispatching on agent_provider_template.
- Both backends: drop manifest_egress_routes call, pass auth_token.
- PRD 0029 rescoped to cover both Codex and Claude provider auth.

Assisted-by: Claude Code

650f3aa93e to f8fc29ce87