Commit Graph

19 Commits

Author SHA1 Message Date
didericis 2dd8113f7c fix(smolmachines): retry CA install after exec SIGKILL
test / unit (push) Successful in 38s
test / integration (push) Successful in 54s
2026-06-01 23:28:33 -04:00
didericis-codex e808e81b87 refactor(agent): group provider provisioning into plan 2026-06-01 22:24:17 -04:00
didericis-codex 36ce7aed4f refactor(codex): derive trusted paths from guest home 2026-06-01 22:24:17 -04:00
didericis-codex a5d83bdcdc fix(codex): trust launch home directory 2026-06-01 22:24:17 -04:00
didericis-codex 8e6583fcb7 fix(codex): trust bottle workspace on launch 2026-06-01 22:24:17 -04:00
didericis-codex ac1aa197d4 fix(smolmachines): reset codex runtime db before auth check 2026-06-01 22:24:17 -04:00
didericis-codex 68e5097534 fix(codex): make host-credential bottles actually authenticate
Debugging a live codex smolmachines bottle surfaced three independent
failures past the sign-in screen; fix each so forward_host_credentials
works end to end:

- codex_auth: dummy access/id tokens now inherit the *real* host token's
  exp instead of now+1h. Codex (0.135) refreshes when its local token's
  JWT exp lapses; with a placeholder refresh_token that refresh fails and
  Codex drops to the sign-in screen. Aligning exp makes the dummy token
  expire when the real one does.

- prepare: set CODEX_CA_CERTIFICATE to the agent CA bundle for codex
  bottles. Codex uses rustls and ignores both the system store and
  NODE_EXTRA_CA_CERTS; it reads CODEX_CA_CERTIFICATE (falling back to
  SSL_CERT_FILE) for custom roots across HTTPS + wss, so it must be
  pointed at the egress MITM CA, or injection can't work without
  tls_passthrough.

- pipelock: automatically enable tls_passthrough for the Codex API hosts
  when forward_host_credentials is on. Egress injects the bearer before
  pipelock, whose header DLP then flags the JWT ("request header contains
  secret"), and the resulting retry storm trips its 429 rate limit. With
  passthrough, pipelock still host-gates the CONNECT but skips decrypting
  and rescanning the egress-owned auth. The auto-added routes aren't in
  bottle.egress.routes, so these hosts are added explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:24:17 -04:00
didericis-codex a6332b9535 fix(codex): provision dummy user auth state 2026-06-01 22:24:17 -04:00
didericis-codex 6ea19a8d53 fix(git-gate): use smart http for smolmachines pushes
test / unit (pull_request) Successful in 40s
test / integration (pull_request) Successful in 54s
test / unit (push) Successful in 37s
test / integration (push) Successful in 44s
2026-05-29 23:21:50 -04:00
didericis-codex c08b09dc9f refactor!: rename project to bot-bottle
Assisted-by: Codex
2026-05-28 17:56:14 -04:00
didericis-codex 59ee32cc8d refactor(manifest): key git config by host
test / unit (pull_request) Successful in 33s
test / integration (pull_request) Successful in 42s
2026-05-28 00:49:34 -04:00
didericis-claude c9cdd41110 feat(smolmachines): apply git_user via git config --global on provision (issue #86)
Mirror the docker backend's third provisioning subcase in
`backend/smolmachines/provision/git.py`:

  _provision_git_user(plan, target)

Runs

  smolvm machine exec --name <M> -e HOME=/home/node -e USER=node \
      -- runuser -u node -- git config --global user.<X> <value>

for each git_user field. No-op when `git_user.is_empty()`.

`runuser -u node --` switches the UID without invoking a login
shell (matching the existing `Bottle.exec_claude` pattern).
HOME / USER are forced via `smolvm -e` because bare runuser
inherits root's HOME=/root, which would make --global write
/root/.gitconfig instead of /home/node/.gitconfig (where the
existing `_provision_git_gate_config` writes).
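A rough sketch of the argv construction described above. The command shape comes from this commit; `GitUser` and `git_user_commands` are hypothetical stand-ins (the real `_provision_git_user(plan, target)` executes the commands rather than returning them).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GitUser:
    # Hypothetical stand-in for the manifest's git_user section.
    name: Optional[str] = None
    email: Optional[str] = None

    def is_empty(self) -> bool:
        return not (self.name or self.email)


def git_user_commands(machine: str, git_user: GitUser) -> list[list[str]]:
    """One smolvm exec argv per set git_user field (empty list = no-op)."""
    if git_user.is_empty():
        return []
    cmds = []
    for field in ("name", "email"):
        value = getattr(git_user, field)
        if not value:
            continue
        cmds.append([
            "smolvm", "machine", "exec", "--name", machine,
            # Force node's HOME/USER so --global targets /home/node/.gitconfig.
            "-e", "HOME=/home/node", "-e", "USER=node",
            # Switch UID without a login shell.
            "--", "runuser", "-u", "node", "--",
            "git", "config", "--global", f"user.{field}", value,
        ])
    return cmds
```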

4 unit tests in test_smolmachines_provision.TestProvisionGitUser:
no-op, both-set (asserts runuser prefix + HOME/USER env),
name-only, email-only. 661 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:00:21 -04:00
didericis-claude 5e0130b56f fix(smolmachines): build agent image in launch, not prepare
test / unit (push) Successful in 26s
test / integration (push) Successful in 43s
When a smolmachines agent was started from the dashboard, the
docker-build output rendered on top of the curses preflight
modal: the build was kicked off before the operator had
confirmed launch. The docker backend's `prepare` is pure
resolution (no docker calls); smolmachines was inconsistent
because its `prepare` called `_ensure_smolmachine`, which ran
`docker build` → `docker save` → `crane push` → `smolvm pack
create` and printed several seconds of stderr noise before the
y/N prompt.

Move the pipeline:

- `_ensure_smolmachine` (+ `_SMOLMACHINE_CACHE_DIR` + `_REPO_DIR`
  + the local-registry / smolvm imports) moves from
  `backend/smolmachines/prepare.py` to
  `backend/smolmachines/launch.py`. Called right before
  `_smolvm.machine_create` so the resulting `.smolmachine`
  sidecar path lands as a local in `launch`, not on the plan.

- `SmolmachinesBottlePlan.agent_from_path: Path` becomes
  `agent_image_ref: str`. `prepare` stashes only the docker tag
  (`$CLAUDE_BOTTLE_IMAGE` || `claude-bottle:latest`); `launch`
  resolves it into the artifact at bringup.
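The new prepare-vs-launch split can be sketched like this. `PlanSketch`, the injected `ensure_smolmachine` callable, and these function bodies are illustrative assumptions; only the env var, the default tag, and the "resolve at launch" ordering come from the commit.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping

DEFAULT_AGENT_IMAGE = "claude-bottle:latest"


@dataclass(frozen=True)
class PlanSketch:
    # Hypothetical slice of SmolmachinesBottlePlan: only the docker tag.
    agent_image_ref: str


def prepare(env: Mapping[str, str] = os.environ) -> PlanSketch:
    """Pure resolution: no docker calls, so nothing prints before y/N."""
    # `or` mirrors shell `||`: an empty CLAUDE_BOTTLE_IMAGE falls back too.
    return PlanSketch(env.get("CLAUDE_BOTTLE_IMAGE") or DEFAULT_AGENT_IMAGE)


def launch(plan: PlanSketch, ensure_smolmachine: Callable[[str], str]) -> str:
    """Runs only after the operator confirms; the artifact path stays local."""
    smolmachine_path = ensure_smolmachine(plan.agent_image_ref)
    # The real launch hands this path to _smolvm.machine_create next.
    return smolmachine_path
```

Injecting `ensure_smolmachine` keeps the sketch testable without docker; in the real code it is a module-level function in `launch.py`.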

This puts smolmachines on the same prepare-vs-launch boundary
the docker backend uses: the preflight summary in the dashboard
prints, the operator confirms, then `launch` runs — and its
stderr is routed via `_route_op_to_right_pane` (in tmux) or via
`curses.endwin` (foreground handoff) so the build output lands
cleanly.

Tests:
- `tests/unit/test_smolmachines_prepare_image.py` →
  `tests/unit/test_smolmachines_launch_image.py`, updated to
  import `_ensure_smolmachine` from `launch` rather than
  `prepare`.
- `test_smolmachines_provision.py`: plan fixture switches
  `agent_from_path` → `agent_image_ref`.

593 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 19:44:53 -04:00
didericis-claude d02fe50193 fix(smolmachines): run claude mcp add as node so config lands in node's home
test / unit (pull_request) Successful in 26s
test / integration (pull_request) Successful in 40s
provision_supervise dispatched `claude mcp add --scope user`
through `smolvm machine_exec`, which runs as root by default.
The MCP entry got written to root's ~/.claude.json — but the
agent's claude reads /home/node/.claude.json, so `/mcp` showed
"No MCP servers configured" inside the bottle.

Wrap the exec in `runuser -u node -- env HOME=/home/node ...`
so the config writes to the right home. Same pattern as the
interactive exec_claude / Bottle.exec wrappers — `smolvm
machine_exec` is always root, so any command that touches user
state has to switch UID + set HOME explicitly.
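The wrapping pattern above amounts to a small argv prefix. `as_node` is a hypothetical helper name used for illustration; the `runuser -u node -- env HOME=/home/node ...` shape is from this commit.

```python
NODE_HOME = "/home/node"


def as_node(argv: list[str]) -> list[str]:
    """Prefix a command so it runs as `node` with node's HOME.

    smolvm machine_exec always runs as root, so anything touching per-user
    state (~/.claude.json, ~/.gitconfig) must switch UID and set HOME;
    runuser alone would leave HOME=/root.
    """
    return ["runuser", "-u", "node", "--", "env", f"HOME={NODE_HOME}", *argv]
```

For example, `as_node(["claude", "mcp", "add", "--scope", "user", ...])` makes the MCP entry land in `/home/node/.claude.json`.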

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 16:08:08 -04:00
didericis-claude 5486170be1 fix(smolmachines): route agent through egress when routes declared, wait for VM warm-up
test / unit (pull_request) Successful in 26s
test / integration (pull_request) Successful in 42s
Two related bugs:

1. Auth chain bypassed egress. After the Docker-Desktop port
   pivot, the agent always dialed pipelock directly, so egress
   (which holds the real OAuth token and rewrites the
   Authorization header) wasn't in the request path. The
   placeholder bearer reached Anthropic verbatim → 401 "Invalid
   bearer token". Fix: when the bottle declares egress.routes,
   the agent's first hop is egress (publish egress port 9099 to
   host loopback, leave pipelock bundle-internal). Without
   routes, the agent dials pipelock directly. Same hop order as
   the docker backend.

2. provision_ca's update-ca-certificates was SIGKILLed after
   ~100ms on Docker Desktop. Back-to-back `smolvm machine exec`
   calls immediately after machine_start hit a VM warm-up race
   in libkrun's exec channel; the second exec's child was
   SIGKILLed before producing more than the first line of
   stdout. The agent's trust store never got the egress MITM
   CA's hash symlink, so curl/openssl couldn't validate the
   TLS chain. Fix: sleep 1.5s after machine_start (empirically
   enough), and fold provision_ca's chown + chmod +
   update-ca-certificates into one `sh -c` so we only pay for
   one exec round trip. Bail with a clear error if
   update-ca-certificates doesn't report "1 added" (failing
   silently was how the original SIGKILL went unnoticed).
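Both fixes can be sketched in a few lines. The helper names are hypothetical, and so are the `root:root` owner and `0644` mode (the commit doesn't give them). The "N added, M removed; done." line is what Debian's update-ca-certificates prints.

```python
import re
import shlex


def agent_proxy_url(has_routes: bool, egress_port: int, pipelock_port: int) -> str:
    """First hop for the agent's HTTPS_PROXY on host loopback.

    With egress.routes declared the agent must hit egress (which swaps in
    the real token); otherwise it dials pipelock directly.
    """
    port = egress_port if has_routes else pipelock_port
    return f"http://127.0.0.1:{port}"


def ca_install_script(guest_ca_path: str) -> str:
    """chown + chmod + update-ca-certificates as one `sh -c` body."""
    p = shlex.quote(guest_ca_path)
    # Assumed ownership/mode; adjust to whatever provision_ca really uses.
    return f"chown root:root {p} && chmod 0644 {p} && update-ca-certificates"


_ADDED = re.compile(r"\b(\d+) added\b")


def check_ca_update(stdout: str) -> None:
    """Fail loudly unless update-ca-certificates reported exactly 1 added."""
    m = _ADDED.search(stdout)
    if m is None or int(m.group(1)) != 1:
        raise RuntimeError(f"update-ca-certificates did not report '1 added': {stdout!r}")
```

A truncated run (the SIGKILL case) prints only the "Updating certificates ..." line, so the check raises instead of passing silently.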

Net effect on Docker Desktop / macOS: claude's HTTPS_PROXY is
`http://127.0.0.1:<egress port>`, egress rewrites auth, pipelock
allowlists + DLPs, request reaches api.anthropic.com with a
real token. End-to-end verified.

Also drops the PRD-0023-chunk-3 EGRESS_LISTEN_HOST=127.0.0.1
mitigation. The original concern (agent bypassing pipelock by
dialing egress's port on the bundle IP) doesn't apply in this
topology: the agent can only reach whatever port we publish on
host loopback, and egress is the only HTTP/HTTPS chokepoint
that gets published.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 15:57:18 -04:00
didericis-claude 4f136a9932 fix(smolmachines): agent dials bundle via host loopback ports, not docker bridge IP
test / unit (pull_request) Successful in 26s
test / integration (pull_request) Successful in 39s
Claude hung on outbound network calls under
CLAUDE_BOTTLE_BACKEND=smolmachines:

  Unable to connect to API (FailedToOpenSocket)

Root cause: the PRD-0023 design pinned the bundle at a docker
bridge IP (192.168.X.2) and set the smolvm guest's TSI allowlist
to `<bundle-ip>/32`. On native Linux this works: the host shares
the docker bridge's network namespace, so TSI's syscall
impersonation reaches the bridge IP directly. On Docker Desktop
(macOS), the daemon runs in its own Linux VM and docker bridge
IPs aren't reachable from macOS networking, so the smolvm
guest's TSI requests fail with "Network is unreachable" before
they reach pipelock.

Fix: publish each agent-facing bundle daemon's port on host
loopback (-p 127.0.0.1::PORT), discover the random host-side
ports after start, and route the agent through
`127.0.0.1:<host port>` instead of the bridge IP. macOS loopback
is the surface Docker Desktop's gvproxy forwards into the
daemon's VM, so the chain (guest TSI -> macOS loopback ->
daemon VM port-forward -> bundle container) works on both
Docker Desktop and native Linux.

Concrete changes:
- BundleLaunchSpec: add `ports_to_publish` so start_bundle adds
  `-p 127.0.0.1::PORT` for the agent-facing ports (pipelock
  always; git-gate when upstreams declared; supervise when
  enabled). Egress's port stays bundle-internal.
- sidecar_bundle.bundle_host_port(): wrap `docker port <bundle>
  <container_port>/tcp` so launch can look up the random
  host-side mapping after start.
- launch.py: discover the host ports, build URLs of the form
  `http://127.0.0.1:<host port>` / `git://127.0.0.1:<host port>`,
  stamp onto guest_env + new agent_*_url fields on the plan.
- launch.py: TSI allow_cidrs flips to `["127.0.0.1/32"]`. The
  bundle IP is no longer the agent's target.
- prepare.py: stop synthesizing HTTPS_PROXY / GIT_GATE_URL /
  MCP_SUPERVISE_URL at prepare time — launch owns those now
  (the values depend on a port docker hasn't assigned yet).
- provision_git: gate_host from plan.agent_git_gate_host.
- provision_supervise: URL from plan.agent_supervise_url.
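A minimal sketch of the publish/discover pair, assuming `docker port` prints a single `127.0.0.1:<port>` line for a loopback-only mapping. `publish_args` and `parse_docker_port` are illustrative names; only `bundle_host_port` and the `-p 127.0.0.1::PORT` form appear in the commit.

```python
import subprocess


def publish_args(container_ports: list[int]) -> list[str]:
    """`-p 127.0.0.1::PORT` per agent-facing port (random host port, loopback only)."""
    args: list[str] = []
    for port in container_ports:
        args += ["-p", f"127.0.0.1::{port}"]
    return args


def parse_docker_port(output: str) -> int:
    """Extract the host port from `docker port` output like '127.0.0.1:49153'."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("127.0.0.1:"):
            return int(line.rsplit(":", 1)[1])
    raise ValueError(f"no 127.0.0.1 mapping in {output!r}")


def bundle_host_port(bundle: str, container_port: int) -> int:
    """Sketch of the `docker port <bundle> <port>/tcp` wrapper (needs docker)."""
    out = subprocess.run(
        ["docker", "port", bundle, f"{container_port}/tcp"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_docker_port(out)
```

Launch would then build `http://127.0.0.1:{bundle_host_port(bundle, 8888)}` for pipelock, and similarly for git-gate (9418) and supervise (9100).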

End-to-end verified on Docker Desktop / macOS: guest dials
pipelock through TSI, pipelock forwards to api.anthropic.com,
the API responds with 401 (i.e. it received the request).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 15:31:44 -04:00
didericis-claude ac8c7ba696 feat(smolmachines): provision_ca + provision_git + provision_supervise (PRD 0023 chunk 4d)
test / unit (pull_request) Successful in 26s
test / integration (pull_request) Successful in 43s
test / unit (push) Successful in 26s
test / integration (push) Successful in 42s
End-to-end provisioning parity with the docker backend. After this
chunk a smolmachines bottle has a working trust store, git-gate
gitconfig, and supervise MCP registration — same shape as docker,
dispatched via `smolvm machine cp` / `smolvm machine exec` instead
of `docker cp` / `docker exec`.

Adds three new provision modules:
- ca.py:        select egress vs pipelock CA (same logic as
                docker), machine cp + update-ca-certificates,
                log sha256 fingerprint.
- git.py:       copy host .git when --cwd was passed; render
                ~/.gitconfig with insteadOf URLs. URL prefix is
                `git://<bundle_ip>:9418/...` (no DNS in the
                TSI-allowlisted guest) vs docker's
                `git://git-gate/...`.
- supervise.py: `claude mcp add` via machine_exec; URL is
                `http://<bundle_ip>:9100/`. Failure is logged but
                non-fatal (matches docker).

Shared render: `render_git_gate_gitconfig` moves out of
backend/docker/provision/git.py into the platform-neutral
claude_bottle/git_gate.py (renamed to git_gate_render_gitconfig
for consistency with the existing git_gate_render_* helpers),
parameterized on a `gate_host` argument so both backends use the
same logic with different addresses.
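To illustrate why parameterizing on `gate_host` is enough, here is a simplified renderer. The `git://<gate_host>/<name>` layout and the `upstreams` mapping are hypothetical, not the real `git_gate_render_gitconfig` signature. Git's `url.<base>.insteadOf` rewrites any URL starting with the given prefix to start with `<base>`.

```python
def render_gitconfig(gate_host: str, upstreams: dict[str, str]) -> str:
    """Emit insteadOf rules redirecting each upstream URL through the gate.

    gate_host: 'git-gate' (docker, DNS) or '<bundle_ip>:9418' (smolmachines,
    no DNS in the TSI-allowlisted guest).
    """
    lines = []
    for name, upstream_url in sorted(upstreams.items()):
        lines.append(f'[url "git://{gate_host}/{name}"]')
        lines.append(f"\tinsteadOf = {upstream_url}")
    return "\n".join(lines) + "\n"
```

Only the host part differs between backends, so a single renderer serves both.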

Path/user fixups for the post-chunk-4c agent image (real
claude-bottle image, USER node, $HOME=/home/node):
- prompt.py default path moves from /root/... to
  /home/node/.claude-bottle-prompt.txt; chown + chmod after
  machine cp.
- skills.py default skills dir moves from /root/.claude/skills to
  /home/node/.claude/skills; chown -R per skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:15:58 -04:00
didericis-claude 1dfc359141 feat(smolmachines): thread inner Plans + bundle daemons run (PRD 0023 chunk 4b)
test / unit (pull_request) Successful in 21s
test / integration (pull_request) Successful in 42s
Bundle daemons (pipelock, egress, optionally git-gate + supervise)
now actually start with their config files bind-mounted from the
inner Plans the docker backend already produces. Chunks 2d + 3
ran with daemons_csv="" so the bundle's init supervisor idled;
chunk 4b wires up the real path: agent → pipelock → egress →
internet (when routes declared) is now functional, modulo agent-
image gaps (claude-code / TLS-trust-store / git in the guest)
that chunk 4c addresses.

bottle_plan.py — added the four inner Plan fields:
  proxy_plan: PipelockProxyPlan
  git_gate_plan: GitGatePlan
  egress_plan: EgressPlan
  supervise_plan: SupervisePlan | None

Same shape the docker backend's plan uses. Docker-network-only
fields (internal_network, egress_network) stay at dataclass
defaults — the smolmachines bundle is on a per-bottle bridge
with a pinned IP, not docker's --internal + egress topology.

prepare.py — instantiates DockerPipelockProxy / DockerEgress /
DockerGitGate / DockerSupervise and calls their .prepare()
methods to write the per-bottle config files (pipelock.yaml,
routes.yaml, git-gate entrypoint/hooks, supervise queue dir)
under the per-bottle state dir. (The "Docker" prefix on the
class names is a misnomer here — .prepare() is platform-neutral,
inherited from each sidecar's ABC. A future cleanup could factor
the prepare logic out of the docker subpackage.)

launch.py — major rewrite:
  - pipelock_tls_init at launch (always); egress_tls_init only
    when the bottle declares routes (otherwise the CA files
    aren't bind-mounted and openssl runs would be wasted).
  - Inner Plans updated in place with launch-time CA paths +
    EGRESS_UPSTREAM_PROXY = http://127.0.0.1:8888 (egress's
    upstream is pipelock on the bundle's own loopback; same
    container's network namespace).
  - BundleLaunchSpec env + volumes built from the inner Plans:
    pipelock.yaml + CA + key (always); egress routes + CAs +
    upstream env + token-slot bare names (when routes); git-gate
    entrypoint + hooks + per-upstream identity files (when
    upstreams); supervise queue dir + env (when enabled).
  - daemons_csv = ["egress", "pipelock"] + ["git-gate"] (if
    upstreams) + ["supervise"] (if enabled).
  - Token env values resolved from host env via
    `egress_resolve_token_values` and threaded into the
    docker-run subprocess env (bare-name -e entries in spec
    inherit from there — values never land on argv).
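The daemons_csv rule and the bare-name token passing can be sketched as below. Function names are hypothetical, and the resolved token dict stands in for what `egress_resolve_token_values` returns. The mechanism relies on `docker run -e NAME` (no `=`) copying NAME from the docker client's own environment.

```python
import os


def bundle_daemons(has_upstreams: bool, supervise_enabled: bool) -> str:
    """egress + pipelock always; git-gate / supervise only when configured."""
    daemons = ["egress", "pipelock"]
    if has_upstreams:
        daemons.append("git-gate")
    if supervise_enabled:
        daemons.append("supervise")
    return ",".join(daemons)


def token_env_args(token_values: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Bare `-e NAME` flags for argv, plus a subprocess env holding the values.

    The secret lives only in the docker client's environment, never on argv
    (so it doesn't show up in `ps` output).
    """
    argv: list[str] = []
    for name in sorted(token_values):
        argv += ["-e", name]
    env = {**os.environ, **token_values}
    return argv, env
```

The caller would pass `env` to `subprocess.run(["docker", "run", *argv, ...], env=env)`.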

Tests:
- 552 unit tests passing (no new unit cases; fixture updated to
  populate the new plan fields).
- 5 integration cases passing locally (Darwin + smolvm + docker
  + not GITEA_ACTIONS):
    * test_smoke_exec_echo — still works.
    * test_localhost_reach_probe — host loopback still refused.
    * test_egress_port_bypass_probe — <bundle-ip>:9099 still
      refused, NOW WITH EGRESS ACTUALLY RUNNING (chunk 3's
      127.0.0.1 bind-address is doing its job).
    * test_prompt_file_lands_in_guest — still works.
    * test_pipelock_answers_on_bundle_ip — NEW. From inside the
      guest, wget to <bundle-ip>:8888 gets an HTTP response
      (not "connection refused") — proves pipelock is actually
      listening and the bind-mount + CA generation path works.

What's left in chunk 4:
- 4c: agent-image-conversion (claude-code + git + curl +
  ca-certificates in the guest). Chunk 2d's alpine placeholder
  stays for now.
- 4d: provision_ca + provision_git + provision_supervise once
  the agent image has the required tools.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 05:29:02 -04:00
didericis-claude 9e3b7e441e feat(smolmachines): provision_prompt + provision_skills (PRD 0023 chunk 4a)
test / unit (pull_request) Successful in 21s
test / integration (pull_request) Successful in 43s
First slice of chunk 4: implement the two provisioning methods
that don't depend on agent-image tooling beyond `cp` and
`mkdir`. provision_ca / provision_git / provision_supervise
land once the agent-image gap is solved (chunk 4b+) — they need
update-ca-certificates, git, and the claude binary respectively,
none of which the chunk-2d alpine placeholder provides.

What this PR ships:

- `claude_bottle/backend/smolmachines/provision/` subpackage
  with `prompt.py` + `skills.py`. Each routes through
  `smolvm.machine_cp` / `machine_exec`. provision_prompt mirrors
  the docker contract (file always copied; return value drives
  --append-system-prompt-file iff the agent has a non-empty
  prompt). provision_skills runs mkdir + cp per skill, matching
  the docker backend's loop.
- prepare.py now writes the prompt file under
  agent_state_dir(slug) with the agent's `prompt` body, mode
  0o600. The in-guest path is `/root/.claude-bottle-prompt.txt`
  (alpine has no `node` user; will become `/home/node/...` once
  the real claude-bottle image lands).
- launch.py calls `provision(plan, machine_name)` after
  machine_start. The returned prompt path threads to
  SmolmachinesBottle so exec_claude can add
  --append-system-prompt-file when the agent has a prompt.
- backend.py: provision_prompt / provision_skills now real;
  provision_git is a deliberate stub (waiting on the git-gate
  inner Plan + git in the agent image). provision_supervise
  stays the chunk-2d stub.
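The provision_prompt contract (always copy, and return a path only when the agent has a prompt) can be sketched as follows. The `machine_cp` callable is injected here purely for testability, and `claude_argv` is a hypothetical helper showing how the return value drives `--append-system-prompt-file`.

```python
from pathlib import Path
from typing import Callable, Optional

GUEST_PROMPT_PATH = "/root/.claude-bottle-prompt.txt"  # alpine placeholder path


def provision_prompt(
    host_prompt_file: Path,
    agent_prompt: str,
    machine_cp: Callable[[Path, str], None],
) -> Optional[str]:
    """Always copy the prompt file; return the guest path iff the prompt is non-empty."""
    machine_cp(host_prompt_file, GUEST_PROMPT_PATH)
    return GUEST_PROMPT_PATH if agent_prompt else None


def claude_argv(prompt_path: Optional[str]) -> list[str]:
    """Add --append-system-prompt-file only when provisioning returned a path."""
    argv = ["claude"]
    if prompt_path is not None:
        argv += ["--append-system-prompt-file", prompt_path]
    return argv
```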

Tests:
- 7 new unit cases (test_smolmachines_provision.py): argv
  shape (mocked smolvm.machine_cp / .machine_exec),
  prompt return-value contract, no-op-with-no-skills,
  CLAUDE_BOTTLE_GUEST_SKILLS_DIR override, fail-on-missing-skill.
- 1 new integration case in test_smolmachines_launch.py:
  end-to-end verification that the prompt file lands in the
  alpine guest at /root/.claude-bottle-prompt.txt with the
  expected content (via `bottle.exec("cat ...")`). The smoke +
  the two TSI probes stay green.

552 unit + 4 integration (Darwin+smolvm+docker gated) passing.

What's left in chunk 4:
- 4b: thread the inner Plans (PipelockProxyPlan / EgressPlan /
  GitGatePlan / SupervisePlan) through prepare + launch so the
  bundle daemons actually run (currently daemons_csv="").
- 4c: the agent-image-conversion gap — get claude-code + git +
  curl + ca-certificates into the guest image (build a
  .smolmachine via `pack create --from-vm` after manual setup,
  or push the docker image to a registry smolvm can pull).
- 4d: provision_ca + provision_git + provision_supervise once
  4b + 4c land.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 05:08:17 -04:00