fix(smolmachines): docker push fails on Docker Desktop — daemon-side route differs from host loopback #74
Merged
didericis-claude
merged 13 commits from 2026-05-27 16:10:46 -04:00
fix-local-registry-docker-desktop into main
13 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
d02fe50193 |
fix(smolmachines): run claude mcp add as node so config lands in node's home
provision_supervise dispatched `claude mcp add --scope user` through `smolvm machine_exec`, which runs as root by default. The MCP entry got written to root's ~/.claude.json — but the agent's claude reads /home/node/.claude.json, so `/mcp` showed "No MCP servers configured" inside the bottle. Wrap the exec in `runuser -u node -- env HOME=/home/node ...` so the config writes to the right home. Same pattern as the interactive exec_claude / Bottle.exec wrappers — `smolvm machine_exec` is always root, so any command that touches user state has to switch UID + set HOME explicitly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
515306cd4a |
fix(smolmachines): restore /tmp + /var/tmp perms after smolvm pack remap
smolvm's pack process remaps OCI-layer ownership to the host invoker's uid for *every* directory, not just /home/node — so /tmp lands as `0755 501:dialout` instead of the standard `1777 root:root`. Non-root processes can't create per-uid scratch dirs in there. Claude-code's first Bash tool call fails with `EACCES: permission denied, mkdir '/tmp/claude-1000'`. Same workaround folded into the existing perms-repair sh -c: `chown root:root /tmp /var/tmp && chmod 1777 /tmp /var/tmp` next to the /home/node chown. One machine_exec round trip total. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
45c821a8f3 |
docs(smolmachines): note loopback-scope limitation + tracking issue
PR #74's Docker-Desktop pivot widened the smolmachines TSI allowlist from `<bundle-ip>/32` to `127.0.0.1/32` (TSI can't filter by port, and docker bridge IPs aren't reachable from macOS networking). The agent VM can therefore reach any service on macOS's loopback while the bottle is running — not just the bundle's published ports. README gets a "Smolmachines backend" subsection under Quickstart spelling this out as a known v1 limitation. PRD 0023 grows a new open question #8 with the proposed v2 fix (per-bottle loopback alias + TSI allowlist scoped to that /32, via sudo `ifconfig lo0 alias`). Tracking issue: gitea.dideric.is/didericis/claude-bottle/issues/75. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
5486170be1 |
fix(smolmachines): route agent through egress when routes declared, wait for VM warm-up
Two related bugs: 1. Auth chain bypassed egress. After the Docker-Desktop port pivot, the agent always dialed pipelock directly — meaning egress (which holds the real OAuth token and rewrites the Authorization header) wasn't in the request path. Bearer placeholder reached anthropic verbatim → 401 "Invalid bearer token". Fix: when the bottle declares egress.routes, the agent's first hop is egress (publish egress port 9099 to host loopback, leave pipelock bundle-internal). Without routes, the agent dials pipelock directly. Same hop order as the docker backend. 2. provision_ca's update-ca-certificates SIGKILLed at ~100ms on Docker Desktop. Back-to-back `smolvm machine exec` calls immediately after machine_start hit a VM warm-up race in libkrun's exec channel; the second exec's child got SIGKILL'd before producing more than the first line of stdout. The agent's trust store never got the egress MITM CA's hash symlink, so curl/openssl couldn't validate the TLS chain. Fix: 1.5s sleep after machine_start (empirically enough), plus fold provision_ca's chown + chmod + update-ca-certificates into one `sh -c` so we only pay one exec round trip. Bail with a clear error if update-ca- certificates doesn't report "1 added" (failing silently was how the original SIGKILL went unnoticed). Net effect on Docker Desktop / macOS: claude's HTTPS_PROXY is `http://127.0.0.1:<egress port>`, egress rewrites auth, pipelock allowlists + DLPs, request reaches api.anthropic.com with a real token. End-to-end verified. Also drops the PRD-0023-chunk-3 EGRESS_LISTEN_HOST=127.0.0.1 mitigation. The original concern (agent bypassing pipelock by dialing egress's port on the bundle IP) doesn't apply in this topology: the agent can only reach whatever port we publish on host loopback, and egress is the only HTTP/HTTPS chokepoint that gets published. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
4f136a9932 |
fix(smolmachines): agent dials bundle via host loopback ports, not docker bridge IP
Claude hung on outbound network calls under CLAUDE_BOTTLE_BACKEND=smolmachines: Unable to connect to API (FailedToOpenSocket) Root cause: the PRD-0023 design pinned the bundle at a docker bridge IP (192.168.X.2) and set the smolvm guest's TSI allowlist to `<bundle-ip>/32`. On native Linux this works — host shares the docker bridge's network namespace, TSI's syscall impersonation reaches the bridge IP directly. On Docker Desktop (macOS), the daemon runs in its own Linux VM and docker bridge IPs aren't reachable from macOS networking, so the smolvm guest's TSI requests die "Network is unreachable" before they hit pipelock. Fix: publish each agent-facing bundle daemon's port on host loopback (-p 127.0.0.1::PORT), discover the random host-side ports after start, and route the agent through `127.0.0.1:<host port>` instead of the bridge IP. macOS loopback is the surface Docker Desktop's gvproxy forwards into the daemon's VM, so the chain (guest TSI -> macOS loopback -> daemon VM port-forward -> bundle container) works on both Docker Desktop and native Linux. Concrete changes: - BundleLaunchSpec: add `ports_to_publish` so start_bundle adds `-p 127.0.0.1::PORT` for the agent-facing ports (pipelock always; git-gate when upstreams declared; supervise when enabled). Egress's port stays bundle-internal. - sidecar_bundle.bundle_host_port(): wrap `docker port <bundle> <container_port>/tcp` so launch can look up the random host-side mapping after start. - launch.py: discover the host ports, build URLs of the form `http://127.0.0.1:<host port>` / `git://127.0.0.1:<host port>`, stamp onto guest_env + new agent_*_url fields on the plan. - launch.py: TSI allow_cidrs flips to `["127.0.0.1/32"]`. The bundle IP is no longer the agent's target. - prepare.py: stop synthesizing HTTPS_PROXY / GIT_GATE_URL / MCP_SUPERVISE_URL at prepare time — launch owns those now (the values depend on a port docker hasn't assigned yet). - provision_git: gate_host from plan.agent_git_gate_host. - provision_supervise: URL from plan.agent_supervise_url. End-to-end verified on Docker Desktop / macOS: guest dials pipelock through TSI, pipelock forwards to api.anthropic.com, the API responds with 401 (i.e. it received the request). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
da1e5e1ba8 |
fix(smolmachines): pass --net explicitly when allow_cidrs is set
smolvm 0.8.0 docs say `--allow-cidr` implies `--net`, but empirically the implication only fires when no `--from` is set. `--from PATH --allow-cidr X/32` silently produces a machine with network: false and no routes in the guest — claude lands inside with HTTPS_PROXY pointing at the bundle's pinned IP but every connect fails with "Network is unreachable" / FailedToOpenSocket in claude's UI. Reproduce + verify: $ smolvm machine create --from <pack> --allow-cidr X/32 nettest $ smolvm machine ls --json | jq '.[].network' # false $ smolvm machine create --from <pack> --net --allow-cidr X/32 nettest2 $ smolvm machine ls --json | jq '.[].network' # true Add `--net` whenever `allow_cidrs` is non-empty. No change to the no-allow-cidr code path. Test added to lock down both branches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
91955ec59f |
fix(smolmachines): forward guest env on every exec + chown /home/node
Two issues kept claude's TUI from drawing after launch: 1. smolvm pack remaps OCI-layer ownership to the host invoker's uid (501 on macOS) instead of preserving the image's USER node (uid 1000). /home/node ends up owned by some uid that doesn't exist in the VM, so when claude runs as node it can't appendFileSync to ~/.claude.json on startup — fails with ENOENT and the TUI hangs. Fix: chown -R node:node /home/node after machine_start, before provision. 2. smolvm machine_create -e sets env on PID 1 but it doesn't propagate to fresh exec process trees (verified empirically: `smolvm machine exec -- printenv` shows none of the machine_create env vars). Claude was running with no HTTPS_PROXY / CLAUDE_CODE_OAUTH_TOKEN / NODE_EXTRA_CA_CERTS, so even the auth-validation step bailed silently. Fix: thread `guest_env` through to the SmolmachinesBottle handle and re-pass every entry via `-e K=V` on every machine_exec call (interactive claude and shell exec both). Also fills in the same `CLAUDE_CODE_OAUTH_TOKEN=egress- placeholder` + telemetry-off env the docker backend's forwarded_env carries, plus the NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / REQUESTS_CA_BUNDLE trust trio. Verified end-to-end on Docker Desktop / macOS: claude's TUI renders cleanly with the bypass-permissions banner. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
35edf50f21 |
fix(smolmachines): drop runuser -l in favor of UID switch + explicit HOME/USER
Interactive claude session hung silently after `attaching interactive claude session...` — `runuser -l` invokes a login shell that triggers PAM session setup / /etc/profile sourcing, and the minimal Debian agent VM doesn't have the PAM config files for that to complete cleanly. claude never got to draw its TUI. Switch UID via plain `runuser -u <user> --` (no `-l`) and inject HOME / USER through `smolvm machine exec -e` so the child process sees them. Avoids login-shell wiring entirely. Same pattern in `exec_claude` and `exec(script)`. `_HOME_FOR` maps the two users the codebase currently asks for (`node`, `root`); anything else falls back to `/home/<user>`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
af65c10361 |
refactor: Bottle.exec takes a user= kwarg, default node
Promote the user-switch from a hardcoded `node` to a keyword arg so callers can opt into root (or any other user) when needed. Default stays `node` — matches the docker image's USER and the smolmachines runuser default. Lifts the change through the base ABC, docker, and smolmachines backends: - Base: `def exec(self, script, *, user="node")`. - Docker: adds `-u <user>` to `docker exec` (no-op when user is node, the image's default). - Smolmachines: `runuser -l <user> -c <script>` — `runuser -l root` is the trivial no-op form when the caller asked for root. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
e26d459a97 |
fix(smolmachines): run claude + shell exec as the node user
`smolvm machine exec` runs commands as root in the VM, but the
agent image's USER is `node`. claude-code refuses
`--dangerously-skip-permissions` when invoked as root, killing
the interactive session right after `attaching interactive claude
session...`:
--dangerously-skip-permissions cannot be used with root/sudo
privileges for security reasons
Wrap both `exec_claude` and `exec(script)` in
`runuser -l node -c ...` so commands run as the node user with
node's $HOME / $USER (login shell). The docker backend gets
this behavior for free via the image's USER directive; this
restores parity.
shlex-quote each claude argv element when stitching the runuser
-c shell command so paths / flags with shell-special chars
survive the parse.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
||
|
|
906c9fd1bb |
fix(smolmachines): preflight print uses plan-level egress routes
`SmolmachinesBottlePlan.print` iterated over
`bottle.egress.routes` (the manifest's capitalized-attribute form
on `manifest.EgressRoute`) but accessed `r.host` (lowercase).
Worked when no egress routes were declared; AttributeError
("EgressRoute has no attribute 'host'") on the first bottle with
a route.
Switch to `self.egress_plan.routes` — the resolved plan-level
EgressRoute (lowercase `host`), same source the docker backend's
print uses.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
||
|
|
47eb56bd10 |
fix(smolmachines): use containerized crane to push, bypassing docker daemon's HTTPS preference
The previous fix (`host.docker.internal:<port>` for daemon-side push) still failed: Get "https://host.docker.internal:53958/v2/": http: server gave HTTP response to HTTPS client `host.docker.internal` is reachable from Docker Desktop's daemon VM but isn't in the daemon's default insecure-registries CIDRs (only `::1/128` and `127.0.0.0/8` are), so docker push tries HTTPS, hits a plain-HTTP registry, and refuses. The daemon.json fix (`"insecure-registries": ["host.docker.internal"]`) works but is a one-time manual step in Docker Desktop's UI — not something we can do for the user. Sidestep the daemon push entirely: 1. docker build (as before) — local layer cache makes no-change rebuilds cheap. 2. docker save the image to a per-digest tarball alongside the cached `.smolmachine`. 3. Start an ephemeral registry container on a per-session docker network, with `-p :5000` so the host can also reach it for the pack step. 4. docker run a one-shot crane container on the SAME network, mount the tarball, `crane push --insecure /img.tar <registry-container>:5000/...`. Container DNS resolves the registry on the network; `--insecure` forces plain HTTP. 5. `smolvm pack create --image localhost:<host port>/...` from the host. smolvm's bundled crane auto-falls-back to HTTP for localhost addresses, so no insecure-registries config is needed on that side. 6. Tear down everything; reap the tarball (registries hold the same bytes, no need to keep both around). Net effect: the docker daemon never does an HTTP/HTTPS-policy decision on our behalf. `docker push` is gone from the prepare path; `docker save`, `docker network create`, `docker run` (for registry + crane) replace it. Tested end-to-end on Docker Desktop / macOS: `_ensure_smolmachine ("claude-bottle:latest")` produces a 204MB `.smolmachine.smolmachine` artifact. Adds: - backend/docker/util.py:save() — thin docker save wrapper. - local_registry.crane_push_tarball() — one-shot crane run on the registry's network. - CRANE_IMAGE constant pinned by digest (gcr.io/go-containerregistry/crane@sha256:0ae17ecb...). Removes: - backend/docker/util.py:tag() / push() — unused without daemon push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
f4026ea3ae |
fix(smolmachines): docker push fails on Docker Desktop — daemon-side route differs from host loopback
`./cli.py start <agent>` under CLAUDE_BOTTLE_BACKEND=smolmachines died at `docker push localhost:<port>/claude-bottle:<id>` with `Get "http://localhost:<port>/v2/": context deadline exceeded`. Cause: chunk 4c bound the ephemeral registry to `127.0.0.1::5000` and used `localhost:<port>` as the only image-ref hostname. On Docker Desktop the daemon runs inside its own Linux VM — its `localhost` is the VM's loopback, not the host's, so the daemon cannot reach a registry bound to the host's 127.0.0.1. Fix: bind the registry to all interfaces (`-p :5000`) so it's reachable from both sides, and yield two endpoints: - `daemon_endpoint` — `host.docker.internal:<port>` on Docker Desktop (daemon-side hostname for the host VM gateway), `localhost:<port>` on a native Linux daemon that shares the host's network namespace. Used for `docker tag` + `docker push`. - `host_endpoint` — always `localhost:<port>`. Used for `smolvm pack create`, which runs as a host process. The registry stores images by repo+tag, so a push to `host.docker.internal:<port>/cb:<id>` and a pull from `localhost:<port>/cb:<id>` resolve to the same blob — the hostname in a ref is just routing. Detection uses `docker info --format '{{.OperatingSystem}}'`, which returns "Docker Desktop" on macOS/Windows Desktop and the host's OS name on native daemons. Trade-off: all-interface binding briefly publishes the registry on every interface (~5-10s during prepare). The pushed image is built from the public repo Dockerfile (no secrets), the port is random, and the window is short — acceptable for v1 of a personal dev tool. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |