fix(git-gate): bound daemon client sessions #114
Reference in New Issue
Block a user
Delete Branch "fix/git-gate-daemon-timeouts"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Closes #113
Tests
@@ -44,0 +44,4 @@# Bound half-open git client sessions. If an agent/tool runner is# interrupted during push, git daemon should reap the receive-pack# child instead of keeping the gate wedged indefinitely.GIT_GATE_DAEMON_TIMEOUT_SECS = 60Seems a bit long... can we reduce to 15s (weigh pros and cons of this)
Adjusted this to 15s in
7bffaa7. The tradeoff looks right here: this timeout covers the local VM-to-git-gate daemon session, so a healthy push should keep sending pack/protocol data and should not sit idle for 15 seconds. The upside is much faster cleanup for interrupted tool runs; the main downside is that an extremely stalled local transfer could be terminated sooner, but upstream SSH already has its own bounded connect behavior and this path is local to the bottle/sidecar.Re-ran
python3 -m unittest tests.unit.test_git_gatesuccessfully.I'm also wondering if there's a better solution than just a timeout... the daemon should be getting a permissions error, right? Can't we have a better way to fail fast on permission errors when trying to push?
I do not think this instance was a permission-error path. The stuck state happened before git-gate had accepted the pack and run the pre-receive hook: the attempted commit was not present in
/git/bot-bottle.git, and the matching sidecar children weregit receive-packwaiting on the client while the VM still hadgit push/pack-objectschildren alive after the tool interruption. A direct no-op upstream SSH push from git-gate to Gitea also succeeded, so upstream permission/auth was healthy.If the push reaches the hook and upstream rejects it, we should already fail fast: the hook runs
GIT_SSH_COMMAND=... git push origin ..., printsgit-gate: upstream push failed for <ref>, and exits non-zero. The bounded daemon timeout is aimed at the different failure mode where the local client side is abandoned mid-protocol and neither side gets a clean EOF.A stronger follow-up would be adding a regression/integration test that simulates an interrupted local git client and asserts the receive-pack child is reaped. I would keep that separate from this small fix because it needs process-level harnessing around git daemon.
I'm fairly sure the error happened specifically after a https push to something blocked by the egress proxy, not git-gate.(EDIT: seems like this was wrong/I misdiagnosed the issue). I'd like to try to replicate exactly what caused the hang so we can have misbehaving agents that try to https fail fast, which I don't think they do.Regardless, we should try to replicate exactly what caused the hang/write a test for that in this PR. This project is pre-release/no need for a quick fix, want to go deep.
Agreed. I added
630e65ewith a regression test that exercises the HTTPS/egress-shaped failure path using a realgit pushclient.The test spins up a local smart-HTTP-shaped endpoint that returns the same egress 403 when git asks for
info/refs?service=git-receive-pack, then asserts the git process exits within the test timeout instead of hanging. That covers the misbehaving-agent case where it tries to push over HTTP(S) through egress and should fail fast before any pack transfer or git-gate path is involved.I kept the daemon timeout too, because it covers the separate half-open git-gate client/session cleanup case we observed after the interrupted push.
Tested with:
Doesn't solve the core issue, which I'm still getting. Basic test is this:
claude-implementer)Here's the full transcript:
Replying to #114 (comment):
What we have learned so far about the hang:
Request receive-pack for '/bot-bottle.git', but never reaches the hook'sgit-gate: gitleaks scanning...output.git pushis sleeping and childgit pack-objectsis stuck; in the sidecar,git receive-packis waiting under agit-daemon --servechild./git/bot-bottle.gitsucceeds and runs the hook/upstream forward. Git-gate -> upstream SSH succeeds.So the current working theory is that the failure is specific to the smolvm guest -> Docker-published git-gate port path for push pack streams. It is not currently pointing at git-gate policy rejection, gitleaks, or Gitea permission failure.
Follow-up from the deeper dig:
The hang is reproducible on the smolvm guest -> Docker-published
git://receive-pack path, but not on the equivalent smart-HTTP path. The decisive checks were:git://works, but smolvm push overgit://stalls while preparing/sending the pack;I pushed
6ea19a8, which switches the smolmachines agent-facing git-gate transport to smart HTTP while keeping the same bare repos, pre-receive hook, gitleaks scan, and upstream SSH forwarding. Docker-backend git-gate still usesgit://. I also added coverage for the smart-HTTP wrapper, including theContent-Encoding: gzipCGI header case that initially made large real clones return an empty 200 response.#114 (comment)
Did we need to add smart-HTTP for the fix? Couldn't we have just used
ssh://instead ofgit://when setting up the git config instead? I thought we had already configured git-gate to use ssh. Did we remove that?We did not remove the SSH path. There are two separate hops here:
The second hop is still SSH. The git-gate mirror's pre-receive hook forwards accepted refs to upstream with the configured identity file and known_hosts file.
The broken hop is the first one. Today the agent-facing rewrite points the agent at the local gate using
git://.../<name>.git. The sidecar hasopenssh-clientfor the upstream hop, but it does not runsshd, does not provision an agent-facing SSH user/key, and does not expose an SSH/git-shell service for the agent to connect to.So yes,
ssh://could be a valid design for agent -> gate, but it would be a different gate surface: add an SSH server to the sidecar, provision trust/keys or some other auth model for the guest, restrict the account to git-shell or forced commands, publish another port through the smolvm loopback path, and test that whole setup. Smart HTTP is smaller here becausegit http-backendcan serve the existing bare repos and reuse the existing pre-receive/gitleaks/upstream-SSH enforcement without introducing another credential path.In short: SSH was and remains configured for gate -> Gitea. It was not configured for agent -> gate, and this fix only changes that agent-facing local hop from
git://to smart HTTP for smolmachines.@@ -0,0 +20,4 @@DEFAULT_PORT = 9420class GitHttpHandler(BaseHTTPRequestHandler):Would prefer we not roll our own git http handler/seems like we should be using either something already build into git (does it not have an inbuilt tool to do this) or some very lightweight, standard tool.
That being said, this isn't very big/complicated, so it may actually be better than something that introduces a bunch of complex features we don't need or a supply chain attack.
Replying to #114 (comment):
git http-backendis the built-in — the Python class is a CGI adapter around it, not a custom git implementation._run_backend()sets up the CGI environment and execsgit http-backendas a subprocess; the class itself contains no git protocol logic.The alternatives that would avoid the wrapper:
git http-backendnatively, but introduces a heavier daemon + config as a sidecar depwsgiref.handlers.CGIHandler: requires the same CGI env setup and a WSGI server — similar surface areagit http-backendhas no--servemode; it is CGI-only, so some server must bridge it. The class is ~120 lines, stdlib-only (zero new packages), and all enforcement (gitleaks, pre-receive, upstream SSH forward) still runs inside git hooks exactly as before.Sounds good, keep it