didericis/bot-bottle

Fork 0

Files

T

didericis-claude d1556f4659

test / unit (push) Successful in 41s

Details

test / integration (push) Successful in 48s

Details

docs(research): local ollama deployment, harness selection, and model sizing

2026-06-03 21:37:55 -04:00

13 KiB

Raw Permalink Blame History

Local Ollama: Deployment Topology, Harness Selection, and Model Sizing

Research notes on running Ollama locally for a bot-bottle coding agent workflow. Covers the native-vs-VM question, which harness integrates best with an agent loop, and which models make sense on an RTX 3070 (8 GB VRAM / 30 GB RAM) machine.

1. Deployment topology: native, container, or VM?

The core question is whether running Ollama in a VM significantly degrades inference performance. The short answer: a full KVM/QEMU VM with GPU passthrough adds roughly 2–5% overhead, Docker on Linux adds roughly 1–2%, and LXC containers add sub-1%. None of these are significant for interactive coding use.

Native (bare metal)

Zero overhead, immediate GPU access, simplest setup. The right default for a solo developer doing inference on their own workstation.

Docker containers on Linux + NVIDIA

With nvidia-container-toolkit and --gpus all, containerized Ollama runs at essentially native speed (~1–2% overhead on Linux). The dramatic exception is macOS, where Docker Desktop runs a Linux VM with no access to Apple's Metal/GPU — inference is 5–6× slower. On Linux/Windows with NVIDIA hardware, Docker is fine.

Common pitfall: if docker exec ollama ollama ps shows 0 GPU layers, the container fell back to CPU. Usual causes: stale VRAM allocation, missing nvidia-container-toolkit, or a host driver too old for the container's CUDA version.

KVM/QEMU VM with full PCIe passthrough

Full GPU passthrough makes the GPU invisible to the host while the VM owns it. Overhead from the IOMMU translation layer and virtualized PCIe bus is ~2–5%. This is viable if you need VM-level isolation (snapshotting, migration, separate kernel). Setup complexity is non-trivial: BIOS IOMMU, IOMMU group management, VFIO driver binding. Once configured it is stable.

Critical gotcha: set the VM's CPU type to host. If left at the default (x86-64-v2-AES / "QEMU Virtual CPU version 2.5+"), Ollama may silently disable GPU support even when drivers appear correct.

LXC containers (Proxmox et al.)

The sweet spot for isolation without overhead. Sub-1% performance difference from bare metal because LXC shares the host kernel; GPU device files are bind-mounted into the container. The tradeoff is weaker isolation (shared kernel) and the requirement that host and container driver versions match. Not suitable if you need VM-level snapshots or live migration.

Summary

Topology	GPU overhead	Isolation	Complexity
Native	0%	None	Low
Docker (Linux)	~1–2%	Process	Low
LXC	<1%	Namespace	Medium
KVM passthrough	2–5%	Full VM	High
VM no passthrough	CPU-only	Full VM	Medium

Running Ollama in a VM will not significantly slow inference as long as GPU passthrough is configured. Without passthrough (software rendering / CPU fallback) performance collapses — that is what the user is rightly worried about.

Local vs. remote server

Factor	Local machine	Remote server
Latency	Near-zero	Network round-trip; cumulative in agent loops
Cost	Zero after hardware	Per-token or subscription
Privacy	100% on-device	Data leaves the machine
Model size ceiling	VRAM-limited	No hard limit (671B+ feasible)
Offline use	Yes	No
Concurrency under load	Sequential by default	Scales horizontally

For agentic coding workflows making 20–50 tool calls per session, network latency accumulates quickly. Local inference eliminates this. A practical hybrid pattern: use the local GPU for routine coding loops; route only to a remote API for tasks requiring a 70B+ model or very long context (>128K tokens).

2. Harness selection

The landscape in 2026 has settled into three categories: IDE plugins, terminal agents, and chat UIs.

Continue.dev — recommended IDE plugin

Open-source VS Code / JetBrains / Zed / Vim extension. Routes autocomplete, chat, and refactoring commands to any configured LLM backend (Ollama, cloud APIs). The recommended setup uses two models: a small FIM-capable model for inline autocomplete (Qwen2.5-Coder 7B) and a larger model for chat/edit. Handles inline completions, multi-file edits, and codebase-aware chat. No API key, no data leaving the machine.

Aider — recommended for git-native terminal workflows

Terminal-based coding agent. Builds a codebase map before editing, makes changes directly, and auto-commits to git with readable messages. Every change is one git revert away. Supports 100+ languages; connects to any Ollama-served model via the OpenAI-compatible API. Best for terminal-first developers who want version-controlled agent interactions. Does not do inline autocomplete.

OpenCode — recommended for bot-bottle–style agent loops

Terminal-based coding agent with 15 built-in tools (bash execution, file read/write/edit, grep, glob, web fetch, MCP support) and connections to 75+ model providers including local Ollama models. This is the closest open-source equivalent to a Claude Code–style plan → tool-call → execute → observe → loop. Native Ollama integration.

Critical setup note: Ollama defaults to a 4096-token context window, which is completely insufficient for an agent loop carrying conversation history, tool schemas, a system prompt, and code simultaneously. Configure at least 64K tokens explicitly in the model's context settings.

Cline — agentic VS Code assistant

VS Code extension that operates as an autonomous agent: plans, edits files, runs commands in a loop, connects to Ollama's local endpoint. Compared to OpenCode it lives inside the IDE rather than the terminal; compared to Continue.dev it is a full agent rather than a plugin. Its system prompt overhead is higher (~7,000–10,000 tokens) than minimal harnesses.

Open WebUI / Jan / LM Studio — chat UIs, not coding harnesses

These are browser or desktop chat interfaces useful for ad-hoc conversations (explaining APIs, drafting documentation, exploring ideas) but without IDE integration, autocomplete, or git integration. LM Studio offers the smoothest onboarding (visual model browser with VRAM estimates). Jan is the most privacy-auditable (fully open-source, Apache 2.0, no telemetry). Neither is a replacement for a coding harness.

Harness comparison

Harness	Type	Autocomplete	Agent loop	Ollama	Git integration
Continue.dev	IDE plugin	Yes (FIM)	Basic	Native	No
Aider	Terminal agent	No	Multi-turn	Via API	Auto-commit
OpenCode	Terminal agent	No	Full tools	Native	Via bash
Cline	IDE agent	No	Full tools	Via API	Via bash
Open WebUI	Chat UI	No	No	Native	No
Jan	Chat UI	No	No	Native	No

For a bot-bottle workflow (an isolated sandbox running an agentic loop with tool access), OpenCode is the closest open-source match. For an IDE-first developer who wants autocomplete + chat, Continue.dev + Qwen2.5-Coder 7B is the recommended pair.

3. Model selection: RTX 3070 (8 GB VRAM / 30 GB RAM)

VRAM hard limits at Q4_K_M quantization

Model size	Approx. VRAM (Q4_K_M)	Fits in 8 GB?	Tokens/sec (RTX 3070)
3–4B	2.5–3.5 GB	Yes, with headroom	60–90
7–8B	5–6 GB	Yes	35–55
12–14B	7.5–9 GB	Edge / RAM offload	8–18
22B+	14+ GB	No	—

The RTX 3070 has high memory bandwidth for its VRAM tier and consistently outperforms the newer RTX 4060 Ti on token generation speed. Bandwidth matters more than raw compute for inference.

Does Gemma 4 exist?

Yes. Google released Gemma 4 on 2 April 2026 (Apache 2.0). The family includes E2B (2B), E4B (4B), a 26B MoE, and a 31B Dense. A 12B multimodal variant was announced 2026-06-04. The 31B scores 80.0% on LiveCodeBench v6 — a major jump from Gemma 3 27B at 29.1%. However, only the E4B fits comfortably within 8 GB VRAM:

Variant	VRAM (approx.)	Fits?
Gemma 4 E2B	~2 GB	Yes
Gemma 4 E4B	~5 GB	Yes
Gemma 4 12B	~8–9 GB (Q4)	Edge
Gemma 4 26B MoE	14–18 GB	No
Gemma 4 31B Dense	~20 GB	No

Model-by-model evaluation

Qwen2.5-Coder 7B — primary recommendation

The strongest purpose-built coding model that fits fully within 8 GB VRAM. Leads HumanEval among 7–8B-class models. Strong on Python, JavaScript, TypeScript. Has FIM (fill-in-the-middle) support for inline autocomplete. 35–55 tok/sec on RTX 3070.

ollama pull qwen2.5-coder:7b

Qwen2.5-Coder 14B — secondary, with RAM offloading

At Q4_K_M this needs ~8.7 GB, just over the 8 GB limit. With 30 GB system RAM, Ollama automatically offloads the overflow layers to CPU. Performance drops to ~8–18 tok/sec versus 35–55 tok/sec for the 7B fully in VRAM. Quality is noticeably better for complex multi-file reasoning. Viable for chat-based coding tasks where quality matters more than speed; too slow for live autocomplete. Keep context window at 8K tokens to minimize VRAM pressure during offloaded inference.

ollama pull qwen2.5-coder:14b

Gemma 4 E4B (~5 GB VRAM)

Fits comfortably with 3 GB to spare. Strong on reasoning, multimodal, and general-purpose tasks. Less specialized for coding than Qwen2.5-Coder 7B. Good choice for one model that covers coding + general reasoning + image analysis. The E4B outperforms Gemma 3 equivalents significantly on coding benchmarks.

ollama pull gemma4:e4b

Phi-4 Mini 3.8B (~3 GB VRAM)

Best reasoning-per-VRAM model; leaves ~5 GB free for other applications. Strong on math, logic, and structured output. Good for agentic sub-tasks requiring tight reasoning. Not the strongest at raw code synthesis but excellent for reasoning-heavy parts of a coding loop. Viable as the autocomplete model in a two-model Continue.dev setup.

ollama pull phi4-mini

DeepSeek-R1 8B (~5–6 GB VRAM)

Strong reasoning model for logic-heavy code (algorithms, correctness proofs). The full DeepSeek-Coder-V2 (236B MoE) is impractical here — only the 8B distilled variants are relevant. Outperforms Gemma 4 E4B on reasoning-heavy benchmarks; weaker on raw code generation than Qwen2.5-Coder 7B.

Codestral — not viable at 8 GB

The top FIM autocomplete model on HumanEval-FIM benchmarks, but requires 12–16 GB VRAM minimum. Not an option here. Worth revisiting if upgrading to a 12 GB+ card (RTX 4070 Super or newer).

RAM offloading: does 30 GB help?

Yes, meaningfully. Ollama automatically splits layers between GPU and system RAM when VRAM is exceeded. With 30 GB RAM, models up to ~14B at Q4_K_M run with partial offloading. The tradeoff is a 2–5× throughput penalty (8–18 tok/sec vs 35–55 tok/sec). Acceptable for batch tasks (reviewing a PR, generating an algorithm); too slow for live autocomplete.

Recommended setup

Autocomplete (fast, always-in-VRAM): qwen2.5-coder:7b

Configure in Continue.dev as the tab-completion model
FIM-capable; 35–55 tok/sec; fits with 2–3 GB VRAM to spare

Chat / agent loop (quality-first): qwen2.5-coder:14b or gemma4:e4b

14B for strongest multi-file coding; expect 8–18 tok/sec with RAM offload
Gemma 4 E4B if you want vision + general reasoning + coding in one model; ~60 tok/sec

Two-model Continue.dev config (lower VRAM pressure): phi4-mini (autocomplete) + qwen2.5-coder:7b (chat) — both fit simultaneously with ~1–2 GB to spare, keeping the OS and IDE from contending for VRAM.

13 KiB Raw Permalink Blame History Unescape Escape