Files
bot-bottle/docs/research/local-ollama-harness-and-model-selection.md
didericis-claude d1556f4659
test / unit (push) Successful in 41s
test / integration (push) Successful in 48s
docs(research): local ollama deployment, harness selection, and model sizing
2026-06-03 21:37:55 -04:00

13 KiB
Raw Permalink Blame History

Local Ollama: Deployment Topology, Harness Selection, and Model Sizing

Research notes on running Ollama locally for a bot-bottle coding agent workflow. Covers the native-vs-VM question, which harness integrates best with an agent loop, and which models make sense on an RTX 3070 (8 GB VRAM / 30 GB RAM) machine.


1. Deployment topology: native, container, or VM?

The core question is whether running Ollama in a VM significantly degrades inference performance. The short answer: a full KVM/QEMU VM with GPU passthrough adds roughly 25% overhead, Docker on Linux adds roughly 12%, and LXC containers add sub-1%. None of these are significant for interactive coding use.

Native (bare metal)

Zero overhead, immediate GPU access, simplest setup. The right default for a solo developer doing inference on their own workstation.

Docker containers on Linux + NVIDIA

With nvidia-container-toolkit and --gpus all, containerized Ollama runs at essentially native speed (~12% overhead on Linux). The dramatic exception is macOS, where Docker Desktop runs a Linux VM with no access to Apple's Metal/GPU — inference is 56× slower. On Linux/Windows with NVIDIA hardware, Docker is fine.

Common pitfall: if docker exec ollama ollama ps shows 0 GPU layers, the container fell back to CPU. Usual causes: stale VRAM allocation, missing nvidia-container-toolkit, or a host driver too old for the container's CUDA version.

KVM/QEMU VM with full PCIe passthrough

Full GPU passthrough makes the GPU invisible to the host while the VM owns it. Overhead from the IOMMU translation layer and virtualized PCIe bus is ~25%. This is viable if you need VM-level isolation (snapshotting, migration, separate kernel). Setup complexity is non-trivial: BIOS IOMMU, IOMMU group management, VFIO driver binding. Once configured it is stable.

Critical gotcha: set the VM's CPU type to host. If left at the default (x86-64-v2-AES / "QEMU Virtual CPU version 2.5+"), Ollama may silently disable GPU support even when drivers appear correct.

LXC containers (Proxmox et al.)

The sweet spot for isolation without overhead. Sub-1% performance difference from bare metal because LXC shares the host kernel; GPU device files are bind-mounted into the container. The tradeoff is weaker isolation (shared kernel) and the requirement that host and container driver versions match. Not suitable if you need VM-level snapshots or live migration.

Summary

Topology GPU overhead Isolation Complexity
Native 0% None Low
Docker (Linux) ~12% Process Low
LXC <1% Namespace Medium
KVM passthrough 25% Full VM High
VM no passthrough CPU-only Full VM Medium

Running Ollama in a VM will not significantly slow inference as long as GPU passthrough is configured. Without passthrough (software rendering / CPU fallback) performance collapses — that is what the user is rightly worried about.

Local vs. remote server

Factor Local machine Remote server
Latency Near-zero Network round-trip; cumulative in agent loops
Cost Zero after hardware Per-token or subscription
Privacy 100% on-device Data leaves the machine
Model size ceiling VRAM-limited No hard limit (671B+ feasible)
Offline use Yes No
Concurrency under load Sequential by default Scales horizontally

For agentic coding workflows making 2050 tool calls per session, network latency accumulates quickly. Local inference eliminates this. A practical hybrid pattern: use the local GPU for routine coding loops; route only to a remote API for tasks requiring a 70B+ model or very long context (>128K tokens).


2. Harness selection

The landscape in 2026 has settled into three categories: IDE plugins, terminal agents, and chat UIs.

Open-source VS Code / JetBrains / Zed / Vim extension. Routes autocomplete, chat, and refactoring commands to any configured LLM backend (Ollama, cloud APIs). The recommended setup uses two models: a small FIM-capable model for inline autocomplete (Qwen2.5-Coder 7B) and a larger model for chat/edit. Handles inline completions, multi-file edits, and codebase-aware chat. No API key, no data leaving the machine.

Terminal-based coding agent. Builds a codebase map before editing, makes changes directly, and auto-commits to git with readable messages. Every change is one git revert away. Supports 100+ languages; connects to any Ollama-served model via the OpenAI-compatible API. Best for terminal-first developers who want version-controlled agent interactions. Does not do inline autocomplete.

Terminal-based coding agent with 15 built-in tools (bash execution, file read/write/edit, grep, glob, web fetch, MCP support) and connections to 75+ model providers including local Ollama models. This is the closest open-source equivalent to a Claude Codestyle plan → tool-call → execute → observe → loop. Native Ollama integration.

Critical setup note: Ollama defaults to a 4096-token context window, which is completely insufficient for an agent loop carrying conversation history, tool schemas, a system prompt, and code simultaneously. Configure at least 64K tokens explicitly in the model's context settings.

Cline — agentic VS Code assistant

VS Code extension that operates as an autonomous agent: plans, edits files, runs commands in a loop, connects to Ollama's local endpoint. Compared to OpenCode it lives inside the IDE rather than the terminal; compared to Continue.dev it is a full agent rather than a plugin. Its system prompt overhead is higher (~7,00010,000 tokens) than minimal harnesses.

Open WebUI / Jan / LM Studio — chat UIs, not coding harnesses

These are browser or desktop chat interfaces useful for ad-hoc conversations (explaining APIs, drafting documentation, exploring ideas) but without IDE integration, autocomplete, or git integration. LM Studio offers the smoothest onboarding (visual model browser with VRAM estimates). Jan is the most privacy-auditable (fully open-source, Apache 2.0, no telemetry). Neither is a replacement for a coding harness.

Harness comparison

Harness Type Autocomplete Agent loop Ollama Git integration
Continue.dev IDE plugin Yes (FIM) Basic Native No
Aider Terminal agent No Multi-turn Via API Auto-commit
OpenCode Terminal agent No Full tools Native Via bash
Cline IDE agent No Full tools Via API Via bash
Open WebUI Chat UI No No Native No
Jan Chat UI No No Native No

For a bot-bottle workflow (an isolated sandbox running an agentic loop with tool access), OpenCode is the closest open-source match. For an IDE-first developer who wants autocomplete + chat, Continue.dev + Qwen2.5-Coder 7B is the recommended pair.


3. Model selection: RTX 3070 (8 GB VRAM / 30 GB RAM)

VRAM hard limits at Q4_K_M quantization

Model size Approx. VRAM (Q4_K_M) Fits in 8 GB? Tokens/sec (RTX 3070)
34B 2.53.5 GB Yes, with headroom 6090
78B 56 GB Yes 3555
1214B 7.59 GB Edge / RAM offload 818
22B+ 14+ GB No

The RTX 3070 has high memory bandwidth for its VRAM tier and consistently outperforms the newer RTX 4060 Ti on token generation speed. Bandwidth matters more than raw compute for inference.

Does Gemma 4 exist?

Yes. Google released Gemma 4 on 2 April 2026 (Apache 2.0). The family includes E2B (2B), E4B (4B), a 26B MoE, and a 31B Dense. A 12B multimodal variant was announced 2026-06-04. The 31B scores 80.0% on LiveCodeBench v6 — a major jump from Gemma 3 27B at 29.1%. However, only the E4B fits comfortably within 8 GB VRAM:

Variant VRAM (approx.) Fits?
Gemma 4 E2B ~2 GB Yes
Gemma 4 E4B ~5 GB Yes
Gemma 4 12B ~89 GB (Q4) Edge
Gemma 4 26B MoE 1418 GB No
Gemma 4 31B Dense ~20 GB No

Model-by-model evaluation

Qwen2.5-Coder 7B — primary recommendation

The strongest purpose-built coding model that fits fully within 8 GB VRAM. Leads HumanEval among 78B-class models. Strong on Python, JavaScript, TypeScript. Has FIM (fill-in-the-middle) support for inline autocomplete. 3555 tok/sec on RTX 3070.

ollama pull qwen2.5-coder:7b

Qwen2.5-Coder 14B — secondary, with RAM offloading

At Q4_K_M this needs ~8.7 GB, just over the 8 GB limit. With 30 GB system RAM, Ollama automatically offloads the overflow layers to CPU. Performance drops to ~818 tok/sec versus 3555 tok/sec for the 7B fully in VRAM. Quality is noticeably better for complex multi-file reasoning. Viable for chat-based coding tasks where quality matters more than speed; too slow for live autocomplete. Keep context window at 8K tokens to minimize VRAM pressure during offloaded inference.

ollama pull qwen2.5-coder:14b

Gemma 4 E4B (~5 GB VRAM)

Fits comfortably with 3 GB to spare. Strong on reasoning, multimodal, and general-purpose tasks. Less specialized for coding than Qwen2.5-Coder 7B. Good choice for one model that covers coding + general reasoning + image analysis. The E4B outperforms Gemma 3 equivalents significantly on coding benchmarks.

ollama pull gemma4:e4b

Phi-4 Mini 3.8B (~3 GB VRAM)

Best reasoning-per-VRAM model; leaves ~5 GB free for other applications. Strong on math, logic, and structured output. Good for agentic sub-tasks requiring tight reasoning. Not the strongest at raw code synthesis but excellent for reasoning-heavy parts of a coding loop. Viable as the autocomplete model in a two-model Continue.dev setup.

ollama pull phi4-mini

DeepSeek-R1 8B (~56 GB VRAM)

Strong reasoning model for logic-heavy code (algorithms, correctness proofs). The full DeepSeek-Coder-V2 (236B MoE) is impractical here — only the 8B distilled variants are relevant. Outperforms Gemma 4 E4B on reasoning-heavy benchmarks; weaker on raw code generation than Qwen2.5-Coder 7B.

Codestral — not viable at 8 GB

The top FIM autocomplete model on HumanEval-FIM benchmarks, but requires 1216 GB VRAM minimum. Not an option here. Worth revisiting if upgrading to a 12 GB+ card (RTX 4070 Super or newer).

RAM offloading: does 30 GB help?

Yes, meaningfully. Ollama automatically splits layers between GPU and system RAM when VRAM is exceeded. With 30 GB RAM, models up to ~14B at Q4_K_M run with partial offloading. The tradeoff is a 25× throughput penalty (818 tok/sec vs 3555 tok/sec). Acceptable for batch tasks (reviewing a PR, generating an algorithm); too slow for live autocomplete.

Autocomplete (fast, always-in-VRAM): qwen2.5-coder:7b

  • Configure in Continue.dev as the tab-completion model
  • FIM-capable; 3555 tok/sec; fits with 23 GB VRAM to spare

Chat / agent loop (quality-first): qwen2.5-coder:14b or gemma4:e4b

  • 14B for strongest multi-file coding; expect 818 tok/sec with RAM offload
  • Gemma 4 E4B if you want vision + general reasoning + coding in one model; ~60 tok/sec

Two-model Continue.dev config (lower VRAM pressure): phi4-mini (autocomplete) + qwen2.5-coder:7b (chat) — both fit simultaneously with ~12 GB to spare, keeping the OS and IDE from contending for VRAM.


Sources