13 KiB
Local Ollama: Deployment Topology, Harness Selection, and Model Sizing
Research notes on running Ollama locally for a bot-bottle coding agent workflow. Covers the native-vs-VM question, which harness integrates best with an agent loop, and which models make sense on an RTX 3070 (8 GB VRAM / 30 GB RAM) machine.
1. Deployment topology: native, container, or VM?
The core question is whether running Ollama in a VM significantly degrades inference performance. The short answer: a full KVM/QEMU VM with GPU passthrough adds roughly 2–5% overhead, Docker on Linux adds roughly 1–2%, and LXC containers add sub-1%. None of these are significant for interactive coding use.
Native (bare metal)
Zero overhead, immediate GPU access, simplest setup. The right default for a solo developer doing inference on their own workstation.
Docker containers on Linux + NVIDIA
With nvidia-container-toolkit and --gpus all, containerized Ollama runs at
essentially native speed (~1–2% overhead on Linux). The dramatic exception is macOS,
where Docker Desktop runs a Linux VM with no access to Apple's Metal/GPU — inference
is 5–6× slower. On Linux/Windows with NVIDIA hardware, Docker is fine.
Common pitfall: if docker exec ollama ollama ps shows 0 GPU layers, the container
fell back to CPU. Usual causes: stale VRAM allocation, missing nvidia-container-toolkit,
or a host driver too old for the container's CUDA version.
KVM/QEMU VM with full PCIe passthrough
Full GPU passthrough makes the GPU invisible to the host while the VM owns it. Overhead from the IOMMU translation layer and virtualized PCIe bus is ~2–5%. This is viable if you need VM-level isolation (snapshotting, migration, separate kernel). Setup complexity is non-trivial: BIOS IOMMU, IOMMU group management, VFIO driver binding. Once configured it is stable.
Critical gotcha: set the VM's CPU type to host. If left at the default
(x86-64-v2-AES / "QEMU Virtual CPU version 2.5+"), Ollama may silently disable GPU
support even when drivers appear correct.
LXC containers (Proxmox et al.)
The sweet spot for isolation without overhead. Sub-1% performance difference from bare metal because LXC shares the host kernel; GPU device files are bind-mounted into the container. The tradeoff is weaker isolation (shared kernel) and the requirement that host and container driver versions match. Not suitable if you need VM-level snapshots or live migration.
Summary
| Topology | GPU overhead | Isolation | Complexity |
|---|---|---|---|
| Native | 0% | None | Low |
| Docker (Linux) | ~1–2% | Process | Low |
| LXC | <1% | Namespace | Medium |
| KVM passthrough | 2–5% | Full VM | High |
| VM no passthrough | CPU-only | Full VM | Medium |
Running Ollama in a VM will not significantly slow inference as long as GPU passthrough is configured. Without passthrough (software rendering / CPU fallback) performance collapses — that is what the user is rightly worried about.
Local vs. remote server
| Factor | Local machine | Remote server |
|---|---|---|
| Latency | Near-zero | Network round-trip; cumulative in agent loops |
| Cost | Zero after hardware | Per-token or subscription |
| Privacy | 100% on-device | Data leaves the machine |
| Model size ceiling | VRAM-limited | No hard limit (671B+ feasible) |
| Offline use | Yes | No |
| Concurrency under load | Sequential by default | Scales horizontally |
For agentic coding workflows making 20–50 tool calls per session, network latency accumulates quickly. Local inference eliminates this. A practical hybrid pattern: use the local GPU for routine coding loops; route only to a remote API for tasks requiring a 70B+ model or very long context (>128K tokens).
2. Harness selection
The landscape in 2026 has settled into three categories: IDE plugins, terminal agents, and chat UIs.
Continue.dev — recommended IDE plugin
Open-source VS Code / JetBrains / Zed / Vim extension. Routes autocomplete, chat, and refactoring commands to any configured LLM backend (Ollama, cloud APIs). The recommended setup uses two models: a small FIM-capable model for inline autocomplete (Qwen2.5-Coder 7B) and a larger model for chat/edit. Handles inline completions, multi-file edits, and codebase-aware chat. No API key, no data leaving the machine.
Aider — recommended for git-native terminal workflows
Terminal-based coding agent. Builds a codebase map before editing, makes changes
directly, and auto-commits to git with readable messages. Every change is one
git revert away. Supports 100+ languages; connects to any Ollama-served model
via the OpenAI-compatible API. Best for terminal-first developers who want
version-controlled agent interactions. Does not do inline autocomplete.
OpenCode — recommended for bot-bottle–style agent loops
Terminal-based coding agent with 15 built-in tools (bash execution, file read/write/edit, grep, glob, web fetch, MCP support) and connections to 75+ model providers including local Ollama models. This is the closest open-source equivalent to a Claude Code–style plan → tool-call → execute → observe → loop. Native Ollama integration.
Critical setup note: Ollama defaults to a 4096-token context window, which is completely insufficient for an agent loop carrying conversation history, tool schemas, a system prompt, and code simultaneously. Configure at least 64K tokens explicitly in the model's context settings.
Cline — agentic VS Code assistant
VS Code extension that operates as an autonomous agent: plans, edits files, runs commands in a loop, connects to Ollama's local endpoint. Compared to OpenCode it lives inside the IDE rather than the terminal; compared to Continue.dev it is a full agent rather than a plugin. Its system prompt overhead is higher (~7,000–10,000 tokens) than minimal harnesses.
Open WebUI / Jan / LM Studio — chat UIs, not coding harnesses
These are browser or desktop chat interfaces useful for ad-hoc conversations (explaining APIs, drafting documentation, exploring ideas) but without IDE integration, autocomplete, or git integration. LM Studio offers the smoothest onboarding (visual model browser with VRAM estimates). Jan is the most privacy-auditable (fully open-source, Apache 2.0, no telemetry). Neither is a replacement for a coding harness.
Harness comparison
| Harness | Type | Autocomplete | Agent loop | Ollama | Git integration |
|---|---|---|---|---|---|
| Continue.dev | IDE plugin | Yes (FIM) | Basic | Native | No |
| Aider | Terminal agent | No | Multi-turn | Via API | Auto-commit |
| OpenCode | Terminal agent | No | Full tools | Native | Via bash |
| Cline | IDE agent | No | Full tools | Via API | Via bash |
| Open WebUI | Chat UI | No | No | Native | No |
| Jan | Chat UI | No | No | Native | No |
For a bot-bottle workflow (an isolated sandbox running an agentic loop with tool access), OpenCode is the closest open-source match. For an IDE-first developer who wants autocomplete + chat, Continue.dev + Qwen2.5-Coder 7B is the recommended pair.
3. Model selection: RTX 3070 (8 GB VRAM / 30 GB RAM)
VRAM hard limits at Q4_K_M quantization
| Model size | Approx. VRAM (Q4_K_M) | Fits in 8 GB? | Tokens/sec (RTX 3070) |
|---|---|---|---|
| 3–4B | 2.5–3.5 GB | Yes, with headroom | 60–90 |
| 7–8B | 5–6 GB | Yes | 35–55 |
| 12–14B | 7.5–9 GB | Edge / RAM offload | 8–18 |
| 22B+ | 14+ GB | No | — |
The RTX 3070 has high memory bandwidth for its VRAM tier and consistently outperforms the newer RTX 4060 Ti on token generation speed. Bandwidth matters more than raw compute for inference.
Does Gemma 4 exist?
Yes. Google released Gemma 4 on 2 April 2026 (Apache 2.0). The family includes E2B (2B), E4B (4B), a 26B MoE, and a 31B Dense. A 12B multimodal variant was announced 2026-06-04. The 31B scores 80.0% on LiveCodeBench v6 — a major jump from Gemma 3 27B at 29.1%. However, only the E4B fits comfortably within 8 GB VRAM:
| Variant | VRAM (approx.) | Fits? |
|---|---|---|
| Gemma 4 E2B | ~2 GB | Yes |
| Gemma 4 E4B | ~5 GB | Yes |
| Gemma 4 12B | ~8–9 GB (Q4) | Edge |
| Gemma 4 26B MoE | 14–18 GB | No |
| Gemma 4 31B Dense | ~20 GB | No |
Model-by-model evaluation
Qwen2.5-Coder 7B — primary recommendation
The strongest purpose-built coding model that fits fully within 8 GB VRAM. Leads HumanEval among 7–8B-class models. Strong on Python, JavaScript, TypeScript. Has FIM (fill-in-the-middle) support for inline autocomplete. 35–55 tok/sec on RTX 3070.
ollama pull qwen2.5-coder:7b
Qwen2.5-Coder 14B — secondary, with RAM offloading
At Q4_K_M this needs ~8.7 GB, just over the 8 GB limit. With 30 GB system RAM, Ollama automatically offloads the overflow layers to CPU. Performance drops to ~8–18 tok/sec versus 35–55 tok/sec for the 7B fully in VRAM. Quality is noticeably better for complex multi-file reasoning. Viable for chat-based coding tasks where quality matters more than speed; too slow for live autocomplete. Keep context window at 8K tokens to minimize VRAM pressure during offloaded inference.
ollama pull qwen2.5-coder:14b
Gemma 4 E4B (~5 GB VRAM)
Fits comfortably with 3 GB to spare. Strong on reasoning, multimodal, and general-purpose tasks. Less specialized for coding than Qwen2.5-Coder 7B. Good choice for one model that covers coding + general reasoning + image analysis. The E4B outperforms Gemma 3 equivalents significantly on coding benchmarks.
ollama pull gemma4:e4b
Phi-4 Mini 3.8B (~3 GB VRAM)
Best reasoning-per-VRAM model; leaves ~5 GB free for other applications. Strong on math, logic, and structured output. Good for agentic sub-tasks requiring tight reasoning. Not the strongest at raw code synthesis but excellent for reasoning-heavy parts of a coding loop. Viable as the autocomplete model in a two-model Continue.dev setup.
ollama pull phi4-mini
DeepSeek-R1 8B (~5–6 GB VRAM)
Strong reasoning model for logic-heavy code (algorithms, correctness proofs). The full DeepSeek-Coder-V2 (236B MoE) is impractical here — only the 8B distilled variants are relevant. Outperforms Gemma 4 E4B on reasoning-heavy benchmarks; weaker on raw code generation than Qwen2.5-Coder 7B.
Codestral — not viable at 8 GB
The top FIM autocomplete model on HumanEval-FIM benchmarks, but requires 12–16 GB VRAM minimum. Not an option here. Worth revisiting if upgrading to a 12 GB+ card (RTX 4070 Super or newer).
RAM offloading: does 30 GB help?
Yes, meaningfully. Ollama automatically splits layers between GPU and system RAM when VRAM is exceeded. With 30 GB RAM, models up to ~14B at Q4_K_M run with partial offloading. The tradeoff is a 2–5× throughput penalty (8–18 tok/sec vs 35–55 tok/sec). Acceptable for batch tasks (reviewing a PR, generating an algorithm); too slow for live autocomplete.
Recommended setup
Autocomplete (fast, always-in-VRAM): qwen2.5-coder:7b
- Configure in Continue.dev as the tab-completion model
- FIM-capable; 35–55 tok/sec; fits with 2–3 GB VRAM to spare
Chat / agent loop (quality-first): qwen2.5-coder:14b or gemma4:e4b
- 14B for strongest multi-file coding; expect 8–18 tok/sec with RAM offload
- Gemma 4 E4B if you want vision + general reasoning + coding in one model; ~60 tok/sec
Two-model Continue.dev config (lower VRAM pressure):
phi4-mini (autocomplete) + qwen2.5-coder:7b (chat) — both fit simultaneously with
~1–2 GB to spare, keeping the OS and IDE from contending for VRAM.
Sources
- Ollama on Proxmox: GPU Passthrough for LXC and VM AI Workloads
- Run Ollama with NVIDIA GPU in Proxmox VMs and LXC containers
- Ollama Performance Tuning: Getting Maximum Speed from Local LLMs
- Pros and Cons: Containerized Ollama vs. Local Setup
- Best Local Coding Models Ranked: Every VRAM Tier (2026)
- Best Local LLMs for RTX 4060, RTX 3070, and RTX 5060
- Best Local LLMs for 8GB VRAM: Real Hardware Benchmarks (2026)
- Self-Hosted AI Coding Agent: Ollama + Continue + Open WebUI Setup in 2026
- Best Local-First AI Coding Tools 2026: 14 Compared
- OpenCode + Ollama: Private Local AI Coding Agent Setup
- Gemma 4: Google DeepMind
- Running Gemma 4 Locally: VRAM Requirements
- Phi-4 Mini vs. Gemma 3 vs. Qwen 2.5: Best SLM for Coding Tasks in 2026
- Qwen2.5-Coder 14B VRAM Requirements Guide
- Comparing AI Harnesses: OpenCode, Ollama, LM Studio, Claude Code, Open WebUI, and VS Code