Research: local Ollama deployment, harness selection, and model sizing #183

Merged
didericis merged 1 commits from research-local-ollama-harness into main 2026-06-03 21:37:56 -04:00
@@ -0,0 +1,278 @@
# Local Ollama: Deployment Topology, Harness Selection, and Model Sizing
Research notes on running Ollama locally for a bot-bottle coding agent workflow.
Covers the native-vs-VM question, which harness integrates best with an agent loop,
and which models make sense on an RTX 3070 (8 GB VRAM / 30 GB RAM) machine.
---
## 1. Deployment topology: native, container, or VM?
The core question is whether running Ollama in a VM significantly degrades inference
performance. The short answer: a full KVM/QEMU VM with GPU passthrough adds roughly
25% overhead, Docker on Linux adds roughly 12%, and LXC containers add sub-1%. None
of these are significant for interactive coding use.
### Native (bare metal)
Zero overhead, immediate GPU access, simplest setup. The right default for a solo
developer doing inference on their own workstation.
### Docker containers on Linux + NVIDIA
With `nvidia-container-toolkit` and `--gpus all`, containerized Ollama runs at
essentially native speed (~12% overhead on Linux). The dramatic exception is macOS,
where Docker Desktop runs a Linux VM with no access to Apple's Metal/GPU — inference
is 56× slower. On Linux/Windows with NVIDIA hardware, Docker is fine.
Common pitfall: if `docker exec ollama ollama ps` shows 0 GPU layers, the container
fell back to CPU. Usual causes: stale VRAM allocation, missing `nvidia-container-toolkit`,
or a host driver too old for the container's CUDA version.
### KVM/QEMU VM with full PCIe passthrough
Full GPU passthrough makes the GPU invisible to the host while the VM owns it. Overhead
from the IOMMU translation layer and virtualized PCIe bus is ~25%. This is viable if
you need VM-level isolation (snapshotting, migration, separate kernel). Setup complexity
is non-trivial: BIOS IOMMU, IOMMU group management, VFIO driver binding. Once configured
it is stable.
**Critical gotcha:** set the VM's CPU type to `host`. If left at the default
(`x86-64-v2-AES` / "QEMU Virtual CPU version 2.5+"), Ollama may silently disable GPU
support even when drivers appear correct.
### LXC containers (Proxmox et al.)
The sweet spot for isolation without overhead. Sub-1% performance difference from bare
metal because LXC shares the host kernel; GPU device files are bind-mounted into the
container. The tradeoff is weaker isolation (shared kernel) and the requirement that
host and container driver versions match. Not suitable if you need VM-level snapshots
or live migration.
### Summary
| Topology | GPU overhead | Isolation | Complexity |
|---|---|---|---|
| Native | 0% | None | Low |
| Docker (Linux) | ~12% | Process | Low |
| LXC | <1% | Namespace | Medium |
| KVM passthrough | 25% | Full VM | High |
| VM no passthrough | CPU-only | Full VM | Medium |
Running Ollama in a VM will **not** significantly slow inference as long as GPU passthrough
is configured. Without passthrough (software rendering / CPU fallback) performance
collapses — that is what the user is rightly worried about.
### Local vs. remote server
| Factor | Local machine | Remote server |
|---|---|---|
| Latency | Near-zero | Network round-trip; cumulative in agent loops |
| Cost | Zero after hardware | Per-token or subscription |
| Privacy | 100% on-device | Data leaves the machine |
| Model size ceiling | VRAM-limited | No hard limit (671B+ feasible) |
| Offline use | Yes | No |
| Concurrency under load | Sequential by default | Scales horizontally |
For agentic coding workflows making 2050 tool calls per session, network latency
accumulates quickly. Local inference eliminates this. A practical hybrid pattern:
use the local GPU for routine coding loops; route only to a remote API for tasks
requiring a 70B+ model or very long context (>128K tokens).
---
## 2. Harness selection
The landscape in 2026 has settled into three categories: IDE plugins, terminal agents,
and chat UIs.
### Continue.dev — recommended IDE plugin
Open-source VS Code / JetBrains / Zed / Vim extension. Routes autocomplete, chat, and
refactoring commands to any configured LLM backend (Ollama, cloud APIs). The recommended
setup uses two models: a small FIM-capable model for inline autocomplete (Qwen2.5-Coder 7B)
and a larger model for chat/edit. Handles inline completions, multi-file edits, and
codebase-aware chat. No API key, no data leaving the machine.
### Aider — recommended for git-native terminal workflows
Terminal-based coding agent. Builds a codebase map before editing, makes changes
directly, and auto-commits to git with readable messages. Every change is one
`git revert` away. Supports 100+ languages; connects to any Ollama-served model
via the OpenAI-compatible API. Best for terminal-first developers who want
version-controlled agent interactions. Does not do inline autocomplete.
### OpenCode — recommended for bot-bottlestyle agent loops
Terminal-based coding agent with 15 built-in tools (bash execution, file read/write/edit,
grep, glob, web fetch, MCP support) and connections to 75+ model providers including
local Ollama models. This is the closest open-source equivalent to a Claude Codestyle
plan → tool-call → execute → observe → loop. Native Ollama integration.
**Critical setup note:** Ollama defaults to a 4096-token context window, which is
completely insufficient for an agent loop carrying conversation history, tool schemas,
a system prompt, and code simultaneously. Configure at least 64K tokens explicitly
in the model's context settings.
### Cline — agentic VS Code assistant
VS Code extension that operates as an autonomous agent: plans, edits files, runs commands
in a loop, connects to Ollama's local endpoint. Compared to OpenCode it lives inside the
IDE rather than the terminal; compared to Continue.dev it is a full agent rather than a
plugin. Its system prompt overhead is higher (~7,00010,000 tokens) than minimal harnesses.
### Open WebUI / Jan / LM Studio — chat UIs, not coding harnesses
These are browser or desktop chat interfaces useful for ad-hoc conversations (explaining
APIs, drafting documentation, exploring ideas) but without IDE integration, autocomplete,
or git integration. LM Studio offers the smoothest onboarding (visual model browser with
VRAM estimates). Jan is the most privacy-auditable (fully open-source, Apache 2.0, no
telemetry). Neither is a replacement for a coding harness.
### Harness comparison
| Harness | Type | Autocomplete | Agent loop | Ollama | Git integration |
|---|---|---|---|---|---|
| Continue.dev | IDE plugin | Yes (FIM) | Basic | Native | No |
| Aider | Terminal agent | No | Multi-turn | Via API | Auto-commit |
| OpenCode | Terminal agent | No | Full tools | Native | Via bash |
| Cline | IDE agent | No | Full tools | Via API | Via bash |
| Open WebUI | Chat UI | No | No | Native | No |
| Jan | Chat UI | No | No | Native | No |
For a bot-bottle workflow (an isolated sandbox running an agentic loop with tool access),
**OpenCode** is the closest open-source match. For an IDE-first developer who wants
autocomplete + chat, **Continue.dev + Qwen2.5-Coder 7B** is the recommended pair.
---
## 3. Model selection: RTX 3070 (8 GB VRAM / 30 GB RAM)
### VRAM hard limits at Q4_K_M quantization
| Model size | Approx. VRAM (Q4_K_M) | Fits in 8 GB? | Tokens/sec (RTX 3070) |
|---|---|---|---|
| 34B | 2.53.5 GB | Yes, with headroom | 6090 |
| 78B | 56 GB | Yes | 3555 |
| 1214B | 7.59 GB | Edge / RAM offload | 818 |
| 22B+ | 14+ GB | No | — |
The RTX 3070 has high memory bandwidth for its VRAM tier and consistently outperforms
the newer RTX 4060 Ti on token generation speed. Bandwidth matters more than raw compute
for inference.
### Does Gemma 4 exist?
Yes. Google released **Gemma 4** on 2 April 2026 (Apache 2.0). The family includes
E2B (2B), E4B (4B), a 26B MoE, and a 31B Dense. A 12B multimodal variant was announced
2026-06-04. The 31B scores 80.0% on LiveCodeBench v6 — a major jump from Gemma 3 27B
at 29.1%. However, only the E4B fits comfortably within 8 GB VRAM:
| Variant | VRAM (approx.) | Fits? |
|---|---|---|
| Gemma 4 E2B | ~2 GB | Yes |
| Gemma 4 E4B | ~5 GB | Yes |
| Gemma 4 12B | ~89 GB (Q4) | Edge |
| Gemma 4 26B MoE | 1418 GB | No |
| Gemma 4 31B Dense | ~20 GB | No |
### Model-by-model evaluation
**Qwen2.5-Coder 7B — primary recommendation**
The strongest purpose-built coding model that fits fully within 8 GB VRAM. Leads
HumanEval among 78B-class models. Strong on Python, JavaScript, TypeScript. Has
FIM (fill-in-the-middle) support for inline autocomplete. 3555 tok/sec on RTX 3070.
```
ollama pull qwen2.5-coder:7b
```
**Qwen2.5-Coder 14B — secondary, with RAM offloading**
At Q4_K_M this needs ~8.7 GB, just over the 8 GB limit. With 30 GB system RAM, Ollama
automatically offloads the overflow layers to CPU. Performance drops to ~818 tok/sec
versus 3555 tok/sec for the 7B fully in VRAM. Quality is noticeably better for complex
multi-file reasoning. Viable for chat-based coding tasks where quality matters more than
speed; too slow for live autocomplete. Keep context window at 8K tokens to minimize
VRAM pressure during offloaded inference.
```
ollama pull qwen2.5-coder:14b
```
**Gemma 4 E4B (~5 GB VRAM)**
Fits comfortably with 3 GB to spare. Strong on reasoning, multimodal, and general-purpose
tasks. Less specialized for coding than Qwen2.5-Coder 7B. Good choice for one model that
covers coding + general reasoning + image analysis. The E4B outperforms Gemma 3 equivalents
significantly on coding benchmarks.
```
ollama pull gemma4:e4b
```
**Phi-4 Mini 3.8B (~3 GB VRAM)**
Best reasoning-per-VRAM model; leaves ~5 GB free for other applications. Strong on math,
logic, and structured output. Good for agentic sub-tasks requiring tight reasoning. Not the
strongest at raw code synthesis but excellent for reasoning-heavy parts of a coding loop.
Viable as the autocomplete model in a two-model Continue.dev setup.
```
ollama pull phi4-mini
```
**DeepSeek-R1 8B (~56 GB VRAM)**
Strong reasoning model for logic-heavy code (algorithms, correctness proofs). The full
DeepSeek-Coder-V2 (236B MoE) is impractical here — only the 8B distilled variants are
relevant. Outperforms Gemma 4 E4B on reasoning-heavy benchmarks; weaker on raw code
generation than Qwen2.5-Coder 7B.
**Codestral — not viable at 8 GB**
The top FIM autocomplete model on HumanEval-FIM benchmarks, but requires 1216 GB VRAM
minimum. Not an option here. Worth revisiting if upgrading to a 12 GB+ card (RTX 4070
Super or newer).
### RAM offloading: does 30 GB help?
Yes, meaningfully. Ollama automatically splits layers between GPU and system RAM when
VRAM is exceeded. With 30 GB RAM, models up to ~14B at Q4_K_M run with partial offloading.
The tradeoff is a 25× throughput penalty (818 tok/sec vs 3555 tok/sec). Acceptable
for batch tasks (reviewing a PR, generating an algorithm); too slow for live autocomplete.
### Recommended setup
**Autocomplete (fast, always-in-VRAM):** `qwen2.5-coder:7b`
- Configure in Continue.dev as the tab-completion model
- FIM-capable; 3555 tok/sec; fits with 23 GB VRAM to spare
**Chat / agent loop (quality-first):** `qwen2.5-coder:14b` or `gemma4:e4b`
- 14B for strongest multi-file coding; expect 818 tok/sec with RAM offload
- Gemma 4 E4B if you want vision + general reasoning + coding in one model; ~60 tok/sec
**Two-model Continue.dev config (lower VRAM pressure):**
`phi4-mini` (autocomplete) + `qwen2.5-coder:7b` (chat) — both fit simultaneously with
~12 GB to spare, keeping the OS and IDE from contending for VRAM.
---
## Sources
- [Ollama on Proxmox: GPU Passthrough for LXC and VM AI Workloads](https://linuxprofessional.ie/article.php?slug=ollama-proxmox-gpu-passthrough-lxc-vm)
- [Run Ollama with NVIDIA GPU in Proxmox VMs and LXC containers](https://www.virtualizationhowto.com/2025/05/run-ollama-with-nvidia-gpu-in-proxmox-vms-and-lxc-containers/)
- [Ollama Performance Tuning: Getting Maximum Speed from Local LLMs](https://dasroot.net/posts/2026/01/ollama-performance-tuning-gpu-acceleration-model-quantization/)
- [Pros and Cons: Containerized Ollama vs. Local Setup](https://alain-airom.medium.com/pros-and-cons-using-containerized-ollama-vs-local-setup-d9bdf225bbb5)
- [Best Local Coding Models Ranked: Every VRAM Tier (2026)](https://insiderllm.com/guides/best-local-coding-models-2026/)
- [Best Local LLMs for RTX 4060, RTX 3070, and RTX 5060](https://aiagentskit.com/blog/best-local-llms-rtx-4060-3070-5060/)
- [Best Local LLMs for 8GB VRAM: Real Hardware Benchmarks (2026)](https://localllm.in/blog/best-local-llms-8gb-vram-2025)
- [Self-Hosted AI Coding Agent: Ollama + Continue + Open WebUI Setup in 2026](https://www.web3aiblog.com/blog/self-hosted-ai-coding-agent-ollama-continue-2026)
- [Best Local-First AI Coding Tools 2026: 14 Compared](https://nimbalyst.com/blog/best-local-first-ai-coding-tools-2026/)
- [OpenCode + Ollama: Private Local AI Coding Agent Setup](https://lushbinary.com/blog/opencode-ollama-local-ai-coding-privacy-guide/)
- [Gemma 4: Google DeepMind](https://deepmind.google/models/gemma/gemma-4/)
- [Running Gemma 4 Locally: VRAM Requirements](https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/)
- [Phi-4 Mini vs. Gemma 3 vs. Qwen 2.5: Best SLM for Coding Tasks in 2026](https://botmonster.com/ai/phi-4-mini-vs-gemma-3-vs-qwen-25-best-slm-coding-2026/)
- [Qwen2.5-Coder 14B VRAM Requirements Guide](https://willitrunai.com/blog/qwen-2-5-coder-14b-vram-requirements)
- [Comparing AI Harnesses: OpenCode, Ollama, LM Studio, Claude Code, Open WebUI, and VS Code](https://jace.pro/blog/comparing-ai-harnesses-opencode-ollama-lm-studio-claude-code-open-webui-and-vs-code/)