From d1556f46590246c5426af2fcaa1271904633b37e Mon Sep 17 00:00:00 2001 From: claude Date: Thu, 4 Jun 2026 01:26:11 +0000 Subject: [PATCH] docs(research): local ollama deployment, harness selection, and model sizing --- ...ocal-ollama-harness-and-model-selection.md | 278 ++++++++++++++++++ 1 file changed, 278 insertions(+) create mode 100644 docs/research/local-ollama-harness-and-model-selection.md diff --git a/docs/research/local-ollama-harness-and-model-selection.md b/docs/research/local-ollama-harness-and-model-selection.md new file mode 100644 index 0000000..df365c5 --- /dev/null +++ b/docs/research/local-ollama-harness-and-model-selection.md @@ -0,0 +1,278 @@ +# Local Ollama: Deployment Topology, Harness Selection, and Model Sizing + +Research notes on running Ollama locally for a bot-bottle coding agent workflow. +Covers the native-vs-VM question, which harness integrates best with an agent loop, +and which models make sense on an RTX 3070 (8 GB VRAM / 30 GB RAM) machine. + +--- + +## 1. Deployment topology: native, container, or VM? + +The core question is whether running Ollama in a VM significantly degrades inference +performance. The short answer: a full KVM/QEMU VM with GPU passthrough adds roughly +2–5% overhead, Docker on Linux adds roughly 1–2%, and LXC containers add sub-1%. None +of these are significant for interactive coding use. + +### Native (bare metal) + +Zero overhead, immediate GPU access, simplest setup. The right default for a solo +developer doing inference on their own workstation. + +### Docker containers on Linux + NVIDIA + +With `nvidia-container-toolkit` and `--gpus all`, containerized Ollama runs at +essentially native speed (~1–2% overhead on Linux). The dramatic exception is macOS, +where Docker Desktop runs a Linux VM with no access to Apple's Metal/GPU — inference +is 5–6× slower. On Linux/Windows with NVIDIA hardware, Docker is fine. + +Common pitfall: if `docker exec ollama ollama ps` shows 0 GPU layers, the container +fell back to CPU. Usual causes: stale VRAM allocation, missing `nvidia-container-toolkit`, +or a host driver too old for the container's CUDA version. + +### KVM/QEMU VM with full PCIe passthrough + +Full GPU passthrough makes the GPU invisible to the host while the VM owns it. Overhead +from the IOMMU translation layer and virtualized PCIe bus is ~2–5%. This is viable if +you need VM-level isolation (snapshotting, migration, separate kernel). Setup complexity +is non-trivial: BIOS IOMMU, IOMMU group management, VFIO driver binding. Once configured +it is stable. + +**Critical gotcha:** set the VM's CPU type to `host`. If left at the default +(`x86-64-v2-AES` / "QEMU Virtual CPU version 2.5+"), Ollama may silently disable GPU +support even when drivers appear correct. + +### LXC containers (Proxmox et al.) + +The sweet spot for isolation without overhead. Sub-1% performance difference from bare +metal because LXC shares the host kernel; GPU device files are bind-mounted into the +container. The tradeoff is weaker isolation (shared kernel) and the requirement that +host and container driver versions match. Not suitable if you need VM-level snapshots +or live migration. + +### Summary + +| Topology | GPU overhead | Isolation | Complexity | +|---|---|---|---| +| Native | 0% | None | Low | +| Docker (Linux) | ~1–2% | Process | Low | +| LXC | <1% | Namespace | Medium | +| KVM passthrough | 2–5% | Full VM | High | +| VM no passthrough | CPU-only | Full VM | Medium | + +Running Ollama in a VM will **not** significantly slow inference as long as GPU passthrough +is configured. Without passthrough (software rendering / CPU fallback) performance +collapses — that is what the user is rightly worried about. + +### Local vs. remote server + +| Factor | Local machine | Remote server | +|---|---|---| +| Latency | Near-zero | Network round-trip; cumulative in agent loops | +| Cost | Zero after hardware | Per-token or subscription | +| Privacy | 100% on-device | Data leaves the machine | +| Model size ceiling | VRAM-limited | No hard limit (671B+ feasible) | +| Offline use | Yes | No | +| Concurrency under load | Sequential by default | Scales horizontally | + +For agentic coding workflows making 20–50 tool calls per session, network latency +accumulates quickly. Local inference eliminates this. A practical hybrid pattern: +use the local GPU for routine coding loops; route only to a remote API for tasks +requiring a 70B+ model or very long context (>128K tokens). + +--- + +## 2. Harness selection + +The landscape in 2026 has settled into three categories: IDE plugins, terminal agents, +and chat UIs. + +### Continue.dev — recommended IDE plugin + +Open-source VS Code / JetBrains / Zed / Vim extension. Routes autocomplete, chat, and +refactoring commands to any configured LLM backend (Ollama, cloud APIs). The recommended +setup uses two models: a small FIM-capable model for inline autocomplete (Qwen2.5-Coder 7B) +and a larger model for chat/edit. Handles inline completions, multi-file edits, and +codebase-aware chat. No API key, no data leaving the machine. + +### Aider — recommended for git-native terminal workflows + +Terminal-based coding agent. Builds a codebase map before editing, makes changes +directly, and auto-commits to git with readable messages. Every change is one +`git revert` away. Supports 100+ languages; connects to any Ollama-served model +via the OpenAI-compatible API. Best for terminal-first developers who want +version-controlled agent interactions. Does not do inline autocomplete. + +### OpenCode — recommended for bot-bottle–style agent loops + +Terminal-based coding agent with 15 built-in tools (bash execution, file read/write/edit, +grep, glob, web fetch, MCP support) and connections to 75+ model providers including +local Ollama models. This is the closest open-source equivalent to a Claude Code–style +plan → tool-call → execute → observe → loop. Native Ollama integration. + +**Critical setup note:** Ollama defaults to a 4096-token context window, which is +completely insufficient for an agent loop carrying conversation history, tool schemas, +a system prompt, and code simultaneously. Configure at least 64K tokens explicitly +in the model's context settings. + +### Cline — agentic VS Code assistant + +VS Code extension that operates as an autonomous agent: plans, edits files, runs commands +in a loop, connects to Ollama's local endpoint. Compared to OpenCode it lives inside the +IDE rather than the terminal; compared to Continue.dev it is a full agent rather than a +plugin. Its system prompt overhead is higher (~7,000–10,000 tokens) than minimal harnesses. + +### Open WebUI / Jan / LM Studio — chat UIs, not coding harnesses + +These are browser or desktop chat interfaces useful for ad-hoc conversations (explaining +APIs, drafting documentation, exploring ideas) but without IDE integration, autocomplete, +or git integration. LM Studio offers the smoothest onboarding (visual model browser with +VRAM estimates). Jan is the most privacy-auditable (fully open-source, Apache 2.0, no +telemetry). Neither is a replacement for a coding harness. + +### Harness comparison + +| Harness | Type | Autocomplete | Agent loop | Ollama | Git integration | +|---|---|---|---|---|---| +| Continue.dev | IDE plugin | Yes (FIM) | Basic | Native | No | +| Aider | Terminal agent | No | Multi-turn | Via API | Auto-commit | +| OpenCode | Terminal agent | No | Full tools | Native | Via bash | +| Cline | IDE agent | No | Full tools | Via API | Via bash | +| Open WebUI | Chat UI | No | No | Native | No | +| Jan | Chat UI | No | No | Native | No | + +For a bot-bottle workflow (an isolated sandbox running an agentic loop with tool access), +**OpenCode** is the closest open-source match. For an IDE-first developer who wants +autocomplete + chat, **Continue.dev + Qwen2.5-Coder 7B** is the recommended pair. + +--- + +## 3. Model selection: RTX 3070 (8 GB VRAM / 30 GB RAM) + +### VRAM hard limits at Q4_K_M quantization + +| Model size | Approx. VRAM (Q4_K_M) | Fits in 8 GB? | Tokens/sec (RTX 3070) | +|---|---|---|---| +| 3–4B | 2.5–3.5 GB | Yes, with headroom | 60–90 | +| 7–8B | 5–6 GB | Yes | 35–55 | +| 12–14B | 7.5–9 GB | Edge / RAM offload | 8–18 | +| 22B+ | 14+ GB | No | — | + +The RTX 3070 has high memory bandwidth for its VRAM tier and consistently outperforms +the newer RTX 4060 Ti on token generation speed. Bandwidth matters more than raw compute +for inference. + +### Does Gemma 4 exist? + +Yes. Google released **Gemma 4** on 2 April 2026 (Apache 2.0). The family includes +E2B (2B), E4B (4B), a 26B MoE, and a 31B Dense. A 12B multimodal variant was announced +2026-06-04. The 31B scores 80.0% on LiveCodeBench v6 — a major jump from Gemma 3 27B +at 29.1%. However, only the E4B fits comfortably within 8 GB VRAM: + +| Variant | VRAM (approx.) | Fits? | +|---|---|---| +| Gemma 4 E2B | ~2 GB | Yes | +| Gemma 4 E4B | ~5 GB | Yes | +| Gemma 4 12B | ~8–9 GB (Q4) | Edge | +| Gemma 4 26B MoE | 14–18 GB | No | +| Gemma 4 31B Dense | ~20 GB | No | + +### Model-by-model evaluation + +**Qwen2.5-Coder 7B — primary recommendation** + +The strongest purpose-built coding model that fits fully within 8 GB VRAM. Leads +HumanEval among 7–8B-class models. Strong on Python, JavaScript, TypeScript. Has +FIM (fill-in-the-middle) support for inline autocomplete. 35–55 tok/sec on RTX 3070. + +``` +ollama pull qwen2.5-coder:7b +``` + +**Qwen2.5-Coder 14B — secondary, with RAM offloading** + +At Q4_K_M this needs ~8.7 GB, just over the 8 GB limit. With 30 GB system RAM, Ollama +automatically offloads the overflow layers to CPU. Performance drops to ~8–18 tok/sec +versus 35–55 tok/sec for the 7B fully in VRAM. Quality is noticeably better for complex +multi-file reasoning. Viable for chat-based coding tasks where quality matters more than +speed; too slow for live autocomplete. Keep context window at 8K tokens to minimize +VRAM pressure during offloaded inference. + +``` +ollama pull qwen2.5-coder:14b +``` + +**Gemma 4 E4B (~5 GB VRAM)** + +Fits comfortably with 3 GB to spare. Strong on reasoning, multimodal, and general-purpose +tasks. Less specialized for coding than Qwen2.5-Coder 7B. Good choice for one model that +covers coding + general reasoning + image analysis. The E4B outperforms Gemma 3 equivalents +significantly on coding benchmarks. + +``` +ollama pull gemma4:e4b +``` + +**Phi-4 Mini 3.8B (~3 GB VRAM)** + +Best reasoning-per-VRAM model; leaves ~5 GB free for other applications. Strong on math, +logic, and structured output. Good for agentic sub-tasks requiring tight reasoning. Not the +strongest at raw code synthesis but excellent for reasoning-heavy parts of a coding loop. +Viable as the autocomplete model in a two-model Continue.dev setup. + +``` +ollama pull phi4-mini +``` + +**DeepSeek-R1 8B (~5–6 GB VRAM)** + +Strong reasoning model for logic-heavy code (algorithms, correctness proofs). The full +DeepSeek-Coder-V2 (236B MoE) is impractical here — only the 8B distilled variants are +relevant. Outperforms Gemma 4 E4B on reasoning-heavy benchmarks; weaker on raw code +generation than Qwen2.5-Coder 7B. + +**Codestral — not viable at 8 GB** + +The top FIM autocomplete model on HumanEval-FIM benchmarks, but requires 12–16 GB VRAM +minimum. Not an option here. Worth revisiting if upgrading to a 12 GB+ card (RTX 4070 +Super or newer). + +### RAM offloading: does 30 GB help? + +Yes, meaningfully. Ollama automatically splits layers between GPU and system RAM when +VRAM is exceeded. With 30 GB RAM, models up to ~14B at Q4_K_M run with partial offloading. +The tradeoff is a 2–5× throughput penalty (8–18 tok/sec vs 35–55 tok/sec). Acceptable +for batch tasks (reviewing a PR, generating an algorithm); too slow for live autocomplete. + +### Recommended setup + +**Autocomplete (fast, always-in-VRAM):** `qwen2.5-coder:7b` +- Configure in Continue.dev as the tab-completion model +- FIM-capable; 35–55 tok/sec; fits with 2–3 GB VRAM to spare + +**Chat / agent loop (quality-first):** `qwen2.5-coder:14b` or `gemma4:e4b` +- 14B for strongest multi-file coding; expect 8–18 tok/sec with RAM offload +- Gemma 4 E4B if you want vision + general reasoning + coding in one model; ~60 tok/sec + +**Two-model Continue.dev config (lower VRAM pressure):** +`phi4-mini` (autocomplete) + `qwen2.5-coder:7b` (chat) — both fit simultaneously with +~1–2 GB to spare, keeping the OS and IDE from contending for VRAM. + +--- + +## Sources + +- [Ollama on Proxmox: GPU Passthrough for LXC and VM AI Workloads](https://linuxprofessional.ie/article.php?slug=ollama-proxmox-gpu-passthrough-lxc-vm) +- [Run Ollama with NVIDIA GPU in Proxmox VMs and LXC containers](https://www.virtualizationhowto.com/2025/05/run-ollama-with-nvidia-gpu-in-proxmox-vms-and-lxc-containers/) +- [Ollama Performance Tuning: Getting Maximum Speed from Local LLMs](https://dasroot.net/posts/2026/01/ollama-performance-tuning-gpu-acceleration-model-quantization/) +- [Pros and Cons: Containerized Ollama vs. Local Setup](https://alain-airom.medium.com/pros-and-cons-using-containerized-ollama-vs-local-setup-d9bdf225bbb5) +- [Best Local Coding Models Ranked: Every VRAM Tier (2026)](https://insiderllm.com/guides/best-local-coding-models-2026/) +- [Best Local LLMs for RTX 4060, RTX 3070, and RTX 5060](https://aiagentskit.com/blog/best-local-llms-rtx-4060-3070-5060/) +- [Best Local LLMs for 8GB VRAM: Real Hardware Benchmarks (2026)](https://localllm.in/blog/best-local-llms-8gb-vram-2025) +- [Self-Hosted AI Coding Agent: Ollama + Continue + Open WebUI Setup in 2026](https://www.web3aiblog.com/blog/self-hosted-ai-coding-agent-ollama-continue-2026) +- [Best Local-First AI Coding Tools 2026: 14 Compared](https://nimbalyst.com/blog/best-local-first-ai-coding-tools-2026/) +- [OpenCode + Ollama: Private Local AI Coding Agent Setup](https://lushbinary.com/blog/opencode-ollama-local-ai-coding-privacy-guide/) +- [Gemma 4: Google DeepMind](https://deepmind.google/models/gemma/gemma-4/) +- [Running Gemma 4 Locally: VRAM Requirements](https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/) +- [Phi-4 Mini vs. Gemma 3 vs. Qwen 2.5: Best SLM for Coding Tasks in 2026](https://botmonster.com/ai/phi-4-mini-vs-gemma-3-vs-qwen-25-best-slm-coding-2026/) +- [Qwen2.5-Coder 14B VRAM Requirements Guide](https://willitrunai.com/blog/qwen-2-5-coder-14b-vram-requirements) +- [Comparing AI Harnesses: OpenCode, Ollama, LM Studio, Claude Code, Open WebUI, and VS Code](https://jace.pro/blog/comparing-ai-harnesses-opencode-ollama-lm-studio-claude-code-open-webui-and-vs-code/)