docs(research): local ollama deployment, harness selection, and model sizing

2026-06-04 01:26:11 +00:00
parent 98e4e2b7dc
commit 8f05226a4a
1 changed files with 278 additions and 0 deletions
@@ -0,0 +1,278 @@
+# Local Ollama: Deployment Topology, Harness Selection, and Model Sizing
+
+Research notes on running Ollama locally for a bot-bottle coding agent workflow.
+Covers the native-vs-VM question, which harness integrates best with an agent loop,
+and which models make sense on a machine with an RTX 3070 (8 GB VRAM) and 30 GB of system RAM.
+
+---
+
+## 1. Deployment topology: native, container, or VM?
+
+The core question is whether running Ollama in a VM significantly degrades inference
+performance. The short answer: a full KVM/QEMU VM with GPU passthrough adds roughly
+2–5% overhead, Docker on Linux adds roughly 1–2%, and LXC containers add sub-1%. None
+of these are significant for interactive coding use.
+
+### Native (bare metal)
+
+Zero overhead, immediate GPU access, simplest setup. The right default for a solo
+developer doing inference on their own workstation.
+
+### Docker containers on Linux + NVIDIA
+
+With `nvidia-container-toolkit` and `--gpus all`, containerized Ollama runs at
+essentially native speed (~1–2% overhead on Linux). The dramatic exception is macOS,
+where Docker Desktop runs a Linux VM with no access to Apple's Metal/GPU — inference
+is 5–6× slower. On Linux/Windows with NVIDIA hardware, Docker is fine.
+
+Common pitfall: if `docker exec ollama ollama ps` reports `100% CPU` in its PROCESSOR column,
+the container fell back to CPU. Usual causes: stale VRAM allocation, missing
+`nvidia-container-toolkit`, or a host driver too old for the container's CUDA version.
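+
+A minimal sketch of starting and checking a GPU-backed container, assuming the container
+name `ollama` used above, the `ollama/ollama` image, and Ollama's default port 11434; the
+test prompt is arbitrary:
+
+```
+# Start Ollama with all NVIDIA GPUs exposed (needs nvidia-container-toolkit on the host)
+docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
+
+# The GPU should be listed from inside the container
+docker exec ollama nvidia-smi
+
+# Load a model (pulled on first use), then check whether it was placed on GPU or CPU
+docker exec ollama ollama run qwen2.5-coder:7b "hello"
+docker exec ollama ollama ps
+```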
+
+### KVM/QEMU VM with full PCIe passthrough
+
+Full GPU passthrough makes the GPU invisible to the host while the VM owns it. Overhead
+from the IOMMU translation layer and virtualized PCIe bus is ~2–5%. This is viable if
+you need VM-level isolation (snapshotting, migration, separate kernel). Setup complexity
+is non-trivial: BIOS IOMMU, IOMMU group management, VFIO driver binding. Once configured
+it is stable.
+
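+**Critical gotcha:** set the VM's CPU type to `host`. If left at the default
+(`x86-64-v2-AES` / "QEMU Virtual CPU version 2.5+"), the guest does not see the host's
+AVX instructions, and Ollama builds that require AVX for their GPU runners may silently
+disable GPU support even when drivers appear correct.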
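+
+On Proxmox this could look like the sketch below. The VM ID `100` is hypothetical, and the
+AVX check rests on an assumption not stated in the sources: that the generic CPU models hide
+the host's AVX flags and that this is what trips Ollama's GPU detection.
+
+```
+# On the Proxmox host: switch VM 100 (hypothetical ID) to the host CPU model
+qm set 100 --cpu host
+
+# A CPU model change needs a full shutdown and start, not a reboot from inside the guest
+qm shutdown 100
+qm start 100
+
+# Inside the guest: print which of the avx / avx2 flags the virtual CPU exposes
+grep -o -w -E 'avx2?' /proc/cpuinfo | sort -u
+```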
+
+### LXC containers (Proxmox et al.)
+
+The sweet spot for isolation without overhead. Sub-1% performance difference from bare
+metal because LXC shares the host kernel; GPU device files are bind-mounted into the
+container. The tradeoff is weaker isolation (shared kernel) and the requirement that
+host and container driver versions match. Not suitable if you need VM-level snapshots
+or live migration.
+
+### Summary
+
+| Topology | GPU overhead | Isolation | Complexity |
+|---|---|---|---|
+| Native | 0% | None | Low |
+| Docker (Linux) | ~1–2% | Process | Low |
+| LXC | <1% | Namespace | Medium |
+| KVM passthrough | ~2–5% | Full VM | High |
+| VM no passthrough | CPU-only | Full VM | Medium |
+
+Running Ollama in a VM will **not** significantly slow inference as long as GPU passthrough
+is configured. Without passthrough, inference falls back to the CPU and performance
+collapses; that is the scenario to avoid.
+
+### Local vs. remote server
+
+| Factor | Local machine | Remote server |
+|---|---|---|
+| Latency | Near-zero | Network round-trip; cumulative in agent loops |
+| Cost | Zero after hardware | Per-token or subscription |
+| Privacy | 100% on-device | Data leaves the machine |
+| Model size ceiling | VRAM-limited | No hard limit (671B+ feasible) |
+| Offline use | Yes | No |
+| Concurrency under load | Sequential by default | Scales horizontally |
+
+For agentic coding workflows making 20–50 tool calls per session, network latency
+accumulates quickly. Local inference eliminates this. A practical hybrid pattern:
+use the local GPU for routine coding loops and route to a remote API only for tasks
+requiring a 70B+ model or very long context (>128K tokens).
+
+---
+
+## 2. Harness selection
+
+The landscape in 2026 has settled into three categories: IDE plugins, terminal agents,
+and chat UIs.
+
+### Continue.dev — recommended IDE plugin
+
+Open-source VS Code / JetBrains / Zed / Vim extension. Routes autocomplete, chat, and
+refactoring commands to any configured LLM backend (Ollama, cloud APIs). The recommended
+setup uses two models: a small FIM-capable model for inline autocomplete (Qwen2.5-Coder 7B)
+and a larger model for chat/edit. Handles inline completions, multi-file edits, and
+codebase-aware chat. With Ollama as the backend, no API key is required and no data leaves
+the machine.
+
+### Aider — recommended for git-native terminal workflows
+
+Terminal-based coding agent. Builds a codebase map before editing, makes changes
+directly, and auto-commits to git with readable messages. Every change is one
+`git revert` away. Supports 100+ languages; connects to any Ollama-served model
+via the OpenAI-compatible API. Best for terminal-first developers who want
+version-controlled agent interactions. Does not do inline autocomplete.
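+
+A hedged example of pointing Aider at a local model, assuming Ollama is listening on its
+default port and that `qwen2.5-coder:7b` has already been pulled. The `ollama_chat/` prefix
+and `OLLAMA_API_BASE` variable follow Aider's Ollama instructions rather than these notes:
+
+```
+# Tell Aider where the local Ollama server is
+export OLLAMA_API_BASE=http://127.0.0.1:11434
+
+# Run from the root of the git repository you want Aider to edit
+aider --model ollama_chat/qwen2.5-coder:7b
+```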
+
+### OpenCode — recommended for bot-bottle–style agent loops
+
+Terminal-based coding agent with 15 built-in tools (bash execution, file read/write/edit,
+grep, glob, web fetch, MCP support) and connections to 75+ model providers including
+local Ollama models. This is the closest open-source equivalent to a Claude Code–style
+plan → tool-call → execute → observe → loop. Native Ollama integration.
+
+**Critical setup note:** Ollama defaults to a 4096-token context window, which is
+completely insufficient for an agent loop carrying conversation history, tool schemas,
+a system prompt, and code simultaneously. Configure at least 64K tokens explicitly
+in the model's context settings.
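+
+One way to do this is a derived model with a larger `num_ctx`. This is a sketch: the base
+model and the name `qwen2.5-coder-64k` are illustrative, and 65536 should be lowered if
+`ollama show` reports a smaller maximum context length for the base model.
+
+```
+# Inspect the base model, including its maximum context length
+ollama show qwen2.5-coder:7b
+
+# Write a Modelfile that overrides the default context window
+cat > Modelfile <<'EOF'
+FROM qwen2.5-coder:7b
+PARAMETER num_ctx 65536
+EOF
+
+# Build the derived model (hypothetical name) and point the harness at it
+ollama create qwen2.5-coder-64k -f Modelfile
+```
+
+A larger context window also enlarges the KV cache, so it competes with model weights for VRAM.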
+
+### Cline — agentic VS Code assistant
+
+VS Code extension that operates as an autonomous agent: plans, edits files, runs commands
+in a loop, connects to Ollama's local endpoint. Compared to OpenCode it lives inside the
+IDE rather than the terminal; compared to Continue.dev it is a full agent rather than a
+plugin. Its system prompt overhead (~7,000–10,000 tokens) is higher than that of minimal harnesses.
+
+### Open WebUI / Jan / LM Studio — chat UIs, not coding harnesses
+
+These are browser or desktop chat interfaces useful for ad-hoc conversations (explaining
+APIs, drafting documentation, exploring ideas) but without IDE integration, autocomplete,
+or git integration. LM Studio offers the smoothest onboarding (visual model browser with
+VRAM estimates). Jan is the most privacy-auditable (fully open-source, Apache 2.0, no
+telemetry). None of the three is a replacement for a coding harness.
+
+### Harness comparison
+
+| Harness | Type | Autocomplete | Agent loop | Ollama | Git integration |
+|---|---|---|---|---|---|
+| Continue.dev | IDE plugin | Yes (FIM) | Basic | Native | No |
+| Aider | Terminal agent | No | Multi-turn | Via API | Auto-commit |
+| OpenCode | Terminal agent | No | Full tools | Native | Via bash |
+| Cline | IDE agent | No | Full tools | Via API | Via bash |
+| Open WebUI | Chat UI | No | No | Native | No |
+| Jan | Chat UI | No | No | Native | No |
+
+For a bot-bottle workflow (an isolated sandbox running an agentic loop with tool access),
+**OpenCode** is the closest open-source match. For an IDE-first developer who wants
+autocomplete + chat, **Continue.dev + Qwen2.5-Coder 7B** is the recommended pair.
+
+---
+
+## 3. Model selection: RTX 3070 (8 GB VRAM / 30 GB RAM)
+
+### VRAM hard limits at Q4_K_M quantization
+
+| Model size | Approx. VRAM (Q4_K_M) | Fits in 8 GB? | Tokens/sec (RTX 3070) |
+|---|---|---|---|
+| 3–4B | 2.5–3.5 GB | Yes, with headroom | 60–90 |
+| 7–8B | 5–6 GB | Yes | 35–55 |
+| 12–14B | 7.5–9 GB | Edge / RAM offload | 8–18 |
+| 22B+ | 14+ GB | No | — |
+
+The RTX 3070 has high memory bandwidth for its VRAM tier (448 GB/s, versus 288 GB/s on
+the newer RTX 4060 Ti) and consistently outperforms that card on token generation speed.
+Token generation is memory-bound, so bandwidth matters more than raw compute.
+
+### Does Gemma 4 exist?
+
+Yes. Google released **Gemma 4** on 2 April 2026 (Apache 2.0). The family includes
+E2B (2B), E4B (4B), a 26B MoE, and a 31B Dense. A 12B multimodal variant was announced
+on 4 June 2026. The 31B scores 80.0% on LiveCodeBench v6, a major jump from Gemma 3 27B
+at 29.1%. However, only the E2B and E4B fit comfortably within 8 GB VRAM:
+
+| Variant | VRAM (approx.) | Fits? |
+|---|---|---|
+| Gemma 4 E2B | ~2 GB | Yes |
+| Gemma 4 E4B | ~5 GB | Yes |
+| Gemma 4 12B | ~8–9 GB (Q4) | Edge |
+| Gemma 4 26B MoE | 14–18 GB | No |
+| Gemma 4 31B Dense | ~20 GB | No |
+
+### Model-by-model evaluation
+
+**Qwen2.5-Coder 7B — primary recommendation**
+
+The strongest purpose-built coding model that fits fully within 8 GB VRAM. Leads
+HumanEval among 7–8B-class models. Strong on Python, JavaScript, TypeScript. Has
+FIM (fill-in-the-middle) support for inline autocomplete. 35–55 tok/sec on RTX 3070.
+
+```
+ollama pull qwen2.5-coder:7b
+```
+
+**Qwen2.5-Coder 14B — secondary, with RAM offloading**
+
+At Q4_K_M this needs ~8.7 GB, just over the 8 GB limit. With 30 GB system RAM, Ollama
+automatically keeps the overflow layers in system RAM and runs them on the CPU. Throughput
+drops to ~8–18 tok/sec versus 35–55 tok/sec for the 7B fully in VRAM. Quality is noticeably
+better for complex
+multi-file reasoning. Viable for chat-based coding tasks where quality matters more than
+speed; too slow for live autocomplete. Keep context window at 8K tokens to minimize
+VRAM pressure during offloaded inference.
+
+```
+ollama pull qwen2.5-coder:14b
+```
+
+**Gemma 4 E4B (~5 GB VRAM)**
+
+Fits comfortably with 3 GB to spare. Strong on reasoning, multimodal, and general-purpose
+tasks. Less specialized for coding than Qwen2.5-Coder 7B. Good choice for one model that
+covers coding + general reasoning + image analysis. The E4B outperforms Gemma 3 equivalents
+significantly on coding benchmarks.
+
+```
+ollama pull gemma4:e4b
+```
+
+**Phi-4 Mini 3.8B (~3 GB VRAM)**
+
+Best reasoning-per-VRAM model; leaves ~5 GB free for other applications. Strong on math,
+logic, and structured output. Good for agentic sub-tasks requiring tight reasoning. Not the
+strongest at raw code synthesis but excellent for reasoning-heavy parts of a coding loop.
+Viable as the autocomplete model in a two-model Continue.dev setup.
+
+```
+ollama pull phi4-mini
+```
+
+**DeepSeek-R1 8B (~5–6 GB VRAM)**
+
+Strong reasoning model for logic-heavy code (algorithms, correctness proofs). The full
+DeepSeek-R1 (671B MoE) is impractical here; only the 8B distilled variant is
+relevant. Outperforms Gemma 4 E4B on reasoning-heavy benchmarks; weaker on raw code
+generation than Qwen2.5-Coder 7B.
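+
+For parity with the other entries, the pull command would presumably be the following; the
+`deepseek-r1:8b` tag is assumed from the Ollama model library, not taken from the sources below:
+
+```
+ollama pull deepseek-r1:8b
+```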
+
+**Codestral — not viable at 8 GB**
+
+The top FIM autocomplete model on HumanEval-FIM benchmarks, but requires 12–16 GB VRAM
+minimum. Not an option here. Worth revisiting if upgrading to a 12 GB+ card (RTX 4070
+Super or newer).
+
+### RAM offloading: does 30 GB help?
+
+Yes, meaningfully. Ollama automatically splits layers between GPU and system RAM when
+VRAM is exceeded. With 30 GB RAM, models up to ~14B at Q4_K_M run with partial offloading.
+The tradeoff is a 2–5× throughput penalty (8–18 tok/sec for the offloaded 14B vs 35–55 tok/sec
+for the 7B fully in VRAM). Acceptable for batch tasks (reviewing a PR, generating an
+algorithm); too slow for live autocomplete.
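+
+To see how a model is actually split on a given machine, something like the following can
+help. It is a sketch: the prompt is arbitrary, and the exact output layout depends on the
+Ollama and driver versions.
+
+```
+# Generate once with timing statistics; the eval rate is reported in tokens/s
+ollama run --verbose qwen2.5-coder:14b "Write a binary search in Python."
+
+# Show loaded models and how each is divided between CPU and GPU
+ollama ps
+
+# Cross-check VRAM actually in use on the card
+nvidia-smi --query-gpu=memory.used,memory.total --format=csv
+```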
+
+### Recommended setup
+
+**Autocomplete (fast, always-in-VRAM):** `qwen2.5-coder:7b`
+- Configure in Continue.dev as the tab-completion model
+- FIM-capable; 35–55 tok/sec; fits with 2–3 GB VRAM to spare
+
+**Chat / agent loop (quality-first):** `qwen2.5-coder:14b` or `gemma4:e4b`
+- 14B for strongest multi-file coding; expect 8–18 tok/sec with RAM offload
+- Gemma 4 E4B if you want vision + general reasoning + coding in one model; ~60 tok/sec
+
+**Two-model Continue.dev config (lower VRAM pressure):**
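+`phi4-mini` (autocomplete) + `qwen2.5-coder:7b` (chat). By the figures above the pair needs
+roughly 8–9 GB (~3 GB + 5–6 GB), so it sits at the edge of 8 GB rather than leaving headroom;
+keep contexts short, and expect some contention with the OS and IDE for VRAM.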
+
+---
+
+## Sources
+
+- [Ollama on Proxmox: GPU Passthrough for LXC and VM AI Workloads](https://linuxprofessional.ie/article.php?slug=ollama-proxmox-gpu-passthrough-lxc-vm)
+- [Run Ollama with NVIDIA GPU in Proxmox VMs and LXC containers](https://www.virtualizationhowto.com/2025/05/run-ollama-with-nvidia-gpu-in-proxmox-vms-and-lxc-containers/)
+- [Ollama Performance Tuning: Getting Maximum Speed from Local LLMs](https://dasroot.net/posts/2026/01/ollama-performance-tuning-gpu-acceleration-model-quantization/)
+- [Pros and Cons: Containerized Ollama vs. Local Setup](https://alain-airom.medium.com/pros-and-cons-using-containerized-ollama-vs-local-setup-d9bdf225bbb5)
+- [Best Local Coding Models Ranked: Every VRAM Tier (2026)](https://insiderllm.com/guides/best-local-coding-models-2026/)
+- [Best Local LLMs for RTX 4060, RTX 3070, and RTX 5060](https://aiagentskit.com/blog/best-local-llms-rtx-4060-3070-5060/)
+- [Best Local LLMs for 8GB VRAM: Real Hardware Benchmarks (2026)](https://localllm.in/blog/best-local-llms-8gb-vram-2025)
+- [Self-Hosted AI Coding Agent: Ollama + Continue + Open WebUI Setup in 2026](https://www.web3aiblog.com/blog/self-hosted-ai-coding-agent-ollama-continue-2026)
+- [Best Local-First AI Coding Tools 2026: 14 Compared](https://nimbalyst.com/blog/best-local-first-ai-coding-tools-2026/)
+- [OpenCode + Ollama: Private Local AI Coding Agent Setup](https://lushbinary.com/blog/opencode-ollama-local-ai-coding-privacy-guide/)
+- [Gemma 4: Google DeepMind](https://deepmind.google/models/gemma/gemma-4/)
+- [Running Gemma 4 Locally: VRAM Requirements](https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/)
+- [Phi-4 Mini vs. Gemma 3 vs. Qwen 2.5: Best SLM for Coding Tasks in 2026](https://botmonster.com/ai/phi-4-mini-vs-gemma-3-vs-qwen-25-best-slm-coding-2026/)
+- [Qwen2.5-Coder 14B VRAM Requirements Guide](https://willitrunai.com/blog/qwen-2-5-coder-14b-vram-requirements)
+- [Comparing AI Harnesses: OpenCode, Ollama, LM Studio, Claude Code, Open WebUI, and VS Code](https://jace.pro/blog/comparing-ai-harnesses-opencode-ollama-lm-studio-claude-code-open-webui-and-vs-code/)