Dedicated EPYC servers for Ollama — real CPU inference benchmarks on CCX33 through CCX63
I run a managed Ollama deployment service (NestAI) and just shipped dedicated AMD EPYC CCX tiers. Sharing what each tier actually gives you for inference, since I couldn't find good benchmarks for Hetzner CCX + Ollama anywhere.
Hardware is Hetzner CCX (EPYC Milan/Genoa dedicated vCPU):
CCX33 (8 dedicated vCPU, 32GB RAM) — +$29/mo:
- Mistral 7B: ~12-15 tok/s
- DeepSeek R1 14B: ~5-7 tok/s
- Qwen 2.5 32B Q4: fits but slow, ~3-4 tok/s
CCX43 (16 dedicated vCPU, 64GB RAM) — +$59/mo:
- Mistral 7B: ~15-18 tok/s
- Phi-4 14B: ~7-10 tok/s
- DeepSeek R1 32B: ~5-7 tok/s
- Llama 3.3 70B Q4: fits, ~2-3 tok/s
CCX53 (32 dedicated vCPU, 128GB RAM) — +$119/mo:
- 7B models: ~20+ tok/s
- 32B models: ~8-10 tok/s
- 70B models: ~3-5 tok/s
- Can load multiple models simultaneously
CCX63 (48 dedicated vCPU, 192GB RAM) — +$179/mo:
- Can run 70B + 7B simultaneously
- Best case 70B: ~4-6 tok/s
- Enough RAM for multiple 32B models loaded at once
All tiers run the latest Ollama with OLLAMA_FLASH_ATTENTION=1, OLLAMA_KEEP_ALIVE=-1, and Q4_K_M quantization by default.
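For reference, those settings are just environment variables on the Ollama server process. A minimal sketch assuming a plain `ollama serve` setup; the OLLAMA_MAX_LOADED_MODELS line is my addition for the multi-model tiers (CCX53/63), not part of the benchmark config above:

```shell
# Set on the server process, e.g. in a systemd drop-in
# (/etc/systemd/system/ollama.service.d/override.conf, [Service] Environment=...)
# or exported directly before launching:
export OLLAMA_FLASH_ATTENTION=1    # enable flash attention kernels
export OLLAMA_KEEP_ALIVE=-1        # keep models resident in RAM indefinitely
export OLLAMA_MAX_LOADED_MODELS=2  # allow two models loaded at once (needs the RAM for both)
ollama serve
```

With KEEP_ALIVE=-1 you pay the model load cost once; without it, Ollama unloads after 5 minutes idle and cold-loads a 70B from disk on the next request.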
The big difference from shared vCPU isn't peak speed — it's consistency. Shared CX43 can spike to 15 tok/s at 3 AM and drop to 6 tok/s at peak hours. Dedicated stays flat.
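If you want to verify the consistency claim yourself: Ollama's non-streaming `/api/generate` response includes `eval_count` and `eval_duration` (nanoseconds), which is all you need to compute decode tok/s. A sketch with hypothetical numbers; a real response would come from something like `curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"...","stream":false}' > resp.json`:

```shell
# Hypothetical sample of the timing fields in a /api/generate response
cat > resp.json <<'EOF'
{"eval_count": 256, "eval_duration": 16000000000, "prompt_eval_count": 12, "prompt_eval_duration": 900000000}
EOF

# Decode speed = generated tokens / generation time (eval_duration is in ns)
python3 - <<'EOF'
import json
r = json.load(open("resp.json"))
print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.1f} tok/s")
EOF
```

Run that in a cron loop over 24 hours and the shared-vs-dedicated variance shows up immediately.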
Still CPU, not GPU. If someone asks "why not just get an A4000" — you're absolutely right for raw performance. But for teams that need data residency guarantees (EU/GDPR, Singapore/PDPA) and can't ship data to a GPU cloud provider, dedicated CPU in a specific Hetzner datacenter is the tradeoff.
These tiers are add-ons on top of NestAI's managed plans ($39-299/mo). The managed part handles provisioning, Open WebUI, SSL, monitoring, team auth.