u/chiruwonder

▲ 1 r/ollama+1 crossposts

Dedicated EPYC servers for Ollama — real CPU inference benchmarks on CCX33 through CCX63

Running a managed Ollama deployment service (NestAI). Just shipped dedicated AMD EPYC CCX tiers. Sharing what each tier actually gives you for inference since I couldn't find good benchmarks for Hetzner CCX + Ollama anywhere.

Hardware is Hetzner CCX (EPYC Milan/Genoa dedicated vCPU):

CCX33 (8 dedicated vCPU, 32GB RAM) — +$29/mo:

  • Mistral 7B: ~12-15 tok/s
  • DeepSeek R1 14B: ~5-7 tok/s
  • Qwen 2.5 32B Q4: fits but slow, ~3-4 tok/s

CCX43 (16 dedicated vCPU, 64GB RAM) — +$59/mo:

  • Mistral 7B: ~15-18 tok/s
  • Phi-4 14B: ~7-10 tok/s
  • DeepSeek R1 32B: ~5-7 tok/s
  • Llama 3.3 70B Q4: fits, ~2-3 tok/s

CCX53 (32 dedicated vCPU, 128GB RAM) — +$119/mo:

  • 7B models: ~20+ tok/s
  • 32B models: ~8-10 tok/s
  • 70B models: ~3-5 tok/s
  • Can load multiple models simultaneously

CCX63 (48 dedicated vCPU, 192GB RAM) — +$179/mo:

  • Can run 70B + 7B simultaneously
  • Best case 70B: ~4-6 tok/s
  • Enough RAM for multiple 32B models loaded at once

All tiers run the latest Ollama with OLLAMA_FLASH_ATTENTION=1, OLLAMA_KEEP_ALIVE=-1, and Q4_K_M quantization by default.
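For anyone replicating this: a minimal sketch of setting those flags when Ollama runs under systemd, which is how the official install script registers it. The drop-in path is the standard systemd mechanism; adapt if you run Ollama some other way.

```shell
#!/bin/sh
# Set Ollama env vars via a systemd drop-in so they survive restarts
# and upgrades. Assumes the stock "ollama" service from the official
# install script.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=-1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```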

The big difference from shared vCPU isn't peak speed — it's consistency. Shared CX43 can spike to 15 tok/s at 3 AM and drop to 6 tok/s at peak hours. Dedicated stays flat.

Still CPU, not GPU. If someone asks "why not just get an A4000?" — for raw performance, you're absolutely right. But for teams that need data residency guarantees (EU/GDPR, Singapore/PDPA) and can't ship data to a GPU cloud provider, dedicated CPU in a specific Hetzner datacenter is the tradeoff.

These tiers are add-ons on top of NestAI's managed plans ($39-299/mo). The managed part handles provisioning, Open WebUI, SSL, monitoring, team auth.

reddit.com
u/chiruwonder — 18 hours ago
▲ 0 r/ollama+1 crossposts

Made major changes to NestAI based on your feedback — honest benchmarks, new models, dedicated resources

Hey everyone,

First off, genuinely thankful for the responses on my last post. Some of you absolutely ripped apart my claims (deserved), some gave really solid technical advice, and a few even DMed me to chat about hosting infrastructure. All of it helped.

I want to share what I actually changed based on your feedback. You guys took the time to test and point things out, so it's only fair. Here's where NestAI stands now.

The big one --- I was wrong about the 70B speed claim

One of you tested Llama 3.3 70B on a Zen 5 12-core 64GB setup, got under 2 t/s, and called out my "15-20 tokens/sec" claim. You were right: that number was completely wrong. Honestly, I don't even know how it ended up on the site; probably some optimistic early test I did with a smaller prompt. No excuse.

Real numbers from actual benchmarks on similar AMD EPYC hardware (which is what Hetzner CCX servers use):

  • Llama 70B Q4_K_M on dual EPYC 7282 (32 cores) → ~2.87 t/s
  • Llama 70B F16 on EPYC Turin (16 cores) → ~2.34 t/s
  • Our CCX43 has 16 vCPU EPYC Milan at 2.4 GHz → realistically 2-3 t/s for 70B

So I've updated every single page on NestAI to show honest speeds. Every model now has a realistic tokens/sec estimate for single-user CPU inference. No more made-up numbers.

The "finance/healthcare on CPU with Ollama" criticism

Partially a fair point. Running 70B on CPU-only for real-time chat in a healthcare setting? Yeah, that's a stretch at 2-3 t/s. But here's the thing: a legal firm running Mistral 7B at 10-15 t/s for contract review actually works. The value isn't "GPT-4 speed on CPU", it's "your client data never leaves your server." That's what these teams are paying for.

But I've repositioned accordingly:

  • 70B model description now says "best for async tasks, document analysis, and batch processing, not real-time chat"
  • Added a concurrency warning everywhere: CPU inference is single-user optimised, so multiple users queue up

New model lineup (this was overdue)

Removed all the outdated stuff. The model library was still showing Gemma 2, Code Llama, Phi-3 as recommended. Embarrassing in 2026.

New top picks:

  • Qwen 3.5 4B — now the default recommendation everywhere. Multimodal (text + vision), 128K context, outperforms most 7B models. This is genuinely impressive for its size
  • Qwen 3.5 9B — best all-rounder, also multimodal
  • DeepSeek R1 7B — still the best reasoning model at 7B, kept this
  • GPT-OSS 20B — yeah OpenAI's open-weight model. Added to the heavy tier
  • Phi-4 Reasoning 14B — Microsoft's math/logic variant, added
  • Qwen 2.5 Coder 32B — replaced the old DeepSeek Coder V2 33B in flagship tier

Trial users now get Qwen 3.5 4B instead of Phi-3. Way better first experience.

Dedicated resources (now available on any plan)

Previously the "Ultra" plan was only for 70B models. Redesigned this completely:

  • Any plan (Solo/Team/Business) can now add dedicated resources as an optional toggle
  • 4 tiers: Starter (8 vCPU, 32GB), Pro (16 vCPU, 64GB), Power (32 vCPU, 128GB), Enterprise (48 vCPU, 192GB)
  • All AMD EPYC dedicated vCPU -- no shared CPU, no noisy neighbors
  • 70B model forces dedicated ON (you can't run it on standard anyway)
  • Someone running Mistral 7B who just wants guaranteed performance can add Starter for ₹2,499/mo extra

Onboarding flow now shows this as Step 2 after model selection: a clean toggle, pick your tier if you want it, skip if you don't.

Current setup flow

  1. Pick your model (Qwen 3.5 4B recommended)
  2. Dedicated resources? (optional toggle - forced ON for 70B)
  3. Name your team
  4. Pick your subdomain (yourteam.nestai.chirai.dev)
  5. Review & pay → server deploys in 25-33 mins

Standard server is CX43: 8 vCPU, 16GB RAM, 160GB SSD. Runs all 7B models comfortably at 10-15 t/s for single user.

Honest speed reference (CPU-only, single user)

Model             Size     RAM needed   Speed (single user)
Qwen 3.5 4B       2.8 GB   ~4 GB        ~15-20 t/s
Llama 3.2 3B      2.0 GB   ~3 GB        ~15-25 t/s
Phi-4 Mini        2.5 GB   ~4 GB        ~15-20 t/s
DeepSeek R1 7B    4.7 GB   ~6 GB        ~10-14 t/s
Mistral 7B        4.1 GB   ~5.5 GB      ~10-15 t/s
Phi-4 14B         8.9 GB   ~11 GB       ~4-6 t/s
DeepSeek R1 32B   19 GB    ~22 GB       ~4-6 t/s (dedicated)
Llama 3.3 70B     43 GB    ~48 GB       ~2-3 t/s (dedicated)

These numbers are for a single user. Add 2-3 concurrent users and divide roughly by that number. CPU inference serialises requests; this is an Ollama/llama.cpp limitation, not something I can fix.

Re: llama.cpp vs Ollama

Some of you suggested ditching Ollama for raw llama.cpp. I looked into this properly. Ollama literally runs llama.cpp under the hood; it's a wrapper that handles model management, API serving, and the OpenAI-compatible endpoints that Open WebUI needs. The performance difference is maybe 5-10% at best. For a managed product whose users are legal firms and startups (not ML engineers), Ollama is the right abstraction. The users want a chat interface that works, not a C++ compile toolchain.

That said, we do use the important llama.cpp optimizations through Ollama's config: OLLAMA_FLASH_ATTENTION=1, OLLAMA_KEEP_ALIVE=-1 (the model never unloads from RAM), and a warmup cron every 2 minutes to prevent cold starts.
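The warmup itself can be a one-line cron entry that pings the model. A sketch, assuming the default Ollama port; the model name and prompt are placeholders, not our actual config:

```shell
# /etc/cron.d/ollama-warmup -- touch the model every 2 minutes so the
# first real request never pays a cold-start penalty.
# "mistral" and the prompt are illustrative; num_predict:1 keeps the
# warmup request nearly free.
*/2 * * * * root curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"mistral","prompt":"ok","options":{"num_predict":1}}' >/dev/null
```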

Re: the mobile menu bug

The person who pointed this out on Brave mobile --- fixed. The menu overlay wasn't fully opaque and the hero canvas was bleeding through. Three-line fix: fully opaque background, higher z-index, body scroll lock when menu is open. Thanks for catching it.

What's next

Looking into GPU server options for a premium tier; that would give a genuine 30+ t/s on 70B, which actually makes real-time chat viable. Hetzner doesn't offer GPU cloud, but I'm exploring RunPod/Vast.ai integration. No timeline yet, but it's on the roadmap.

Also planning a transparent benchmarks page on the site showing real t/s numbers for each model on each server tier. Nobody in this space publishes honest numbers like this, and I think it would build trust.

Site is live at nestai.chirai.dev if you want to check the changes. Happy to answer anything, the feedback from this community has genuinely made the product better.

Cheers from India 🇮🇳

u/chiruwonder — 4 days ago
▲ 44 r/hetzner

6 months running production Ollama workloads on Hetzner — what I learned about server selection and provisioning

Been running an AI hosting service on Hetzner for 6 months. Here's what I actually learned about running LLMs on their infrastructure.

Server selection for model sizes:

  • 7B models (Mistral, Llama 3.1 8B, Phi-3, DeepSeek R1 7B) → CX43 (16GB RAM) works fine
  • 14B models → CCX33 (32GB RAM) minimum
  • 32B models → CCX33 with 8GB swap, tight but workable
  • 70B models → CCX43 (64GB RAM), no swap needed

CPX series works too but I've had more placement failures. CX and CCX series are more reliably available across regions.

The region availability problem:

NBG1 fills up. Built fallback logic that tries NBG1 → FSN1 → HEL1 → ASH → SIN in order when a server type isn't available in the primary region. The 412 resource_unavailable error is your friend — it's retryable, unlike 400s which mean you did something wrong.
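The fallback can be sketched as a small shell loop over the Hetzner Cloud create-server endpoint. This is illustrative, not NestAI's actual provisioning code: `create_server`, the server name "demo", and the server type are assumptions; only the region order and the 201/412 semantics come from the post.

```shell
#!/bin/sh
# Region fallback sketch against the Hetzner Cloud API ($HCLOUD_TOKEN
# must be set). Tries each location in order until one has capacity.
REGIONS="nbg1 fsn1 hel1 ash sin"

# Attempt a create in one location; print only the HTTP status code.
create_server() {
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST 'https://api.hetzner.cloud/v1/servers' \
    -H "Authorization: Bearer $HCLOUD_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"name\":\"demo\",\"server_type\":\"ccx33\",\"image\":\"ubuntu-24.04\",\"location\":\"$1\"}"
}

# 201 = created, 412 = resource_unavailable (retryable, try next region),
# anything else = our request is wrong, so stop instead of retrying.
provision_with_fallback() {
  for region in $REGIONS; do
    status=$(create_server "$region")
    case "$status" in
      201) echo "$region"; return 0 ;;
      412) continue ;;
      *)   echo "fatal: HTTP $status in $region" >&2; return 1 ;;
    esac
  done
  echo "no capacity in any region" >&2
  return 1
}
```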

Swap is non-negotiable:

8GB swap, swappiness=80. Without it, 7B models on 16GB RAM servers OOM on edge cases — specifically when Open WebUI and Ollama are both loaded and you hit a large context. With swap it degrades gracefully instead of crashing.
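For reference, a one-time setup sketch for a fresh Ubuntu VM using the size and swappiness from this post (needs root; adjust for your RAM):

```shell
#!/bin/sh
# Create an 8GB swapfile and set swappiness=80, persisted across reboots.
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab        # persist the swapfile
sysctl vm.swappiness=80                                # apply immediately
echo 'vm.swappiness = 80' > /etc/sysctl.d/99-llm-swap.conf  # persist the setting
```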

Cloud-init reliability:

The script runs at boot and takes 15-25 minutes. Things I learned:

  • apt-get update needs retry logic — it fails on brand new VMs occasionally
  • Docker Compose images should be pulled explicitly before up -d — pulling during up times out on large images
  • Callbacks to your API should use file-based JSON, not inline shell-escaped strings. I learned this the hard way when special characters in generated passwords broke JSON parsing
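The three lessons above look roughly like this inside the cloud-init script. A hedged sketch: the retry count, compose file path, and `$CALLBACK_URL`/`$ADMIN_PASSWORD` variables are illustrative placeholders, not the actual NestAI script.

```shell
#!/bin/sh
# 1. apt-get update with retries -- fresh VMs occasionally race the mirrors.
for attempt in 1 2 3 4 5; do
  apt-get update && break
  sleep $((attempt * 10))
done

# 2. Pull images explicitly first, so `up -d` doesn't time out mid-pull.
docker compose -f /opt/stack/docker-compose.yml pull
docker compose -f /opt/stack/docker-compose.yml up -d

# 3. Build the callback payload with jq and POST it from a file.
#    jq handles quoting, so generated passwords can't break the JSON.
jq -n --arg pw "$ADMIN_PASSWORD" --arg host "$(hostname)" \
  '{status: "ready", host: $host, admin_password: $pw}' > /tmp/callback.json
curl -s -X POST -H 'Content-Type: application/json' \
  --data @/tmp/callback.json "$CALLBACK_URL"
```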

Hetzner-specific notes:

  • SSH key IDs and Network IDs need to be integers in the API, not strings
  • The labels field on servers is genuinely useful for filtering — label_selector: 'managed_by=nestai' lets you list all managed servers instantly
  • IPv4 is included in the price now, no longer extra
  • Firewall rules via UFW are fine, no need to use Hetzner's firewall API unless you want centralised management
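The label trick from the list above is a single API call. Assuming your servers were created with `labels: {"managed_by": "nestai"}` and `$HCLOUD_TOKEN` is set (jq here is just for readable output):

```shell
#!/bin/sh
# List every server carrying the managed_by=nestai label.
# %3D is the URL-encoded "=" inside the label selector.
curl -s 'https://api.hetzner.cloud/v1/servers?label_selector=managed_by%3Dnestai' \
  -H "Authorization: Bearer $HCLOUD_TOKEN" | jq -r '.servers[].name'
```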

What I'm running in production:

Vercel frontend → Express backend on a Hetzner CX22 → Hetzner VMs per customer (provisioned on demand, deleted on cancel).

Happy to answer anything about the architecture. Service is at nestai.chirai.dev if you want to see the end result.

u/chiruwonder — 6 days ago
▲ 1 r/SelfHostedAI+1 crossposts

Built an OpenAI-compatible API on top of private Ollama — no rate limits, data never leaves your server

[removed]

u/[deleted] — 8 days ago