Ollama swap to llama.cpp/llama-server
So I'm a newb in certain aspects but not in others. I'm currently running an AI stack on my Unraid server:
CPU: AMD Threadripper 3960X (24c/48t)
Motherboard: Gigabyte TRX40 AORUS PRO WIFI
RAM: 256GB DDR4-3200 G.Skill Trident Z
GPU: Nvidia Titan Xp Collector’s Edition (single GPU)
10GbE LAN
My stack is Ollama, SearXNG, and AnythingLLM. I initially went with AnythingLLM because of the ease of its built-ins. I'm running gemma4:26b MoE as my primary model and getting about 20-25 tok/s, which is slow but manageable.
I mostly use it for writing and the occasional bit of vibe coding.
Recently I've been looking at TurboQuant, which Ollama supports but doesn't expose, and potentially MemPalace, again for creative writing. I've also been thinking about an Exo cluster, as I've got several machines just sitting idle that I could throw into the mix.
My gut feeling is that moving to llama.cpp would be better.
Am I missing something? Am I wrong in my thinking? There's just so much new info to digest and I'm a bit overwhelmed.
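For anyone curious what the swap would actually look like, here's a rough sketch of launching llama.cpp's llama-server and pointing AnythingLLM at it as a generic OpenAI-compatible provider. The model filename is a placeholder, and the flag values (GPU layers, context size) are guesses you'd tune for a single Titan Xp, not tested settings:

```shell
# Sketch only: start llama-server with a GGUF model (path is a placeholder).
# -ngl offloads layers to the GPU; lower it if the 12GB Titan Xp runs out of VRAM.
# -c sets the context window; larger contexts eat more VRAM.
./llama-server \
  -m ./models/your-model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192
```

llama-server exposes an OpenAI-compatible API at `http://<server-ip>:8080/v1`, so in AnythingLLM you'd select a generic OpenAI-compatible provider and enter that base URL instead of the Ollama endpoint.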