u/CrowKing63 — reddlx

I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Gemma4 24B A4B IQ4 NL model I finally reached 25.9 t/s. I even connected it to OpenCode and tried asking questions from my codebase, and it seems usable at this level.

llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL
 --fit on
 --fit-ctx 128000
 --fit-target 256
 -np 1
 -fa on
 --no-mmap
 --mlock
 --threads 8
 -b 512 -ub 256
 -ctk q8_0 -ctv q8_0
 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1

This is the result of using it this way.

Increase -b and -ub any further, it won't even load. Are there any unnecessary arguments or arguments that could be optimized?

Thanks.