Mac Mini M4 16GB (hermes agent) - Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf
Hey guys, I've been running Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf locally on my Mac mini M4 (16GB). I'd like some input on how to tweak this further to improve tokens per second (tp/s). My setup is as above, and my current config is below.
--ctx-size 65536 (the minimum context size Hermes agent accepts)
--n-gpu-layers 0
--mmap
--flash-attn on -ctk q8_0 -ctv q8_0
--parallel 1
--fit on
--threads 8
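For anyone who wants to reproduce this, here is a rough sketch of how those flags fit together on one command line. It assumes the llama.cpp `llama-server` binary and a model file in the current directory; both the binary name and the `MODEL` path are assumptions, so adjust them to your install. The script only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Assumed model location: adjust to wherever the GGUF file actually lives.
MODEL="./Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf"

# The flags from the list above, unchanged.
ARGS="--model $MODEL --ctx-size 65536 --n-gpu-layers 0 --mmap --flash-attn on -ctk q8_0 -ctv q8_0 --parallel 1 --fit on --threads 8"

# Print the full command for review instead of launching it.
# To actually start the server (assuming llama-server is on PATH):
#   llama-server $ARGS
echo "llama-server $ARGS"
```

Keeping the flags in one variable makes it easier to change a single value (for example `--n-gpu-layers`) between test runs and compare tp/s.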
I've tried CPU/GPU offloading with -cmoe and --n-gpu-layers 40, 30, 20, and 15, but every attempt failed with an HTTP 500 compute error. I probably did something wrong or misunderstood the setup.
Average speed without any CPU/GPU offloading is around 6-8 tp/s. Honestly, that's pretty decent for a locally hosted LLM, but I'm looking for ideas on how to squeeze out more juice. 15-20 tp/s is probably the sweet spot here, but I'm not sure if anyone has achieved it with a model of 26B parameters or more.