u/Fit_Baker4577

Mac Mini M4 16GB (hermes agent) - Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf

Hey guys, I've been running Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf locally on my Mac Mini M4 16GB. I'd like some input on how I can tweak this further to improve tokens per second (tp/s). My setup is as in the title, and my current config is below.

  --ctx-size 65536 (hermes agent floor threshold)
  --n-gpu-layers 0
  --mmap
  --flash-attn on -ctk q8_0 -ctv q8_0
  --parallel 1
  --fit on
  --threads 8
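
For anyone who wants to reproduce or vary this config, here is a minimal sketch that assembles the same flags into one command line. The binary name `llama-server` and the model path are assumptions (the post only implies an HTTP server and gives the file name); `build_command` is a hypothetical helper, not part of any tool.

```python
import shlex

# Assumptions: the post does not name the binary or where the model file lives.
SERVER_BINARY = "llama-server"
MODEL_PATH = "Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf"


def build_command(n_gpu_layers=0, ctx_size=65536, threads=8):
    """Return the argv list for one run, using the flags listed in the post.

    Only the three keyword arguments vary; everything else is fixed to the
    values from the original config.
    """
    return [
        SERVER_BINARY,
        "--model", MODEL_PATH,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--mmap",
        "--flash-attn", "on",
        "-ctk", "q8_0",
        "-ctv", "q8_0",
        "--parallel", "1",
        "--fit", "on",
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for the baseline config.
    print(shlex.join(build_command()))
```

Changing one knob at a time (for example `build_command(n_gpu_layers=15)`) makes it easier to tell which setting caused a speedup or a failure.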

I've tried CPU/GPU offloading with -cmoe and with --n-gpu-layers set to 40, 30, 20, and 15, but every attempt failed with an HTTP 500 compute error. I probably did something wrong or misunderstood the setup.
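
A rough back-of-the-envelope check on how tight memory is, as a sketch only. It assumes 26 billion parameters (read off the "26b" in the file name) and the nominal 4.25 bits per weight of IQ4_XS; the UD quant mixes types, so the real file size may differ. It ignores the KV cache and the OS, and it does not diagnose the HTTP 500 by itself.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumptions, not measurements:
N_PARAMS = 26e9          # from "26b" in the model file name
BITS_PER_WEIGHT = 4.25   # nominal IQ4_XS rate; UD quants mix types
TOTAL_RAM_GB = 16        # Mac Mini M4 unified memory from the post

weights_gb = weight_footprint_gb(N_PARAMS, BITS_PER_WEIGHT)
headroom_gb = TOTAL_RAM_GB - weights_gb

if __name__ == "__main__":
    print(f"weights:  ~{weights_gb:.1f} GB")
    print(f"headroom: ~{headroom_gb:.1f} GB for KV cache, OS, and everything else")
```

Under these assumptions the weights alone take about 13.8 GB of the 16 GB, which leaves little room for a 65536-token context once layers are moved to the GPU.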

Average speed without any CPU/GPU offloading is around 6-8 tp/s. Honestly, that's pretty decent for a locally hosted LLM, but I'm looking for ideas on how to squeeze out more juice. 15-20 tp/s would probably be the sweet spot, but I'm not sure anyone has achieved that with a 26B+ param model.
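
To put numbers on that gap, here is a small sketch of the arithmetic. Both helper names are hypothetical; the ranges are the ones quoted above.

```python
def tokens_per_second(n_tokens, elapsed_seconds):
    """Generation speed for one run: tokens produced divided by wall time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_tokens / elapsed_seconds


def required_speedup(current_range, target_range):
    """Return (best_case, worst_case) speedup factors between two tp/s ranges.

    Best case: fastest current run reaching the low end of the target.
    Worst case: slowest current run reaching the high end of the target.
    """
    cur_lo, cur_hi = current_range
    tgt_lo, tgt_hi = target_range
    return tgt_lo / cur_hi, tgt_hi / cur_lo


if __name__ == "__main__":
    best, worst = required_speedup((6, 8), (15, 20))
    print(f"needed speedup: {best:.2f}x to {worst:.2f}x")
```

Going from 6-8 tp/s to 15-20 tp/s means roughly a 1.9x to 3.3x speedup, so small tweaks alone are unlikely to close the gap.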

reddit.com
u/Fit_Baker4577 — 1 day ago

Mac Mini M4 16GB (hermes agent) - Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf

Hey guys, I've been running Gemma-4-26b-a4b-it-UD-IQ4_XS.gguf on my Mac Mini M4 16GB. I'd like some input on how I can tweak this further to improve tokens per second (tp/s). My setup is as in the title, and my current config is below.

  --ctx-size 65536 (hermes agent floor threshold)
  --n-gpu-layers 0
  --mmap
  --flash-attn on -ctk q8_0 -ctv q8_0
  --parallel 1
  --fit on
  --threads 8

I've tried CPU/GPU offloading with -cmoe and with --n-gpu-layers set to 40, 30, 20, and 15, but every attempt failed with an HTTP 500 compute error. I probably did something wrong or misunderstood the setup.

Average speed without any CPU/GPU offloading is around 6-8 tp/s. Any ideas on how I can squeeze out more juice? 15-20 tp/s would probably be the sweet spot, but I'm not sure anyone has achieved it.

u/Fit_Baker4577 — 2 days ago