u/310dweller

Set context limit by model manually - is there a way? Hybrid local Gemma via oMLX + Cloud Models.

Struggling with this. I've got a killer hybrid setup running on a 32 GB Mac Mini M4 (base), where a local model (Gemma 26B) handles triage and privacy-centric tasks. I'm now getting upwards of 20 tokens/s since switching the local stuff from Ollama to oMLX (Ollama was about half that). However, oMLX is not accurately reporting the context limit I set in it (64k), so I can't use the =0 setting in the config, where Hermes automatically sets the context to whatever the model max is. Instead it falls back to a 32k context limit for the local model, which Hermes then rejects for being below the 64k requirement.

Has anyone had experience with oMLX and gotten it to correctly report the local context limit? Or, alternatively, found a way to set context limits manually per model? My goal is to keep the lower limit for the local models while still getting the benefit of the large contexts when I outsource hard tasks to cloud models. Profiles, subagents, and kanban have all been dead ends for me so far.
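In case it helps frame the question, here's a minimal sketch of the per-model workaround I'm imagining: look up the context limit by model name instead of trusting what the server reports. All model names and limits below are made-up placeholders, not actual oMLX or Hermes config keys.

```python
# Hypothetical workaround sketch: choose a context limit per model name
# instead of relying on the server-reported value. Every name and number
# here is an illustrative assumption, not a real config.

DEFAULT_CTX = 32_768  # the fallback oMLX seems to land on

# Manual per-model overrides: local models keep the smaller window,
# cloud models get their large native windows.
CTX_BY_MODEL = {
    "gemma-26b-mlx": 65_536,    # local model served via oMLX
    "cloud-big-model": 200_000, # illustrative cloud-model figure
}

def context_limit(model: str) -> int:
    """Return the manual context limit for a model, or the fallback."""
    return CTX_BY_MODEL.get(model, DEFAULT_CTX)
```

Something like this mapping, if Hermes exposed a hook for it, would sidestep the misreported value entirely.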

u/310dweller — 2 days ago

My goal is to get Hermes running as a messaging triage and batching system (a funnel for email, iMessage, texts, IG, WhatsApp, LINE, WeChat, etc.). I'm running on a 32 GB M4 Mac Mini, and my hypothesis is that once it's set up, a local Gemma 4:26B instance will be enough for the little routing tasks (keeping things private), while kicking bigger tasks like research, coding, and complex email and document drafting up to cloud models.
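The routing idea above could be sketched roughly like this. The task labels and model names are assumptions for illustration, not Hermes' actual routing API:

```python
# Hypothetical triage sketch: private or lightweight work stays on the
# local model; heavy tasks go to a cloud model. Labels and model names
# are illustrative assumptions only.

HEAVY_TASKS = {"research", "coding", "drafting"}

def pick_model(task: str, contains_private_data: bool) -> str:
    """Route a task: anything private, or not heavy, stays local."""
    if contains_private_data or task not in HEAVY_TASKS:
        return "local/gemma-26b"  # runs on the Mac Mini via oMLX
    return "cloud/big-model"      # outsource the hard stuff
```

The key design point is that privacy is checked first, so sensitive content never leaves the machine even when the task is heavy.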

That said, my initial exploration suggests Hermes needs a lot of muscle to get set up to run smoothly, so I'm thinking I'll bring in a cloud model for that portion entirely. In y'all's experience, which model is best for setup? The new DeepSeek, Qwen, Sonnet, Gemini high..? Curious which is best for architecting the system out.

u/310dweller — 9 days ago