Set context limit by model manually - is there a way? Hybrid local Gemma via oMLX + Cloud Models.
Struggling with this. I've got a killer hybrid setup running on a base 32 GB Mac Mini M4: a local model (Gemma 26B) handles triage and privacy-centric tasks, and since switching the local stuff from Ollama to oMLX I'm getting upwards of 20 tokens/s (roughly double what Ollama gave me). The problem: oMLX isn't accurately reporting the context limit I set in it (64k), so I can't use the =0 setting in the config where Hermes automatically sets the context to whatever the model reports as its max. Instead it falls back to a 32k context limit for the local model, which Hermes then rejects for being below the 64k requirement.
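For anyone debugging the same thing: one way to see what the server is actually advertising (as opposed to what you configured) is to fetch its OpenAI-compatible `/v1/models` response and look for a context field. This is just a sketch under assumptions — I don't know which field name oMLX uses (different servers use `max_context_length`, `context_length`, or `max_model_len`), and the model id here is made up:

```python
import json

def reported_context_length(models_json: str, model_id: str):
    """Pull the advertised context window for one model out of an
    OpenAI-compatible /v1/models response. The field name varies by
    server, so try the common candidates in turn."""
    data = json.loads(models_json)
    for entry in data.get("data", []):
        if entry.get("id") == model_id:
            for key in ("max_context_length", "context_length", "max_model_len"):
                if key in entry:
                    return int(entry[key])
    return None  # model not listed, or no context field exposed

# Hypothetical reply shaped like a typical OpenAI-style /v1/models response
sample = json.dumps({
    "object": "list",
    "data": [{"id": "gemma-26b", "object": "model", "context_length": 32768}],
})
print(reported_context_length(sample, "gemma-26b"))  # → 32768
```

If the function comes back with 32768 (or None) even after you set 64k, that would confirm the fallback is happening on the oMLX side rather than in Hermes.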
Has anyone gotten oMLX to correctly report the local context limit? Or, alternatively, found a way to set context limits manually per model? My goal is to keep the lower limit for the local models while still getting the benefit of large contexts when I outsource hard tasks to cloud models. Profiles, subagents, and kanban have all been dead ends for me so far.