Been running GLM-5.1 + Qwen 3.5 via Ollama Cloud — the harness matters more than the model
After going deep on local vs. cloud model comparisons, I landed on a setup that's been working really well: GLM-5.1 as the planning model, Qwen 3.5 as the execution model, both accessed via Ollama Cloud (the :cloud suffix routes requests through ollama.com rather than directly to the model providers).
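For concreteness, here's a minimal sketch of the :cloud suffix mechanic. The model tags are illustrative (check what your account actually exposes), and the commented-out client call is just where the tag would plug in, not a claim about your exact setup.

```python
# Sketch of the :cloud suffix mechanic (model tags are illustrative;
# the suffix behavior is as described in the post: route via ollama.com).
def cloud(model: str) -> str:
    """Tag a model so Ollama routes the request through ollama.com
    instead of resolving it against a local model."""
    return model if model.endswith(":cloud") else f"{model}:cloud"

# With the real Python client you'd then call something like:
#   import ollama
#   ollama.chat(model=cloud("glm-5.1"), messages=[...])
print(cloud("glm-5.1"))  # glm-5.1:cloud
```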
The cost angle is hard to ignore. GLM-5.1 hits ~94.6% of Claude Opus 4.6’s coding score at a fraction of the price, and Qwen 3.5 is Apache 2.0 with near-frontier performance on agentic tasks.
But here’s the thing most benchmark posts miss: the harness is at least as important as the model. SWE-bench Pro shows a 22-point swing on identical model weights just by changing the agent scaffold. A mid-tier model in a good harness can beat a frontier model in a bad one. The model sets the ceiling; the harness determines how close you get to it.
For the harness I’ve been using oh-my-pi (https://github.com/can1357/oh-my-pi) and it’s been excellent. Role-based model routing means GLM-5.1 handles planning (slow/plan role) and Qwen 3.5 takes execution (default). Hash-anchored edits, LSP integration, persistent IPython kernel, proper subagent support — it’s the kind of thoughtful tooling that actually gets you close to the model’s potential instead of leaving 20 points on the table.
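The planner/executor split above can be sketched as a tiny routing table. To be clear, this is my own illustration, not oh-my-pi's actual config schema; the role names and model tags are assumptions matching the setup described here.

```python
# Hypothetical role-based model routing, mirroring the split described
# above (role names and model tags are illustrative, not oh-my-pi's
# real configuration format).
ROUTES = {
    "plan": "glm-5.1:cloud",     # slow/plan role: stronger planning model
    "default": "qwen3.5:cloud",  # everything else: fast execution model
}

def model_for(role: str) -> str:
    """Resolve an agent role to a model tag, falling back to default."""
    return ROUTES.get(role, ROUTES["default"])

print(model_for("plan"))  # glm-5.1:cloud
print(model_for("edit"))  # qwen3.5:cloud
```

The point of the fallback is that only roles you explicitly care about get routed to the expensive model; anything unrecognized lands on the cheap execution path.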
If you’re evaluating local or cloud-hybrid setups, don’t just swap models and call it a benchmark. Fix your harness first.