r/oMLX

▲ 38 r/oMLX

oMLX 0.3.9.dev2 released.

Highlights:
- Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text decodes much faster now
- Gemma 4 on the DFlash engine (thanks to @bstnxbt's dflash-mlx)
- ParoQuant support
- omlx launch copilot joins claude / codex / opencode / openclaw / pi (quick sketch below)
- Restart server button right in the admin UI
- oQ auto-builds a proxy when the model can't fit in RAM
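
For anyone who hasn't used the launch integrations before, the new entry follows the same `omlx launch <agent>` pattern as the existing ones. A minimal sketch, inferred from the highlight above rather than from docs:

```sh
# Hand the running oMLX server to a copilot session
# (same pattern as the other agents: omlx launch <agent>)
omlx launch copilot
```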

Plus a lot of bug fixes and 20 new contributors in this cycle.

reddit.com
u/d4mations — 1 day ago
▲ 7 r/oMLX

oMLX use in Hermes

I have oMLX installed on an M1 Max Mac Studio, set up for network access (0.0.0.0), and Hermes installed on a separate Mac mini (Intel). I’ve configured a custom model in the Hermes config pointing at oMLX, but no matter what I try I cannot get Hermes to talk to oMLX.
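
For what it's worth, this is the sanity check I've been running from the Mac mini, assuming oMLX exposes the usual OpenAI-compatible routes (the address and port below are placeholders for mine):

```sh
# From the Mac mini: list models to confirm the server is reachable.
# Replace 192.168.1.50:8080 with the Studio's IP and the oMLX port.
curl http://192.168.1.50:8080/v1/models
```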

Has anyone had any success with this setup?

reddit.com
u/aptonline — 1 day ago
▲ 19 r/oMLX

2x-6x Speed improvements with oMLX

Hi everyone, I've spent quite a bit of time trying to get some of the newer Qwen 3.6 (27B/35B) or Gemma 4 (26B/31B) models working on oMLX with D-Flash, MTP, and TurboQuant, but I have had no success. The speed improvements people are reporting would go a long way toward putting more of my workloads on local horsepower.

In fact, when I try to run this on my M1 Max 64GB machine, speed is negatively impacted. It's been rough.

Anyone had any success? What are you running, or what resources did you leverage to get there?

reddit.com
u/roaringpup31 — 3 days ago
▲ 23 r/oMLX

Pi coding agent is amazing (or how I learned to stop worrying and leave OpenCode)

Warning: long post ahead. On the plus side, it’s completely human-written. No AI slop was used in writing this post. I’m old school that way, I like to actually write my own Reddit posts. Thought you all would appreciate something written entirely by a human for a change. ;)

Disclaimer: this post says nice things about Pi. I am not associated with the dev team of Pi coding agent in any way.

Yesterday I tried Pi coding agent on my local LLM rig for the first time. I had been using OpenCode as my daily driver agentic harness, and I had been intimidated by Pi’s stripped down, minimalist approach.

My rig, by the way, is an M4 MacBook Pro with 64GB of RAM. oMLX is the backend, serving up jundot’s quant of qwen3.6:35b-a3b-oQ6. I average around 60 tokens/second at around 80 percent RAM usage.

My coding needs are fairly modest. I run around eight static websites for my hobby board gaming group, hosted on GitHub Pages. So the daily tasks usually involve updating sites with user submissions, implementing feature requests, squashing minor bugs, things of that sort.

I had gotten used to the security blanket of OpenCode, with its set of built-in tools. I had come to accept that sometimes OpenCode will take a little longer to answer a request, and had gotten used to its sometimes dumb little oversights and charmingly stupid mistakes.

For example, I often ask OpenCode to make a 3x3 image collage of board game cover images using ImageMagick command line tools. It would usually take several revisions, as OpenCode would first render them in a straight line row instead of a 3x3 grid. Then after feedback, render a 3x3 grid, but each image was a different size. Then after even more feedback, it would finally output a 3x3 grid of equally sized images.
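
(For the record, the whole task boils down to a single ImageMagick call, which is why the flailing is so frustrating. This is roughly the command I'm always hoping the agent will land on:)

```sh
# 3x3 grid of equally sized covers; -geometry sizes each tile to 400x400.
# (Use "montage" without the "magick" prefix on ImageMagick 6.)
magick montage cover*.jpg -tile 3x3 -geometry 400x400+0+0 collage.jpg
```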

You know the old saying about LLMs acting like green interns? In my case, OpenCode often acts like an intern who needs the instructions explained multiple times before they get the task right.

But at least OpenCode was the evil intern that I was familiar with. As I said, I had gotten used to working within its limitations and quirks.

Anyway, yesterday I decided to overcome my nervousness about leaving the security blanket of OpenCode and dive into the unknown depths of Pi coding agent. I gave Pi the exact same task using a similar prompt: create a 3x3 grid of the cover images of these specified board games, each image 400x400 pixels.

Pi methodically went about the task. First it identified which images were available locally and which were not. Then it web searched the websites to grab the missing images and download them locally. Then it created the 3x3 grid, to my desired specs, right the first time. I was blown away at how much better, faster, more accurate, and more capable it felt working with Pi vs. OpenCode. I didn’t change the local model, I just changed the agentic harness. If OpenCode felt like working with an inexperienced intern, Pi felt more like working with a trustworthy and reliable teammate.

With OpenCode I had assumed it would be capable of only routine maintenance and updates, and that if ever I needed to do some heavier lifting, I would have to bust out a cloud frontier model like Codex. But I decided to give Pi a more challenging test to uncover its true capabilities. I asked Pi to plan step-by-step the addition of a search feature to one of my sites, with live filtering as the user types, a dropdown menu overlay matching the site’s existing CSS, etc.

Guess what, Pi made the plan, checked with me for my go-ahead, then started implementing the plan, task by task. It wasn’t perfect. There were a couple of points where functions were called in the wrong order. But I dutifully fed the web inspector errors to Pi; it quickly and correctly figured out the issues and fixed them. Within a few minutes, my search feature was working, pretty much exactly as I had envisioned it.
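
The heart of the feature is genuinely small, which is probably why a local model can nail it. Here’s a stripped-down sketch of the live-filter wiring, with made-up names standing in for my actual site code:

```javascript
// Live filtering: re-render the dropdown overlay on every keystroke.
// siteIndex stands in for the real index built from my game pages.
const siteIndex = [
  { title: 'Brass: Birmingham' },
  { title: 'Wingspan' },
  { title: 'Terraforming Mars' },
];

const input = document.querySelector('#search');            // search box
const dropdown = document.querySelector('#search-results'); // <ul> overlay

input.addEventListener('input', () => {
  const q = input.value.trim().toLowerCase();
  const hits = q ? siteIndex.filter(g => g.title.toLowerCase().includes(q)) : [];
  dropdown.innerHTML = '';
  for (const g of hits.slice(0, 10)) {
    const li = document.createElement('li');
    li.textContent = g.title;
    dropdown.appendChild(li);
  }
  dropdown.hidden = hits.length === 0; // hide the overlay when nothing matches
});
```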

Even more impressive: following Pi’s philosophy of “if you need extra features, ask Pi to build them”, I asked Pi to reflect on our coding session and then, based on that, suggest some enhancements to itself to address the main pain points. Pi identified that it needed a better auto-compact feature and a better way to seamlessly pick up in context where it left off, and built those features into itself. It also added a JS script to mitigate the function-call ordering issues we had encountered. So as you work with Pi, you gradually customize and improve it to fit the actual coding work you do.

Man, I was so impressed. Pi takes this local LLM thing from “works well enough for routine tasks” to “works well enough that I don’t think I need to fire up a cloud model”. I now have the confidence to leave OpenCode behind.

TL;DR: I overcame my fears and tried Pi instead of OpenCode, and had a great experience.

reddit.com
u/Konamicoder — 3 days ago
▲ 2 r/oMLX

Porting oMLX to C

I want to integrate oMLX into my project, but without a Python server.

What do you guys think of porting this to C to better integrate with apps?

Apple won’t let me sign an iPhone or iPad app since they don’t allow running a Python interpreter in an app.

reddit.com
u/swordsman1 — 16 hours ago
▲ 9 r/oMLX

What Works for Coding on an M5 with 24GB of Unified Memory

I am new to oMLX and relatively new to local LLMs. I have been trying to get Qwen 3.5, Qwen 3, or Gemma 4 running on my M5 with 24GB of unified memory using oMLX. I have tested a number of models from the mlx-community in the 13-15GB size range. To date, they all blow up with OOM a few minutes after starting a task.
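
The only knob I've turned up so far is macOS's wired-memory limit for the GPU. By default macOS only lets the GPU wire roughly two thirds to three quarters of unified memory, so on a 24GB machine a 13-15GB model plus KV cache is already pushing the ceiling. Raising it is a real sysctl, though I can't say whether it's the whole story in my case:

```sh
# Raise the GPU wired-memory limit (in MB); resets on reboot.
# Leave headroom for the OS, e.g. ~20GB on a 24GB machine.
sudo sysctl iogpu.wired_limit_mb=20480
```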

I would appreciate hearing what you have working for coding on a Mac with 24GB of RAM.

Is oMLX the best way to run them? I've been trying (hoping may be a better word) to find a TurboQuant model that will handle run-of-the-mill dev tasks and help minimize my costs on the larger models.

Thank you in advance! lbe

reddit.com
u/LearnedByError — 3 days ago
▲ 6 r/oMLX

How do you enable TurboQuant besides toggling it "on"? I see no peak memory reduction at any context length (8K, 32K, 131K) on either MoE model family (Gemma 4 or Qwen 3.5/3.6).

That's it. Is there something else to it besides turning it on and setting the bit depth? What am I missing? And where's the user manual for this thing so I can read it?

reddit.com
u/JLeonsarmiento — 2 days ago
▲ 2 r/oMLX (+1 crosspost)

Optimizing workflow concurrency on Mac/oMLX?

I've had a lot of success running differently-sized models using a bunch of different harnesses, but one place I haven't had much success is improving concurrent throughput, i.e. "running multiple workflows at once".

I can run multiple workflows at once, but my tok/sec drops significantly. I've tried using smaller models, but they still use all available GPU cores during processing. Is there a way to configure the runner to use only a portion of the available GPU cores?

reddit.com
u/numberwitch — 4 days ago
▲ 1 r/oMLX

Mac mini M4 Pro

I'm running oMLX on a Mac mini M4 Pro with 64GB of memory.

Using Qwen 3.6 35B UD MLX 4-bit.

I'm only getting around 353 tok/s for prompt processing and 15.6 tok/s for token generation.

Feels like I should have better performance than that. I don't have anything else running that's consuming memory or CPU.

I run VS Code, openclaw, and Hermes on another box over the network, and I've tried openclaw locally. All show roughly the same performance numbers.

What can I look at to find the cause of the slowness?

Thanks

reddit.com
u/benwaynet — 6 days ago
▲ 1 r/oMLX

Tool integration for local model usage

Hello,

Can someone point me to how to enable tools like Playwright MCP, or similar tools for common browser interactions? I was able to do it in LM Studio, but I'm lost in oMLX.
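
For reference, this is the mcp.json entry that worked for me in LM Studio; I assume oMLX would need something equivalent, if it supports MCP at all (that's exactly what I'm asking):

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```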

Thank you in advance.

reddit.com
u/MrApo87 — 5 days ago