
MLX-serve vs LM Studio on Apple Silicon: ~40% faster in my benchmarks (w/ MTP/PLD)

Benchmarked mlx-serve against LM Studio on Apple Silicon today: roughly +40% faster overall, depending on the workload, when using the new Gemma 4 drafter with MTP (multi-token prediction), and PLD (prompt lookup decoding) for other models.

The gap is widest on echo/repetitive tasks like agentic code editing, where speculative decoding really kicks in (+122% on Gemma 4 E2B echo), and more modest on free-form generation (~+20%). Both servers ran the same MLX weights over HTTP, so it's a pretty apples-to-apples comparison.
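For anyone wondering why echo-heavy workloads benefit so much: prompt lookup decoding needs no draft model at all; it proposes the next tokens by copying what followed the same n-gram earlier in the context. A rough sketch of the idea (not mlx-serve's actual implementation):

```python
# Rough sketch of prompt lookup decoding (PLD): propose draft tokens by
# finding where the current suffix already appeared earlier in the context
# and copying the tokens that followed it. The big model then only has to
# verify the drafts instead of generating each token one by one.
def propose_draft(tokens, ngram_size=3, num_draft=8):
    """Return candidate draft tokens, or [] if no n-gram match is found."""
    if len(tokens) < ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan earlier context for the same n-gram, most recent match first
    # (stop before the suffix so it can't match itself).
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            cont = tokens[start + ngram_size:start + ngram_size + num_draft]
            if cont:
                return cont
    return []

# On echo-heavy inputs (e.g. an agent re-emitting a file it just read),
# the suffix almost always matches earlier context, so most tokens come
# from the draft and are verified in bulk -- hence the outsized speedup.
print(propose_draft(list("abcabcab")))  # -> ['c', 'a', 'b']
```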

It's a native Zig server, so there's no Python in the stack, and it exposes OpenAI- and Anthropic-compatible APIs if that matters to your setup. Posting in case anyone else is trying to squeeze more out of their M-series chip.
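If you want to poke at it, any OpenAI client should work by pointing the base URL at the local server. A minimal sketch that also measures throughput; the port and model name below are guesses, so check the repo's README for the real defaults:

```python
# Minimal sketch: hit an OpenAI-compatible /v1/chat/completions route on a
# local server and report decode throughput. Port and model name are
# assumptions, not mlx-serve's documented defaults.
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gemma",  # placeholder; use whatever model name the server reports
    messages=[{"role": "user", "content": "Write a haiku about M-series chips."}],
)
elapsed = time.perf_counter() - start

completion_tokens = resp.usage.completion_tokens if resp.usage else 0
print(resp.choices[0].message.content)
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"({completion_tokens / elapsed:.1f} tok/s)")
```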

https://github.com/ddalcu/mlx-serve

u/FootballSuperb664 — 4 days ago

I see a lot of model quality benchmarks, but none that test the actual endpoints of servers to make sure they work well.

If we build agents locally, how do we know LM Studio/Ollama/MLX work properly?

I'm talking about proper spec testing of the Responses API, Chat Completions API, and Anthropic Messages API.

Found this repo, but it only covers Responses. Is there one for Completions and Messages?

https://github.com/openresponses/openresponses

I see a lot of problems and crashes once you go beyond simple Chat Completions, especially with LM Studio.
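In the meantime, a crude smoke test along these lines catches the worst offenders. This is just a sketch asserting the shape the public OpenAI Chat Completions spec requires; the base URL assumes LM Studio's usual default port, so adjust it for your server:

```python
# Hypothetical smoke test for Chat Completions conformance: call a local
# server and assert the response carries the fields the OpenAI spec
# requires. Endpoint/port are assumptions; adapt to your setup.
import requests

BASE = "http://localhost:1234/v1"  # LM Studio's usual default port

resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "ping"}],
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

# Minimal schema checks per the OpenAI Chat Completions spec.
assert body.get("object") == "chat.completion"
assert body["choices"][0]["message"]["role"] == "assistant"
assert "content" in body["choices"][0]["message"]
assert body["choices"][0].get("finish_reason") in ("stop", "length", "tool_calls")
assert {"prompt_tokens", "completion_tokens", "total_tokens"} <= body["usage"].keys()
print("basic chat.completions shape OK")
```

A real suite would also cover streaming chunks, tool calls, error codes, and the Anthropic Messages equivalents, which is exactly where the crashes tend to show up.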

u/FootballSuperb664 — 10 days ago