u/Nevermore1215 — reddlx

THIS WAS VIBECODED TO THE MAX. But it works: my Hermes agent is using MTP and it is god-awfully fast!

Built and deployed Qwen3.6-27B with Multi-Token Prediction (MTP) speculative decoding on an RTX 3090 (24GB). MTP drafts several tokens per forward pass, which the main model then verifies. That got me 65 tok/s decode speed, a 2.6x improvement over the ~25 tok/s baseline for a 27B dense model. The deployment needed three things: a custom llama.cpp build from an unmerged PR, careful VRAM management around the services already running on the card, and a multi-agent handoff pipeline (Claude → Local Agent) to finish the job.
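For anyone who wants to sanity-check the speedup claim, here is a minimal sketch of the arithmetic. It only uses the two throughput numbers quoted in the post (65 tok/s with MTP, ~25 tok/s baseline). The helper names and the 1000-token reply length are made up for illustration and are not from the actual setup.

```python
# Back-of-the-envelope check of the numbers quoted in the post.
# The function names and REPLY_TOKENS are hypothetical, for illustration only.

MTP_TOK_S = 65.0       # decode speed with MTP, from the post
BASELINE_TOK_S = 25.0  # approximate baseline decode speed, from the post
REPLY_TOKENS = 1000    # assumed reply length, not from the post


def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """Ratio of the two decode throughputs."""
    return fast_tok_s / slow_tok_s


def ms_per_token(tok_s: float) -> float:
    """Average wall-clock time per generated token, in milliseconds."""
    return 1000.0 / tok_s


def seconds_for_reply(n_tokens: int, tok_s: float) -> float:
    """Time to decode n_tokens at a steady throughput (ignores prompt processing)."""
    return n_tokens / tok_s


if __name__ == "__main__":
    print(f"speedup: {speedup(MTP_TOK_S, BASELINE_TOK_S):.1f}x")
    print(f"ms/token with MTP: {ms_per_token(MTP_TOK_S):.1f}")
    print(f"ms/token baseline: {ms_per_token(BASELINE_TOK_S):.1f}")
    print(f"{REPLY_TOKENS}-token reply with MTP: {seconds_for_reply(REPLY_TOKENS, MTP_TOK_S):.1f} s")
    print(f"{REPLY_TOKENS}-token reply baseline: {seconds_for_reply(REPLY_TOKENS, BASELINE_TOK_S):.1f} s")
```

This only covers decode throughput. Prompt processing time and how often the drafted tokens get accepted are not modeled here, so real end-to-end latency will differ.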

Guide