u/legit_split_

Changed CPU now low FPS + high Host processing latency

Upgraded from the Intel Core Ultra 7 265K to the 270K Plus and now my stream is super laggy.

It used to work flawlessly; I also did a full CachyOS Linux reinstall. I'm using Sunshine on the host PC, with Moonlight on a Linux laptop and an Android phone as guests.

Troubleshooting steps:

- All settings are at their defaults

- Tried both an HDMI dongle and a regular monitor

- The iGPU appears pinned at full utilization while streaming, but shows low power draw

- HDMI is connected to the iGPU on purpose

Does anybody know what it could be?

u/legit_split_ — 2 days ago

More Qwen3.6-27B MTP success, but on dual MI50s

TLDR: The hype is real! 1.5x speedup. Up to 2x speedup with tensor parallelism!

After reading the PR I immediately hunted for MTP-compatible Q4_1 quants (they offer a small speedup on these compute-limited older cards) but couldn't find any.

Luckily I came across this post, which showed how to graft the MTP tensors onto your own quants, so I attached them to an Unsloth quant I already had.

Setup

  • CachyOS (Arch Linux)
  • ROCm 7.2
  • Both cards running at PCIe 4.0 x8

Built the llama.cpp fork https://github.com/skyne98/llama.cpp-gfx906 with https://github.com/ggml-org/llama.cpp/pull/22673 applied, then ran the following command together with the benchmark script included in the PR:

llama-server -m ~/models/Qwen3.6-27B-MTP-Q4_1.gguf \
--temp 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 \
--jinja --presence-penalty 1.5 \
--chat-template-kwargs '{"preserve_thinking": true}' \
-ub 2048 -b 2048 \
-fa 1 -np 1 \
--no-mmap --no-warmup \
-dev ROCm0,ROCm1 --fit on -fitt 256

Script Benchmark

Stock:

code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.2
code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.2
explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.3
summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.3
long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.0

With MTP on: --spec-type mtp --spec-draft-n-max 2

code_python        pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.6
code_cpp           pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.5
explain_concept    pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=36.7
summarize          pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=40.7
qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.4
translation        pred= 192 draft= 152 acc= 115 rate=0.757 tok/s=37.5
creative_short     pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.6
stepwise_math      pred= 192 draft= 146 acc= 118 rate=0.808 tok/s=39.0
long_code_review   pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=37.8

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1728,
 "total_draft": 1340,
 "total_draft_accepted": 1046,
 "aggregate_accept_rate": 0.7806,
 "wall_s_total": 51.42
}
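To put the two tables side by side, here are the per-task speedups (a quick sketch; the tok/s values are copied from the stock and MTP tables above):

```python
# Per-task tok/s copied from the stock and MTP-enabled tables above.
stock = {"code_python": 26.2, "code_cpp": 26.2, "explain_concept": 26.3,
         "summarize": 26.4, "qa_factual": 26.4, "translation": 26.4,
         "creative_short": 26.4, "stepwise_math": 26.3, "long_code_review": 26.0}
mtp = {"code_python": 39.6, "code_cpp": 36.5, "explain_concept": 36.7,
       "summarize": 40.7, "qa_factual": 39.4, "translation": 37.5,
       "creative_short": 36.6, "stepwise_math": 39.0, "long_code_review": 37.8}

# Ratio of MTP throughput to stock throughput, best first.
speedups = {task: mtp[task] / stock[task] for task in stock}
for task, s in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{task:18s} {s:.2f}x")

avg = sum(speedups.values()) / len(speedups)
print(f"average speedup: {avg:.2f}x")
```

The average lands just under the 1.5x claimed in the TLDR, with the more "predictable" tasks (summarize, code_python, stepwise_math) benefiting the most.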

With tensor parallelism on: -sm tensor

code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=35.0
code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.8
explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.3

Combining MTP and tensor parallelism:

code_python        pred= 192 draft= 142 acc= 120 rate=0.845 tok/s=59.8
code_cpp           pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=56.6
explain_concept    pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=56.8
summarize          pred=  53 draft=  42 acc=  31 rate=0.738 tok/s=54.5
qa_factual         pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.8
translation        pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=57.3
creative_short     pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=54.8
stepwise_math      pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=59.6
long_code_review   pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1589,
  "total_draft": 1214,
  "total_draft_accepted": 970,
  "aggregate_accept_rate": 0.799,
  "wall_s_total": 32.24
}

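As a sanity check on the two aggregate blobs: in speculative decoding each verification step emits one target token plus however many draft tokens were accepted, so you can back out the effective tokens per forward pass and the overall throughput (a quick sketch; the numbers are copied from the JSON above):

```python
# Aggregate numbers copied from the MTP-only and MTP + tensor-parallel runs.
runs = {
    "MTP only": {"predicted": 1728, "accepted": 1046, "wall_s": 51.42},
    "MTP + TP": {"predicted": 1589, "accepted": 970, "wall_s": 32.24},
}

throughput = {}
for name, r in runs.items():
    # One target token per verification step, plus accepted drafts:
    # steps = predicted - accepted.
    steps = r["predicted"] - r["accepted"]
    tokens_per_step = r["predicted"] / steps
    throughput[name] = r["predicted"] / r["wall_s"]
    print(f"{name}: {tokens_per_step:.2f} tokens/step, "
          f"{throughput[name]:.1f} tok/s overall")
```

Both runs squeeze roughly 2.5 tokens out of each forward pass (close to the `--spec-draft-n-max 2` ceiling of 3), so the extra gain from tensor parallelism comes from faster passes, not better acceptance.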
Real-world benchmark

The numbers above look absolutely insane, but in the real world the speedup dwindles quickly, and there is also a regression in prefill speed that is currently being worked on. I ran this 18k-token coding prompt, and it's clear the 60 tok/s is only observable for very short prompts. Still, combining MTP and tensor parallelism does net a hefty 2x decode speedup.

Stock:

prompt eval time =   53173.24 ms / 19191 tokens (    2.77 ms per token,   360.91 tokens per second)
      eval time =  337695.94 ms /  7791 tokens (   43.34 ms per token,    23.07 tokens per second)
     total time =  390869.18 ms / 26982 tokens

With MTP on:

prompt eval time =   84388.11 ms / 19191 tokens (    4.40 ms per token,   227.41 tokens per second)
      eval time =  260732.83 ms /  8408 tokens (   31.01 ms per token,    32.25 tokens per second)
     total time =  345120.94 ms / 27599 tokens

With tensor parallelism:

prompt eval time =   41925.27 ms / 19191 tokens (    2.18 ms per token,   457.74 tokens per second)
       eval time =  253262.25 ms /  8104 tokens (   31.25 ms per token,    32.00 tokens per second)
      total time =  295187.53 ms / 27295 tokens

Combining MTP and tensor parallelism:

prompt eval time =   49696.04 ms / 19191 tokens (    2.59 ms per token,   386.17 tokens per second)
       eval time =  155821.64 ms /  7440 tokens (   20.94 ms per token,    47.75 tokens per second)
      total time =  205517.69 ms / 26631 tokens
u/legit_split_ — 5 days ago