
Struggling on MI50 (gfx906), very slow with just ~10k ctx, am I doing something wrong?
Hi, I'm new to local LLMs and I've got 4x AMD Instinct MI50 32GB (128GB total) on a Supermicro H12SSL-i mobo. I tried using Qwen3.6 with Claude Code, but even without referencing files or installing any skills or MCP servers, the harness already sits at ~20k tokens of context from the start, and I often see tps drop to 1 or even 0.1 in Omniroute's (my API router) log panel.
Meanwhile other homelabbers easily get ~80 tps or even ~100 tps on a single RTX 3090, without fighting through all the rocm+pytorch+triton+vllm version matching, patching, and rocblas library chaos, so it feels pretty unfair. Am I doing something very stupid in my server setup, or is this just fate and punishment for cutting corners and buying AMD cards?
Anyway, back to the analysis. I followed the recipe from a repo that reportedly works, https://arkprojects.space/wiki/AMD_GFX906/vllm/recipes/Qwen3.6-35B-A3B, and converted it into a docker command:
docker run -d \
--name vllm-gfx906-mixa3607 \
--network host \
--ipc host \
--pid host \
--privileged \
--cap-add=SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--group-add $(getent group render | cut -d: -f3) \
--volume /sys:/sys:ro \
--volume $HOME/.triton:/root/.triton \
-v /media/docker/mount/vllm/models:/models \
--shm-size=16g \
-e HSA_OVERRIDE_GFX_VERSION=9.0.6 \
-e FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" \
-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS="1" \
mixa3607/vllm-gfx906:0.20.1-rocm-7.2.1-aiinfos \
vllm serve /models/cyankiwi-Qwen3.6-35B-A3B-AWQ-4bit \
--served-model-name qwen3.6 \
--tensor-parallel-size 4 \
--port 8100 \
--async-scheduling \
--trust-remote-code \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--max-model-len 200000 \
--data-parallel-size 1 \
--dtype float16 \
--gpu-memory-utilization 0.95 \
--limit-mm-per-prompt '{"image": 20, "video": 4}' \
--max-num-seqs 16 \
--enable-expert-parallel \
--enable-prefix-caching
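To make sure the router isn't the problem, I can also hit vLLM's OpenAI-compatible endpoint directly from the host as a quick sanity check (port 8100 and the served name qwen3.6 come from the command above):
curl http://localhost:8100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.6", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 16}'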
And I tried to benchmark with the following script, run directly inside the container's shell, so there's no API router overhead:
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
--dataset-name random \
--random-input-len 10000 \
--random-output-len 1000 \
--num-prompts 4 \
--request-rate 10000 \
--ignore-eos
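During the runs I can also keep an eye on whether all four cards are actually busy. rocm-smi ships with ROCm (exact flags may differ between versions, so treat this as a sketch):
watch -n 1 rocm-smi --showuse --showmemuse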
Result with 10k ctx:
============ Serving Benchmark Result ============
Successful requests: 4
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 72.19
Total input tokens: 40000
Total generated tokens: 4000
Request throughput (req/s): 0.06
Output token throughput (tok/s): 55.41
Peak output token throughput (tok/s): 88.00
Peak concurrent requests: 4.00
Total token throughput (tok/s): 609.53
---------------Time to First Token----------------
Mean TTFT (ms): 17451.07
Median TTFT (ms): 18025.08
P99 TTFT (ms): 26242.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 54.49
Median TPOT (ms): 53.97
P99 TPOT (ms): 63.98
---------------Inter-token Latency----------------
Mean ITL (ms): 54.49
Median ITL (ms): 45.98
P99 ITL (ms): 50.17
==================================================
with 20k ctx:
============ Serving Benchmark Result ============
Successful requests: 4
Failed requests: 0
Request rate configured (RPS): 20000.00
Benchmark duration (s): 96.08
Total input tokens: 80000
Total generated tokens: 4000
Request throughput (req/s): 0.04
Output token throughput (tok/s): 41.63
Peak output token throughput (tok/s): 76.00
Peak concurrent requests: 4.00
Total token throughput (tok/s): 874.24
---------------Time to First Token----------------
Mean TTFT (ms): 26404.19
Median TTFT (ms): 26443.89
P99 TTFT (ms): 40167.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 69.37
Median TPOT (ms): 69.38
P99 TPOT (ms): 82.77
---------------Inter-token Latency----------------
Mean ITL (ms): 69.37
Median ITL (ms): 55.24
P99 ITL (ms): 342.95
==================================================
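For interpretation, here's a rough back-of-envelope that separates prefill from decode speed using the numbers above (just a sketch; it assumes all four requests prefill concurrently, so the aggregate prefill figure is optimistic):
awk 'BEGIN {
  # 10k ctx run: 40000 input tokens, mean TTFT 17.45 s, mean TPOT 54.49 ms, 4 concurrent requests
  printf "10k ctx: ~%.0f tok/s prefill, ~%.0f tok/s decode (aggregate)\n", 40000/17.45, 4*1000/54.49
  # 20k ctx run: 80000 input tokens, mean TTFT 26.40 s, mean TPOT 69.37 ms
  printf "20k ctx: ~%.0f tok/s prefill, ~%.0f tok/s decode (aggregate)\n", 80000/26.40, 4*1000/69.37
}'
So prefill sits around 2.3-3k tok/s aggregate, and decode around 57-73 tok/s across 4 streams, i.e. roughly 14-18 tok/s per stream.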
Do these numbers look normal for a 4x MI50 setup? Is there anything I should test or tune? Thank you.