u/simi6a6

Struggle on MI50 (gfx906), very slow with just ~10k ctx, am I doing something wrong?


Hi, I'm new to local LLMs and I got 4x AMD Instinct MI50 32GB (128GB total), with a Supermicro H12SSL-i as the mobo. I tried to use Qwen3.6 with Claude Code, but even without referencing files or installing skills or MCP servers, the harness already starts at ~20k context, and I often see tps drop to 1 or even 0.1 in Omniroute's (my API router) log panel.

Meanwhile other homelabbers easily get ~80 or even ~100 tps with just a single RTX 3090, without wrestling with all the ROCm + PyTorch + Triton + vLLM version matching, patching, and rocBLAS library chaos, so I feel pretty demoralized. Am I doing something very stupid in my server setup, or is it just fate and punishment for cutting corners and buying AMD cards?

Anyway, back to analysis. I followed the recipe from a repo that others have had success with:

https://arkprojects.space/wiki/AMD_GFX906/vllm/recipes/Qwen3.6-35B-A3B

and converted it into a docker command:

docker run -d \
  --name vllm-gfx906-mixa3607 \
  --network host \
  --ipc host \
  --pid host \
  --privileged \
  --cap-add=SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add $(getent group render | cut -d: -f3) \
  --volume /sys:/sys:ro \
  --volume $HOME/.triton:/root/.triton \
  -v /media/docker/mount/vllm/models:/models \
  --shm-size=16g \
  -e HSA_OVERRIDE_GFX_VERSION=9.0.6 \
  -e FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" \
  -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS="1" \
  mixa3607/vllm-gfx906:0.20.1-rocm-7.2.1-aiinfos \
  vllm serve /models/cyankiwi-Qwen3.6-35B-A3B-AWQ-4bit \
    --served-model-name qwen3.6 \
    --tensor-parallel-size 4 \
    --port 8100 \
    --async-scheduling \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --max-model-len 200000 \
    --data-parallel-size 1 \
    --dtype float16 \
    --gpu-memory-utilization 0.95 \
    --limit-mm-per-prompt '{"image": 20, "video": 4}' \
    --max-num-seqs 16 \
    --enable-expert-parallel \
    --enable-prefix-caching
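One thing I wondered about is whether --max-model-len 200000 even fits in the KV cache across 4x 32GB. Rough back-of-envelope math below; note the layer/head numbers are placeholders I made up, not the real Qwen3.6-35B-A3B config (the real values are in the model's config.json):

```python
# Rough KV-cache size per token:
#   2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_elem
# NOTE: layer/head values below are made-up placeholders, NOT the real
# model config -- read them from the model's config.json for a real number.
num_layers = 48        # placeholder
num_kv_heads = 4       # placeholder (GQA)
head_dim = 128         # placeholder
bytes_per_elem = 2     # float16 KV cache

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
ctx = 200_000          # --max-model-len
total_gib = kv_bytes_per_token * ctx / 2**30
print(f"{kv_bytes_per_token} bytes/token, {total_gib:.1f} GiB for one full 200k context")
```

With these placeholder numbers a single full-length sequence would eat ~18 GiB of KV cache on top of the weights, so the cache pool left over by --gpu-memory-utilization 0.95 matters a lot here.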

And I tried to benchmark with the following command directly in the container's bash, so there is no API-router overhead:

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
--dataset-name random \
--random-input-len 10000 \
--random-output-len 1000 \
--num-prompts 4 \
--request-rate 10000 \
--ignore-eos

And the results were as follows:

============ Serving Benchmark Result ============
Successful requests:                     4         
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  72.19     
Total input tokens:                      40000     
Total generated tokens:                  4000      
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         55.41     
Peak output token throughput (tok/s):    88.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          609.53    
---------------Time to First Token----------------
Mean TTFT (ms):                          17451.07  
Median TTFT (ms):                        18025.08  
P99 TTFT (ms):                           26242.86  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.49     
Median TPOT (ms):                        53.97     
P99 TPOT (ms):                           63.98     
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.49     
Median ITL (ms):                         45.98     
P99 ITL (ms):                            50.17     
==================================================
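To make sense of the report, I backed out rough per-phase rates from the numbers above (very approximate, since all 4 requests prefill more or less together and queueing inflates the mean TTFT):

```python
# Derive approximate per-phase throughput from the benchmark report above.
total_input = 40_000       # "Total input tokens"
mean_ttft_s = 17.451       # "Mean TTFT (ms)" / 1000
mean_tpot_s = 0.05449      # "Mean TPOT (ms)" / 1000

# All 4 requests land at once, so treat TTFT as the aggregate prefill window:
prefill_tok_s = total_input / mean_ttft_s
# Per-request decode rate is just the inverse of TPOT:
decode_tok_s = 1 / mean_tpot_s

print(f"~{prefill_tok_s:.0f} tok/s aggregate prefill, "
      f"~{decode_tok_s:.1f} tok/s decode per request")
```

So roughly ~2.3k tok/s of prefill and ~18 tok/s of decode per active request, which matches the single-digit tps I see once the Claude Code harness stuffs the context.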

With a 20k input length:

============ Serving Benchmark Result ============
Successful requests:                     4         
Failed requests:                         0         
Request rate configured (RPS):           20000.00  
Benchmark duration (s):                  96.08     
Total input tokens:                      80000     
Total generated tokens:                  4000      
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         41.63     
Peak output token throughput (tok/s):    76.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          874.24    
---------------Time to First Token----------------
Mean TTFT (ms):                          26404.19  
Median TTFT (ms):                        26443.89  
P99 TTFT (ms):                           40167.30  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          69.37     
Median TPOT (ms):                        69.38     
P99 TPOT (ms):                           82.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           69.37     
Median ITL (ms):                         55.24     
P99 ITL (ms):                            342.95    
==================================================
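Comparing the two runs (values copied straight from the two reports above):

```python
# Scaling check between the 10k and 20k runs above.
ttft_10k, ttft_20k = 17451.07, 26404.19   # Mean TTFT (ms)
tpot_10k, tpot_20k = 54.49, 69.37         # Mean TPOT (ms)

print(f"2x input -> TTFT x{ttft_20k / ttft_10k:.2f}, "
      f"TPOT x{tpot_20k / tpot_10k:.2f}")
```

Doubling the input makes TTFT ~1.5x and TPOT ~1.3x, so decode itself also slows as the KV cache grows, which I assume is attention cost over the longer context rather than just prefill.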

Do these numbers look normal for a 4x MI50 setup? Anything I should test or tune? Thank you.

u/simi6a6 — 4 days ago