
Qwen3.6-27B 8bit DFLASH performance vs num_speculative_tokens
I'm running Qwen3.6-27B 8bit on my RTX PRO 6000 Blackwell workstation edition and was trying to figure out the optimal setting for `num_speculative_tokens` when using DFLASH. So I ran some benchmarks, sweeping `num_speculative_tokens` from 1 to 20. Hopefully it's helpful to you guys!
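For each value of k I report the mean and sample standard deviation of throughput over repeated runs. A minimal sketch of that aggregation (the run numbers here are purely illustrative, not actual benchmark data):

```python
import statistics

# Aggregate repeated tok/s measurements for one k value into mean ± sample std.
def summarize(runs_tok_s):
    mean = statistics.mean(runs_tok_s)
    std = statistics.stdev(runs_tok_s) if len(runs_tok_s) > 1 else 0.0
    return mean, std

# e.g. three repeated runs at one k value (illustrative numbers):
mean, std = summarize([139.7, 140.0, 140.3])
print(f"{mean:.1f} ± {std:.1f} tok/s")  # → 140.0 ± 0.3 tok/s
```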
Here are the results in text format:
🏆 FINAL RESULTS
===============================================
k | Avg tok/s | ±std | Best?
---------------------------------------------------
1 | 67.4 | ± 0.1 |
2 | 88.8 | ± 0.1 |
3 | 102.5 | ± 0.8 |
4 | 116.1 | ± 0.1 |
5 | 124.7 | ± 0.1 |
6 | 127.6 | ± 0.1 |
7 | 126.6 | ± 0.1 |
8 | 133.8 | ± 0.1 |
9 | 126.8 | ± 0.4 |
10 | 136.8 | ± 0.1 |
11 | 140.0 | ± 0.3 | ← BEST
12 | 132.5 | ± 0.2 |
13 | 137.8 | ± 0.1 |
14 | 135.0 | ± 3.9 |
15 | 136.7 | ± 1.3 |
16 | 132.2 | ± 0.2 |
17 | 129.8 | ± 0.1 |
18 | 123.4 | ± 0.1 |
19 | 123.8 | ± 0.4 |
20 | 125.0 | ± 0.1 |
🎯 Recommended: k = 11 (140.0 tok/s)
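A quick summary of the sweep using the measured averages from the table above: k = 11 is the best setting and gives roughly a 2.1× throughput gain over k = 1.

```python
# Measured average tok/s per num_speculative_tokens (k), copied from the table above.
results = {
    1: 67.4, 2: 88.8, 3: 102.5, 4: 116.1, 5: 124.7,
    6: 127.6, 7: 126.6, 8: 133.8, 9: 126.8, 10: 136.8,
    11: 140.0, 12: 132.5, 13: 137.8, 14: 135.0, 15: 136.7,
    16: 132.2, 17: 129.8, 18: 123.4, 19: 123.8, 20: 125.0,
}

# Pick the k with the highest average throughput and compare to the k = 1 baseline.
best_k = max(results, key=results.get)
speedup = results[best_k] / results[1]
print(f"best k = {best_k}, {results[best_k]:.1f} tok/s, {speedup:.2f}x over k=1")
# → best k = 11, 140.0 tok/s, 2.08x over k=1
```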
Here's my vLLM setup:
qwen-vllm:  # ← Qwen3.6-27B via vLLM (OpenAI-compatible API)
  image: vllm/vllm-openai:latest
  container_name: qwen-vllm
  ipc: host
  shm_size: 32g  # Critical for large context + Qwen3.6 performance
  ports:
    - "8000:8000"  # OpenAI-compatible endpoint (http://localhost:8000/v1)
  volumes:
    - ~/.cache/huggingface:/root/.cache/huggingface  # Persists the ~55 GB model download
  environment:
    - HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    - HF_HUB_ENABLE_HF_TRANSFER=1
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all  # ← Change to 1 if you only want to use a single GPU
            capabilities: [ gpu ]
  command: >
    --model Qwen/Qwen3.6-27B-FP8
    --served-model-name qwen3.6-27b
    --host 0.0.0.0
    --port 8000
    --tensor-parallel-size 1
    --gpu-memory-utilization 0.90
    --max-model-len 262144
    --kv-cache-dtype auto
    --attention-backend flash_attn
    --max-num-batched-tokens 16384
    --max-num-seqs 24
    --trust-remote-code
    --enable-prefix-caching
    --enable-chunked-prefill
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 11}'
    -O3
  extra_hosts:
    - "host.docker.internal:host-gateway"
  networks:
    - hermes-net
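To sanity-check the endpoint once the container is up, here's a minimal client sketch. The URL and served model name come from the config above; `build_chat_request` is just a hypothetical helper, not part of vLLM:

```python
import json
import urllib.request

# Hypothetical helper: build a chat completion request for the endpoint
# defined in the compose file (served model name "qwen3.6-27b", port 8000).
def build_chat_request(prompt: str,
                       url: str = "http://localhost:8000/v1/chat/completions"):
    payload = {
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Actually sending it requires the server to be running:
# with urllib.request.urlopen(build_chat_request("Write a haiku about GPUs.")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```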
u/dxplq876 — 3 days ago