u/dxplq876

Qwen3.6-27B 8bit DFLASH performance vs num_speculative_tokens

I'm running Qwen3.6-27B 8-bit on my RTX PRO 6000 Blackwell Workstation Edition and wanted to figure out the optimal `num_speculative_tokens` setting for DFLASH speculative decoding, so I ran benchmarks varying it from 1 to 20. Hopefully it's helpful to you guys!
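For anyone who wants to reproduce this, the measurement loop is roughly the sketch below (simplified; the prompt, run count, and token budget are placeholders, not my exact harness). For each `k` I restarted the server with a different `num_speculative_tokens` and timed a fixed greedy generation through the OpenAI-compatible API:

```python
import statistics
import time

import requests

BASE_URL = "http://localhost:8000/v1"  # vLLM's OpenAI-compatible endpoint
PROMPT = "Explain how speculative decoding works."  # placeholder prompt


def measure_tok_per_s(n_runs: int = 3) -> list[float]:
    """Time a fixed completion request n_runs times; return tok/s per run."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            json={
                "model": "qwen3.6-27b",
                "messages": [{"role": "user", "content": PROMPT}],
                "max_tokens": 1024,
                "temperature": 0.0,  # greedy decoding keeps runs comparable
            },
            timeout=600,
        ).json()
        elapsed = time.perf_counter() - start
        rates.append(resp["usage"]["completion_tokens"] / elapsed)
    return rates


# Restart the server with a new num_speculative_tokens between calls,
# then record avg/std for that k:
rates = measure_tok_per_s()
print(f"avg {statistics.mean(rates):.1f} tok/s ± {statistics.stdev(rates):.1f}")
```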

Here are the results in text format:

🏆 FINAL RESULTS
===============================================
 k | Avg tok/s |  ±std | Best?
-----------------------------------------------
 1 |      67.4 | ± 0.1 |
 2 |      88.8 | ± 0.1 |
 3 |     102.5 | ± 0.8 |
 4 |     116.1 | ± 0.1 |
 5 |     124.7 | ± 0.1 |
 6 |     127.6 | ± 0.1 |
 7 |     126.6 | ± 0.1 |
 8 |     133.8 | ± 0.1 |
 9 |     126.8 | ± 0.4 |
10 |     136.8 | ± 0.1 |
11 |     140.0 | ± 0.3 | ← BEST
12 |     132.5 | ± 0.2 |
13 |     137.8 | ± 0.1 |
14 |     135.0 | ± 3.9 |
15 |     136.7 | ± 1.3 |
16 |     132.2 | ± 0.2 |
17 |     129.8 | ± 0.1 |
18 |     123.4 | ± 0.1 |
19 |     123.8 | ± 0.4 |
20 |     125.0 | ± 0.1 |

🎯 Recommended: k = 11 (140.0 tok/s)

Here's my vLLM setup:

  qwen-vllm: # ← Qwen3.6-27B via vLLM (OpenAI-compatible API)
    image: vllm/vllm-openai:latest
    container_name: qwen-vllm
    ipc: host
    shm_size: 32g                    # Critical for large context + Qwen3.6 performance
    ports:
      - "8000:8000"                  # OpenAI-compatible endpoint[](http://localhost:8000/v1)
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # Persists the ~55 GB model download
    environment:
      - HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      - HF_HUB_ENABLE_HF_TRANSFER=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # ← Change to 1 if you only want to use a single GPU
              capabilities: [ gpu ]
    command: >
      --model Qwen/Qwen3.6-27B-FP8
      --served-model-name qwen3.6-27b
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.90
      --max-model-len 262144
      --kv-cache-dtype auto
      --attention-backend flash_attn
      --max-num-batched-tokens 16384
      --max-num-seqs 24
      --trust-remote-code
      --enable-prefix-caching
      --enable-chunked-prefill
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 11}'
      -O3
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - hermes-net
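
Once it's up, anything that speaks the OpenAI API can point at it. A quick sanity check (the model name matches `--served-model-name` above):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# vLLM doesn't check the key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # matches --served-model-name in the compose file
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```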