r/BlackwellPerformance

Hey everyone,

I was looking to see if anyone has dealt with this and can provide some guidance. I’m thinking about picking up 4 more RTX 6000s, either Max-Q or Workstation (probably Max-Q), on top of my existing build below.

Has anyone run into this who can shed some light on how they went about it?

Motherboard: ASRock WRX90 WS EVO
CPU: Ryzen Threadripper PRO 9985WX
GPU: RTX 6000 MAX-Q x 4
RAM: 768GB (8x96GB) - Vcolor DDR5 6400 TR596G64D452O
PSU: Super Flower Leadex Titanium 2800W ATX 3.1
Cooling: Silverstone SST-XE360-TR5 Server AIO Liquid Cooling
Case: Phanteks PH-ES620PC_BK02 Enthoo Pro Server Edition

That’s my existing build.

Someone suggested opting for a PCIe switch and doing it that way.

I found a post here from someone who did something similar.

https://www.reddit.com/r/LocalLLaMA/s/CZyOzj4S02

Thanks

u/Direct_Bodybuilder63 — 9 days ago
▲ 86

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q

TL;DR: DeepSeek-V4-Flash running at 85.52 tok/s @ 524k ctx and ~111 tok/s @ 128k single-stream on 2× RTX PRO 6000 Max-Q

pasta-paul's DeepSeek-V4-Flash-W4A16-FP8 quant is great, but its MTP head silently gets stripped at load time (HF transformers has it in _keys_to_ignore_on_load_unexpected), so --speculative-config '{"method":"mtp",...}' is a no-op.
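
One quick way to check whether a checkpoint actually ships MTP weights (and whether they survived your quant/serving stack) is to list the tensor names in its safetensors index. This is a minimal sketch, not from the post; the "mtp"/"nextn" patterns are assumptions, so match them to the real key names in the model's config:

import json
from huggingface_hub import hf_hub_download

# Fetch only the safetensors index (tensor name -> shard map), not the weights themselves.
index_path = hf_hub_download(
    "LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8",   # or pasta-paul's base quant
    "model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Hypothetical patterns: adjust to whatever the MTP block is really called in this checkpoint.
mtp_keys = [k for k in weight_map if "mtp" in k.lower() or "nextn" in k.lower()]
print(f"{len(mtp_keys)} MTP-looking tensors")
for k in sorted(mtp_keys)[:10]:
    print(" ", k)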

Retrofitted the MTP block, ran a GPTQ pass on its routed experts to match the base's W4A16 INT4 group format, and patched vLLM.

Decode goes from 52.85 tok/s (no MTP) → 85.52 tok/s @ 524k 2-stream → ~111 tok/s @ 128k single-stream. 671B total / 32B active, fits on 2× 96 GB.

Model: https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

Numbers

2× RTX PRO 6000 Blackwell Max-Q (96 GB each, no NVLink, sm_120):

Profile | Decode TPS | TTFT | Δ vs base
pasta-paul base, no MTP, 524k | 52.85 | 91 ms | reference
This model, 524k 2-stream | 85.52 | 155 ms | +62% (1.62×)
This model, 128k single-stream | ~111 | ~310 ms | +110% (2.10×)

Sanity-check benchmarks (small samples, full data in the model card):

Benchmark | n | Score
GSM8K (T=0, CoT, exact match) | 100 | 93%
MMLU (mixed subjects) | 100 | 53% (sample dragged down by hard subjects; tracks base)
HumanEval (syntactic check, not pass@1 exec) | 50 | 90%

What got quantized how

  • 768 routed-expert tensors (256 experts × {w1, w2, w3}): W4A16 INT4 group=128 sym, GPTQ (Frantar-style with Cholesky H⁻¹). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens captured from the running pasta-paul model — 17,701 MTP forward dumps, 473k tokens.
  • 5 attention projections: FP8_BLOCK (kept upstream's FP8 weights, just renamed scale → weight_scale to match pasta-paul's compressed-tensors convention).
  • Shared experts, e_proj, h_proj, norms, gate, attn_sink: BF16 / FP32.
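
For orientation, not the author's actual pipeline (which calibrated on MTP forward dumps captured from the running base model), here is a generic llmcompressor-style recipe that targets the same W4A16 INT4 group=128 symmetric scheme on the expert projections while leaving attention, shared experts, and gates alone. Module regexes, paths, and the calibration dataset are placeholders:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Sketch only: GPTQ the routed-expert linears to INT4 (the W4A16 preset uses
# group_size=128, symmetric) and skip attention / shared experts / gates / lm_head.
recipe = GPTQModifier(
    targets=[r"re:.*mlp\.experts\..*"],   # placeholder regex for the routed w1/w2/w3 projections
    scheme="W4A16",
    ignore=["lm_head", r"re:.*self_attn.*", r"re:.*shared_expert.*", r"re:.*gate.*"],
)

oneshot(
    model="path/to/base-quant",     # placeholder; the post started from pasta-paul's quant
    dataset="open_platypus",        # placeholder; the post used captured ultrachat_200k prompts
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)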

Max-Q specific fixes:

If you're on the Max-Q workstation cards specifically: you MUST pass --disable-custom-all-reduce.

vLLM's CustomAllreduce uses CUDA P2P (independent of NCCL_P2P_DISABLE), and on PCIe-only Max-Q topology it deadlocks at post-graph eager warmup.

Without the flag, the engine hangs at gpu_worker.py:619 with endless "No available shared memory broadcast block" warnings from shm_broadcast.py:681. The Server variant has NVLink and does not hit this.
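
Not from the post, but a quick way to see what CUDA P2P reports on your own box before deciding on the flag. Note that peer access being reported as available over PCIe is exactly the case where the custom all-reduce can still deadlock, so treat a True here as a reason to keep --disable-custom-all-reduce on Max-Q:

import torch

# Print whether CUDA reports peer access between each GPU pair.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access = {ok}")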

NCCL tuning that drops TTFT from ~155 ms to ~91 ms on Max-Q at zero decode-TPS cost:

NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 
NCCL_NTHREADS=512

How to run

Needs the patched vLLM fork. Vanilla doesn't load DSV4-Flash quants. Base workspace at https://github.com/pasta-paul/dsv4-flash-w4a16-fp8.

Apply the MTP patches on top.

vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --tensor-parallel-size 2 --kv-cache-dtype fp8 --block-size 256 \
  --max-model-len 524288 --max-num-seqs 2 \
  --gpu-memory-utilization 0.93 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --host 0.0.0.0 --port 8000
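
Once it's up, a minimal smoke test against the endpoint (the served model name defaults to the HF path used above; with the reasoning parser enabled the thinking lands in reasoning_content, so check both fields):

import requests

# Minimal smoke test for the serve command above.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 64,
    },
    timeout=300,
).json()
msg = resp["choices"][0]["message"]
# With --reasoning-parser deepseek_v4, thinking goes to reasoning_content, the answer to content.
print("content:", msg.get("content"))
print("reasoning_content:", msg.get("reasoning_content"))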

I also wrote an AGENTS.md runbook. Point Claude/Codex/Cursor at it and tell it "set this up", "verify hardware and get this model running", or similar. It goes through preflight → CUDA toolkit (no sudo, via conda) → patched vLLM build → download → patches → serve → smoke test.

Limitations

  • TP=2 only. TP=1 OOMs on a single RTX6000 pro; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
  • num_speculative_tokens capped at 1. DSV4 flash ships exactly one MTP head (num_nextn_predict_layers=1); higher values will not produce more drafts.
  • Reasoning parser caveat. With --reasoning-parser deepseek_v4, output splits into content and reasoning_content. Clients reading only content see empty strings on "thinking" responses.
  • MTP GPTQ skipped attention during calibration — see Future work in card.
  • Hardware tested: only Max-Q. Server variant + DGX Spark + H200 should work but I have not run them.

Request for the community

If you run this and the MTP draft acceptance rate comes out significantly different on your prompt distribution, please do comment with your domain and the rate (vLLM will log it as spec_decode_acceptance_rate).
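
If you'd rather not grep logs for it, the server's Prometheus endpoint is the easiest place to read the spec-decode counters. A minimal sketch; exact metric names vary between vLLM versions, so the filter string is an assumption:

import requests

# vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics on the same port.
text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)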

Credits

  • DeepSeek-AI for the base model
  • pasta-paul for the W4A16+FP8 quant + jasl/vllm serving stack (repo)

u/Blahblahblakha — 4 days ago

Is it worth upgrading from 2x RTX6kPro to 4x?

Hi All:

Earlier this year I built a new machine specifically for inference work. I went with 2x RTX6k Pro Max-Q to start with. I've mostly just been using Qwen3.6-35b-a3b, which is great, but I'm not really taking advantage of the 2 cards. There are plenty of much larger models like Kimi, DeepSeek, and the like, but I can't run those on 2 cards.

I think my workflow would benefit from some of these bigger models, but my question is, does upgrading from 2 to 4 cards make sense? It feels like many people jump straight up to 8 cards.

Do people who use 4x RTX6kPro cards feel like the models that run on that hardware are worthwhile? Are you comfortable at that level of VRAM?

Thanks for your thoughts!

u/MenuNo294 — 3 days ago

2× RTX PRO 6000, 192GB VRAM - MTP NVFP4 issues with vision

Hi guys,

I previously ran WSL2 and figured out how to run:

https://huggingface.co/sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

It’s NVFP4, MTP, with vision enabled. After a few extra environment tweaks I found via reddit (can’t find the post anymore), I got up to 350 t/s on this setup with 48 num_seqs and a 131072 context window. Fantastic setup. Here is the command I ran in WSL2:

cat > ~/vllm-setup/start-vllm.sh << 'EOF'
#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=0
export NCCL_SOCKET_IFNAME=lo
export NCCL_IGNORE_CPU_AFFINITY=1
export NCCL_TOPO_DUMP_FILE=/dev/null
export NCCL_NET_DISABLE_INTRA=0
export NCCL_CUMEM_ENABLE=0
source $HOME/vllm-setup/venv/bin/activate
python -m vllm.entrypoints.openai.api_server \
  --model $HOME/models/Qwen3.6-27B-Abliterated-NVFP4 \
  --served-model-name qwen3.6-27b \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key X \
  --trust-remote-code \
  --quantization modelopt \
  --max-model-len 131072 \
  --max-num-seqs 48 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.88 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
EOF
chmod +x ~/vllm-setup/start-vllm.sh
bash ~/vllm-setup/start-vllm.sh

This worked great! It took a lot of research to get it working so I also hope it helps someone else. NCCL in WSL2 was problematic I must say.

Now I've installed a brand-new headless Ubuntu box to move toward production, running vLLM in a Docker container. I've gotten as far as a working setup if I run with --language-model-only (stripping the vision capabilities). However, when I remove --language-model-only, it hangs as soon as vLLM reports that an MTP model was detected.

docker run --name huihui-qwen36-27b-nvfp4-tp2 \
--gpus '"device=0,1"' \
--shm-size=32g \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e NCCL_P2P_DISABLE=1 \
-e NCCL_SHM_DISABLE=0 \
-e NCCL_SOCKET_IFNAME=lo \
-e NCCL_IGNORE_CPU_AFFINITY=1 \
-e NCCL_TOPO_DUMP_FILE=/dev/null \
-e NCCL_NET_DISABLE_INTRA=0 \
-e NCCL_CUMEM_ENABLE=0 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-v /home/shadde/models/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP:/models/current:ro \
-p 8000:8000 \
vllm/vllm-openai:cu130-nightly \
--model /models/current \
--served-model-name qwen3.6-27b \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--language-model-only \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-model-len 131072 \
--gpu-memory-utilization 0.94 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

This works, but as soon as I remove --language-model-only it silently freezes at "MTP model detected". Just a freeze.
Any suggestions are welcome, thanks

u/quantier — 7 days ago
▲ 33

Qwen3.6-27B 8bit DFLASH performance vs num_speculative_tokens

I'm running Qwen3.6-27B 8bit on my RTX PRO 6000 Blackwell workstation edition and I was trying to figure out the optimal setting for `num_speculative_tokens` while using DFLASH. So I decided to run some benchmarks where I varied `num_speculative_tokens` from 1 to 20 to find the optimal value. Hopefully it's helpful to you guys!

Here are the results in text format:

🏆 FINAL RESULTS
===============================================

k  | Avg tok/s | ±std | Best?
---|-----------|------|------
 1 |      67.4 | ±0.1 |
 2 |      88.8 | ±0.1 |
 3 |     102.5 | ±0.8 |
 4 |     116.1 | ±0.1 |
 5 |     124.7 | ±0.1 |
 6 |     127.6 | ±0.1 |
 7 |     126.6 | ±0.1 |
 8 |     133.8 | ±0.1 |
 9 |     126.8 | ±0.4 |
10 |     136.8 | ±0.1 |
11 |     140.0 | ±0.3 | ← BEST
12 |     132.5 | ±0.2 |
13 |     137.8 | ±0.1 |
14 |     135.0 | ±3.9 |
15 |     136.7 | ±1.3 |
16 |     132.2 | ±0.2 |
17 |     129.8 | ±0.1 |
18 |     123.4 | ±0.1 |
19 |     123.8 | ±0.4 |
20 |     125.0 | ±0.1 |

🎯 Recommended: k = 11 (140.0 tok/s)
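
The benchmark harness itself isn't in the post; here is a minimal sketch of how one decode-throughput sample per k could be taken against the OpenAI-compatible endpoint (the model name matches the compose file below, the prompt and token counts are assumptions, and the server has to be restarted with a different num_speculative_tokens between samples):

import time
import requests

# One throughput sample against a running vLLM server.
payload = {
    "model": "qwen3.6-27b",          # served-model-name from the compose file below
    "prompt": "Explain speculative decoding in three paragraphs.",
    "max_tokens": 512,
    "temperature": 0,
}
start = time.time()
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=600).json()
elapsed = time.time() - start
generated = resp["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")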

Here's my vLLM setup:

  qwen-vllm: # ← Qwen3.6-27B via vLLM (OpenAI-compatible API)
    image: vllm/vllm-openai:latest
    container_name: qwen-vllm
    ipc: host
    shm_size: 32g                    # Critical for large context + Qwen3.6 performance
    ports:
      - "8000:8000"                  # OpenAI-compatible endpoint[](http://localhost:8000/v1)
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # Persists the ~55 GB model download
    environment:
      - HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      - HF_HUB_ENABLE_HF_TRANSFER=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # ← Change to 1 if you only want to use a single GPU
              capabilities: [ gpu ]
    command: >
      --model Qwen/Qwen3.6-27B-FP8
      --served-model-name qwen3.6-27b
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.90
      --max-model-len 262144
      --kv-cache-dtype auto
      --attention-backend flash_attn
      --max-num-batched-tokens 16384
      --max-num-seqs 24
      --trust-remote-code
      --enable-prefix-caching
      --enable-chunked-prefill
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 11}'
      -O3
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - hermes-net

u/dxplq876 — 3 days ago

Updated Discord Join Link?

Hi, does anyone have an updated invite to the Discord? The one in the subreddit description is expired, and the last one I found in a post was also expired. Invite links are usually set to expire after 7 or 14 days unless you check the specific flag to have them never expire.

u/sininspira — 8 days ago
▲ 3

Best local LLM for OpenClaw on RTX 6000 Pro? Trying to reduce GPT/Claude token costs

I’m joining a university this fall as an engineering assistant professor, and I’m planning to start integrating OpenClaw into our research workflows. I’ve already been using agentic coding tools heavily for a while, but I want to move toward more capable autonomous systems for both research and development.

I’m trying to figure out what the best local LLM setup would be on an NVIDIA RTX 6000 Pro (96 GB), particularly for:

  • coding / agentic engineering
  • technical writing

For people already running local setups: what models are actually working well right now?

I’m especially curious about how current local models compare against Claude Opus 4.7 and GPT-5.5 (are they much worse, or comparable?).

I’m a heavy LLM user, enough that I burn through Cursor limits very quickly (my $60 subscription got exhausted within ~3 days, and most of the time only Opus worked for my coding tasks). Because of that, I’m wondering whether investing in long-term local inference infrastructure makes more sense.

u/Silent_Cherry5086 — 3 days ago

Saitech Sold Me Defective RTX Pro 6000

They seem to be misquoting their own defective-product policies, and it reads to me like they don’t want to be held responsible for a defective GPU they sold me very recently. Or at the least, they won’t do anything unless they can convince NVIDIA to make them whole first.

Has anyone dealt with them before?

u/Southern-Round4731 — 3 days ago

A working model name and compose config would be much appreciated. Also, what numbers are you getting out of it? I ran lukealonso/GLM-5.1-NVFP4 a few days ago but it was only 1k PP and 33 tps gen. I tried newer docker images and they would just start using 100% of the GPUs by themselves after startup and wouldn't quit 🫩 I saw people posting tps in the ~100 range.

Hardware: RTX 6000 pros

Thank you in advance!

u/val_in_tech — 6 days ago