
i wanted to see how fast i could make qwen3.6 35b run on a single h100, so i put together an sglang setup for it. it exposes an openai-compatible api and also works with claude code through anthropic-compatible routing from the connect tab.
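since the api is openai-compatible, anything that already speaks that protocol can point at it. a minimal sketch, assuming sglang's default port of 30000 and a placeholder model name:

```python
# minimal sketch: querying the server through its openai-compatible api.
# localhost:30000 is sglang's default port; the model name below is a
# placeholder, use whatever name the server actually reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

resp = client.chat.completions.create(
    model="qwen3.6-35b-fp8",  # placeholder
    messages=[{"role": "user", "content": "write a haiku about h100s"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```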
the model is an uncensored fp8 qwen3.6 35b. the setup is the result of a bunch of sweeps and failed experiments that eventually landed on a config that felt worth keeping. i tried different decode settings, cache settings, speculative decoding variants, and backend choices, plus a few paths that looked promising but ended up slower.
the main thing that worked was dflash speculative decoding with a matched draft model. the draft model predicts several tokens ahead and the target model verifies them all in one forward pass, so when acceptance is good the server gets multiple tokens out of a single larger step instead of grinding forward one token at a time. that is where a lot of the speed comes from.
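to make the draft-and-verify loop concrete, here is a toy, fully self-contained sketch of one speculative step. none of this is sglang's actual implementation, and one character stands in for one token; it just shows the shape of the algorithm:

```python
# toy sketch of speculative decoding. the "models" are stand-in functions:
# the draft mostly agrees with the target but is wrong near the end.
TARGET_TEXT = "the quick brown fox jumps over the lazy dog"
DRAFT_TEXT  = "the quick brown fox jumps over the hazy fog"

def target_next(ctx: str) -> str:
    """stand-in for the big model's greedy next token (one char = one 'token')."""
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else ""

def draft_next(ctx: str) -> str:
    """stand-in for the small draft model; cheap but sometimes wrong."""
    return DRAFT_TEXT[len(ctx)] if len(ctx) < len(DRAFT_TEXT) else ""

def speculative_step(ctx: str, k: int = 4) -> str:
    # 1. the draft proposes k tokens autoregressively (k cheap small-model calls)
    proposal = ctx
    for _ in range(k):
        proposal += draft_next(proposal)
    # 2. the target checks every drafted position; in a real engine this is
    #    ONE batched forward pass, which is where the speedup comes from
    verify = [target_next(proposal[: len(ctx) + i]) for i in range(k + 1)]
    # 3. keep drafted tokens until the first disagreement, then take the
    #    target's own token there (or a free bonus token if everything agreed)
    out = ""
    for i in range(k + 1):
        out += verify[i]
        if i == k or not verify[i] or proposal[len(ctx) + i] != verify[i]:
            break
    return ctx + out

ctx, passes = "", 0
while len(ctx) < len(TARGET_TEXT):
    ctx = speculative_step(ctx)
    passes += 1
print(f"{len(TARGET_TEXT)} tokens in {passes} target passes: {ctx!r}")
```

on these toy strings it finishes 43 "tokens" in about 10 target passes instead of 43, which is the whole trick: the output is identical to what the target alone would produce, it just arrives in fewer big steps.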
fp8 weights and an fp8 kv cache keep memory pressure down. prefix caching helps repeated prompts and claude-code-style sessions. faster attention and moe backends matter a lot on h100. prefill and decode need different tuning because prompt ingestion is mostly compute-bound while token generation is mostly memory-bandwidth-bound, so they stress the system in different ways.
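most of those knobs map directly to sglang launch flags. here is a rough sketch of the kind of launch line this translates to, written as a python subprocess call. the flag names are from sglang's docs, but the paths and values are placeholders rather than the actual config, and EAGLE is shown as a stand-in for the speculative algorithm:

```python
# rough sketch of the launch flags these knobs map to. flag names are from
# sglang's docs; paths and values are placeholders, and EAGLE stands in for
# the dflash speculative variant here.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "path/to/qwen-fp8",                  # placeholder
    "--quantization", "fp8",                             # fp8 weights
    "--kv-cache-dtype", "fp8_e5m2",                      # fp8 kv cache
    "--attention-backend", "flashinfer",                 # fast attention on h100
    "--speculative-algorithm", "EAGLE",                  # stand-in, see note above
    "--speculative-draft-model-path", "path/to/draft",   # placeholder
    "--speculative-num-steps", "3",
    "--speculative-num-draft-tokens", "4",
    "--chunked-prefill-size", "8192",                    # prefill tuned separately
    "--port", "30000",
    # prefix caching (the radix cache) is on by default in sglang
])
```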
on the best runs i saw normal prose at around 250+ tok/s decode and code-style generations at over 400 tok/s on one h100 (drafted tokens tend to get accepted at a higher rate on code, which likely explains the gap). i think there is still more headroom with better speculative decoding sweeps and deeper kernel work, but this is probably where i’m going to leave this version for now.
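if you want to sanity-check numbers like these yourself, here is a rough sketch of a decode tok/s measurement against the endpoint. the model name is a placeholder, and it counts stream chunks as tokens (usually one token per chunk), so treat the result as approximate:

```python
# rough decode tok/s check against the local endpoint. counts streamed
# chunks as tokens, so the number is approximate; model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="qwen3.6-35b-fp8",  # placeholder
    messages=[{"role": "user", "content": "write a merge sort in rust"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.time() - start
print(f"~{tokens / elapsed:.0f} tok/s (chunk-approximate)")
```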
if you want to try it out, here is the link. would love feedback on it. it is uncensored so you could literally ask ANYTHING.