I posted here and here about debugging rare nonsensical outputs in vLLM, and I wanted to share what the issue turned out to be in the end.
Turns out it was more about how requests were handled than about the model or the kernels.
In my last post I explained how we made the bug reproducible under cache pressure, so the next step was to compare behaviour across different vLLM versions. The issue only appeared in the newer scheduler implementation, which narrowed the scope quite a bit.
The main change in that version was how requests were split between the prefill and decode phases, so we needed to understand what was happening at runtime. We added request-level tracking through the forward pass: each request got an ID, and we traced it through execution so we could see exactly how it moved through the system. This is where things started to break down.
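For anyone curious what that kind of tracing looks like, here is a minimal sketch in plain Python. The `TracedRequest` class, the `mark` hook, and the phase names are illustrative assumptions for this post, not vLLM's actual internals; the point is just that a stable ID plus a per-request phase log makes the routing visible.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("request_trace")


class TracedRequest:
    """Illustrative request wrapper with a stable trace ID (not vLLM's API)."""

    def __init__(self, prompt: str):
        self.trace_id = uuid.uuid4().hex[:8]
        self.prompt = prompt
        self.phases: list[str] = []

    def mark(self, phase: str) -> None:
        # Called at each hypothetical hook point, e.g. "prefill" or "decode".
        self.phases.append(phase)
        logger.info("req=%s phase=%s history=%s", self.trace_id, phase, self.phases)


req = TracedRequest("Why is the sky blue?")
req.mark("decode")  # a brand-new request whose first recorded phase is decode is the red flag
print("prefill" in req.phases)  # False -> this request never went through prefill
```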
Some new requests were being routed directly into the decode phase instead of going through prefill.
In practice that means they weren't starting from a clean slate. Instead, they were picking up existing SSM state left behind by previous sequences.
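Here is a toy sketch of why skipping prefill matters, assuming a simple per-slot state cache (the slot layout and function names are hypothetical, not vLLM's): if decode starts without the reset that prefill would normally perform, the request inherits whatever the previous occupant of that slot left behind.

```python
# Hypothetical per-slot state cache; not vLLM's actual data structure.
STATE_DIM = 4
state_cache = {0: [0.9, -1.2, 0.3, 2.1]}  # stale state from a finished sequence


def start_decode(slot: int, went_through_prefill: bool) -> list[float]:
    """Return the state a request decodes from in this toy model."""
    if went_through_prefill:
        # Prefill (re)initialises the slot before decoding starts.
        state_cache[slot] = [0.0] * STATE_DIM
    return state_cache[slot]


print(start_decode(slot=0, went_through_prefill=True))   # [0.0, 0.0, 0.0, 0.0]

state_cache[0] = [0.9, -1.2, 0.3, 2.1]                   # simulate the leftover state again
print(start_decode(slot=0, went_through_prefill=False))  # stale values leak into the new request
```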
Because the state is recurrent, the impact wasn't limited to just one token. Once the wrong state got loaded, it influenced the entire generation. This explains why the outputs looked fully corrupted rather than slightly degraded.
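A tiny scalar recurrence makes this concrete. It stands in for the SSM update and is not the model's actual state equation: once the initial state is wrong, the error is carried into every subsequent step rather than washing out after one token.

```python
def run_recurrence(h0: float, inputs: list[float], a: float = 0.9) -> list[float]:
    """Toy scalar recurrence h_t = a * h_{t-1} + x_t, standing in for SSM state."""
    h, outputs = h0, []
    for x in inputs:
        h = a * h + x
        outputs.append(h)
    return outputs


xs = [1.0, 0.5, -0.2, 0.3]
clean = run_recurrence(h0=0.0, inputs=xs)   # starting from a zeroed state
stale = run_recurrence(h0=5.0, inputs=xs)   # starting from a leftover state

# The error never disappears; it is carried through every step:
# diff_t = a**t * (stale_h0 - clean_h0)
diffs = [s - c for s, c in zip(stale, clean)]
print(diffs)  # [4.5, 4.05, 3.645, 3.2805] -> every generated token is affected
```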
It also explains why the logprobs were showing high confidence on incorrect tokens. The model was behaving consistently given the state it received; the problem was that the state itself was wrong.
The takeaway is that correctness issues don't always come from the model; they can come from how requests are orchestrated, especially when stateful components are involved and the system is under pressure.