u/Connect-Concert-4016

I built a small tool so I stop fooling myself on long-context inference runs

I’ve been working on long-context inference/compression, and I kept running into a dumb but important problem:

It is easy to run a 64K context test that is not actually a clean 64K benchmark.

A model may have a native RoPE context of 32K while you ask it for 64K. Now the result depends on whether YaRN / rope scaling is configured correctly, whether the backend actually supports it, and whether you measured peak VRAM and retrieval behavior instead of just assuming it worked.
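
The core of that check is nothing fancy: read the model config and compare it to what you are about to ask for. A rough sketch (not the actual tool), assuming the Hugging Face config exposes max_position_embeddings and rope_scaling the way Qwen2.5's config.json does:

```python
from transformers import AutoConfig

def context_receipt(model_id: str, requested_ctx: int) -> None:
    cfg = AutoConfig.from_pretrained(model_id)
    native_ctx = cfg.max_position_embeddings            # native RoPE context
    rope_scaling = getattr(cfg, "rope_scaling", None)   # None if not configured

    print(f"native context:    {native_ctx:,}")
    print(f"requested context: {requested_ctx:,}")
    if requested_ctx > native_ctx:
        print("rope scaling (e.g. YaRN) is required for this run")
        print("rope_scaling configured:", rope_scaling is not None)

context_receipt("Qwen/Qwen2.5-7B-Instruct", 65536)
```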

So I built a small diagnostic command that prints a “model context receipt” before I treat anything as a benchmark.

Example:

fraqtl inspect Qwen/Qwen2.5-7B-Instruct --context 65536

For Qwen2.5-7B at 64K, it flags things like:

  • native context is 32,768
  • requested context is 65,536
  • YaRN / rope scaling is required
  • YaRN is not configured
  • estimated FP16 KV cache at 64K is about 3.76 GB (see the sketch after this list)
  • peak VRAM still needs to be measured
  • retrieval still needs to be tested
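
The KV cache figure is just arithmetic over the model config, so the receipt can compute it up front. A sketch of the estimate, assuming Qwen2.5-7B's published GQA shape (28 layers, 4 KV heads, head dim 128); it lands at roughly the 3.76 GB above:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values; FP16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Qwen2.5-7B shape (assumed from its config): 28 layers, 4 KV heads, head dim 128
est = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=65536)
print(f"estimated FP16 KV cache at 64K: {est / 1e9:.2f} GB")  # ~3.76 GB
```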

The point is not “this model works at 64K.”

The point is the opposite:

Before claiming anything, I want a receipt that says what is known, what is assumed, and what still needs to be tested.

I’m thinking of adding:

  • perplexity
  • needle-in-a-haystack / passkey retrieval
  • decode tok/sec
  • prefill tok/sec
  • peak VRAM (see the measurement sketch after this list)
  • batch concurrency
  • backend-specific notes for llama.cpp / vLLM / Transformers
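
For the VRAM and throughput items, the kind of measurement I have in mind is roughly the following. This is a Transformers-only sketch: it times generate() end to end rather than splitting prefill from decode, and llama.cpp / vLLM would need their own backend-specific counters.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# a real long-context prompt would go here; the point is the measurement, not the prompt
inputs = tok("some long-context prompt ...", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"tok/sec (prefill + decode): {new_tokens / elapsed:.1f}")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```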

Question for people doing inference or long-context evals:

What else would you want in this receipt before trusting a long-context run?