
I built a small tool so I stop fooling myself on long-context inference runs
I’ve been working on long-context inference/compression, and I kept running into a dumb but important problem:
It is easy to run a 64K context test that is not actually a clean 64K benchmark.
A model may have a native RoPE context of 32K while you request 64K. The result then depends on whether YaRN / RoPE scaling is configured correctly, whether the backend supports it, and whether you actually measured peak VRAM and retrieval behavior instead of assuming it worked.
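The first two checks are cheap to do by hand. Here is a minimal sketch of the idea, assuming a Hugging Face model config and the usual attribute names (`max_position_embeddings`, `rope_scaling`); `context_receipt` is a hypothetical helper for illustration, not the fraqtl implementation:

```python
from transformers import AutoConfig

def context_receipt(model_id: str, requested_ctx: int) -> None:
    # Read the published config and compare native vs requested context.
    cfg = AutoConfig.from_pretrained(model_id)
    native_ctx = getattr(cfg, "max_position_embeddings", None)
    rope_scaling = getattr(cfg, "rope_scaling", None)  # None if not configured

    print(f"native context:    {native_ctx}")
    print(f"requested context: {requested_ctx}")
    if native_ctx is not None and requested_ctx > native_ctx:
        print("RoPE scaling required "
              f"(factor ~{requested_ctx / native_ctx:.1f})")
        print(f"rope_scaling configured: {rope_scaling}")

context_receipt("Qwen/Qwen2.5-7B-Instruct", 65536)
```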
So I built a small diagnostic command that prints a “model context receipt” before I treat anything as a benchmark.
Example:

```
fraqtl inspect Qwen/Qwen2.5-7B-Instruct --context 65536
```
For Qwen2.5-7B at 64K, it flags things like:
- native context is 32,768
- requested context is 65,536
- YaRN / rope scaling is required
- YaRN is not configured
- estimated FP16 KV cache at 64K is about 3.76 GB (the arithmetic is sketched after this list)
- peak VRAM still needs to be measured
- retrieval still needs to be tested
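The KV cache number falls out of standard arithmetic rather than measurement. A worked sketch, using Qwen2.5-7B's published shape (28 layers, 4 KV heads under GQA, head_dim 128; verify against the model's config.json):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
def kv_cache_bytes(seq_len: int, layers: int = 28, kv_heads: int = 4,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

print(kv_cache_bytes(65536) / 1e9)  # -> 3.758 GB, matching the ~3.76 GB above
```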
The point is not “this model works at 64K.”
The point is the opposite:
Before claiming anything, I want a receipt that says what is known, what is assumed, and what still needs to be tested.
I’m thinking of adding:
- perplexity
- needle-in-a-haystack / passkey retrieval (rough harness sketched after this list)
- decode tok/sec
- prefill tok/sec
- peak VRAM
- batch concurrency
- backend-specific notes for llama.cpp / vLLM / Transformers
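For the retrieval piece, even a crude passkey check catches broken rope scaling. A rough, backend-agnostic sketch: `generate` is a placeholder for whatever inference call you use (llama.cpp / vLLM / Transformers), and the tokens-per-repeat estimate is approximate, so treat this as a sanity check rather than a full needle benchmark:

```python
import random

def passkey_prompt(target_tokens: int, depth: float) -> tuple[str, str]:
    # Bury a random passkey at a given relative depth in filler text.
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. "  # roughly 8 tokens per repeat
    n = max(1, target_tokens // 8)
    cut = int(n * depth)
    haystack = (filler * cut
                + f"The passkey is {passkey}. Remember it. "
                + filler * (n - cut))
    question = "\nWhat is the passkey? Answer with the number only:"
    return haystack + question, passkey

def run_passkey(generate, ctx_tokens: int, depths=(0.1, 0.5, 0.9)) -> None:
    # Test retrieval at several depths; a clean long-context setup passes all.
    for d in depths:
        prompt, passkey = passkey_prompt(ctx_tokens, d)
        ok = passkey in generate(prompt)
        print(f"depth {d:.0%}: {'PASS' if ok else 'FAIL'}")
```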
Question for people doing inference or long-context evals:
What else would you want in this receipt before trusting a long-context run?