u/Connect-Concert-4016

I built a small tool so I stop fooling myself on long-context inference runs

I’ve been working on long-context inference/compression, and I kept running into a dumb but important problem:

It is easy to run a 64K context test that is not actually a clean 64K benchmark.

A model may have a native RoPE context of 32K while you ask it for 64K. Now the result depends on whether YaRN / rope scaling is configured correctly, whether the backend actually supports it, and whether you measured peak VRAM and retrieval behavior instead of just assuming it worked.
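
The core of that check is nothing fancy: read the model config and compare it to what you are about to ask for. A rough sketch (not the actual tool), assuming the Hugging Face config exposes max_position_embeddings and rope_scaling the way Qwen2.5's config.json does:

```python
from transformers import AutoConfig

def context_receipt(model_id: str, requested_ctx: int) -> None:
    cfg = AutoConfig.from_pretrained(model_id)
    native_ctx = cfg.max_position_embeddings            # native RoPE context
    rope_scaling = getattr(cfg, "rope_scaling", None)   # None if not configured

    print(f"native context:    {native_ctx:,}")
    print(f"requested context: {requested_ctx:,}")
    if requested_ctx > native_ctx:
        print("rope scaling (e.g. YaRN) is required for this run")
        print("rope_scaling configured:", rope_scaling is not None)

context_receipt("Qwen/Qwen2.5-7B-Instruct", 65536)
```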

So I built a small diagnostic command that prints a “model context receipt” before I treat anything as a benchmark.

Example:

fraqtl inspect Qwen/Qwen2.5-7B-Instruct --context 65536

For Qwen2.5-7B at 64K, it flags things like:

  • native context is 32,768
  • requested context is 65,536
  • YaRN / rope scaling is required
  • YaRN is not configured
  • estimated FP16 KV cache at 64K is about 3.76 GB (see the sketch after this list)
  • peak VRAM still needs to be measured
  • retrieval still needs to be tested
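
The KV cache figure is just arithmetic over the model config, so the receipt can compute it up front. A sketch of the estimate, assuming Qwen2.5-7B's published GQA shape (28 layers, 4 KV heads, head dim 128); it lands at roughly the 3.76 GB above:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values; FP16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Qwen2.5-7B shape (assumed from its config): 28 layers, 4 KV heads, head dim 128
est = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=65536)
print(f"estimated FP16 KV cache at 64K: {est / 1e9:.2f} GB")  # ~3.76 GB
```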

The point is not “this model works at 64K.”

The point is the opposite:

Before claiming anything, I want a receipt that says what is known, what is assumed, and what still needs to be tested.

I’m thinking of adding:

  • perplexity
  • needle-in-a-haystack / passkey retrieval
  • decode tok/sec
  • prefill tok/sec
  • peak VRAM (see the measurement sketch after this list)
  • batch concurrency
  • backend-specific notes for llama.cpp / vLLM / Transformers
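
For the VRAM and throughput items, the kind of measurement I have in mind is roughly the following. This is a Transformers-only sketch: it times generate() end to end rather than splitting prefill from decode, and llama.cpp / vLLM would need their own backend-specific counters.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# a real long-context prompt would go here; the point is the measurement, not the prompt
inputs = tok("some long-context prompt ...", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"tok/sec (prefill + decode): {new_tokens / elapsed:.1f}")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```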

Question for people doing inference or long-context evals:

What else would you want in this receipt before trusting a long-context run?