u/Suitable-Song-302

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)


URL: https://github.com/quantumaikr/quant.cpp

Title (≤80 chars)

Show HN: quant.cpp – 7x longer LLM context via KV cache compression, pure C

Post

I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) with one goal: extend context length without adding hardware.

The key insight: LLM inference memory is dominated by the KV cache, not model weights. Compressing the KV cache to 4-bit keys + Q4 values gives 6.9x memory reduction with negligible quality loss.

Real numbers on a 16GB Mac (M1 Pro):

| Model | FP16 KV (llama.cpp) | Compressed KV (quant.cpp) | Gain |
|---|---|---|---|
| Llama 3.2 3B | ~50K tokens | ~350K tokens | 6.9x |
| Gemma 4 26B-A4B (MoE) | ~4K tokens | ~30K tokens | 6.9x |

How it works:

- Keys: uniform 4-bit min-max quantization per 128-element block
- Values: Q4 nibble quantization with per-block scales
- Delta mode: store key[t] - key[t-1] instead of absolute keys (like video P-frames), enabling 3-bit keys at +1.3% PPL
- QK-norm aware: models like Gemma 4 automatically fall back to FP32 keys + Q4 values, since their sparse key distributions break low-bit quantization

Quality (WikiText-2 PPL, SmolLM2 1.7B):

- FP32 baseline: 14.63
- 4-bit K + Q4 V: 14.57 (+0.0%)
- Delta 3-bit K + Q4 V: 14.82 (+1.3%)

For comparison, llama.cpp's Q4_0 KV cache gives +10.6% PPL on the same test; quant.cpp gives +0.0% at the same bit budget, roughly 10x less degradation.

Code philosophy: 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header quant.h (15K LOC) you can drop into any C project.

Supported models: Llama 3.2, Qwen 3.5, Gemma 3/4, MoE (128 experts).

./quant model.gguf -p "hello" -k uniform_4b -v q4 # that's it

Feedback welcome. Particularly interested in: (1) what context length you'd need for your use case, (2) which models to prioritize next.
