
quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)
URL: https://github.com/quantumaikr/quant.cpp
Title (≤80 chars)
Show HN: quant.cpp – 7x longer LLM context via KV cache compression, pure C
Post
I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) with one goal: extend context length without adding hardware.
The key insight: LLM inference memory is dominated by the KV cache, not model weights. Compressing the KV cache to 4-bit keys + Q4 values gives 6.9x memory reduction with negligible quality loss.
Real numbers on a 16GB Mac (M1 Pro):
- Llama 3.2 3B: ~50K tokens (FP16 KV, llama.cpp) -> ~350K tokens (compressed KV, quant.cpp), 6.9x
- Gemma 4 26B-A4B (MoE): ~4K tokens -> ~30K tokens, 6.9x
How it works:
- Keys: uniform 4-bit min-max quantization per 128-element block
- Values: Q4 nibble quantization with per-block scales
- Delta mode: stores key[t] - key[t-1] instead of absolute keys (like video P-frames), which makes 3-bit keys viable at +1.3% PPL
- QK-norm aware: models like Gemma 4 automatically fall back to FP32 keys + Q4 values, since their sparse key distributions break low-bit quantization
Quality (WikiText-2 PPL, SmolLM2 1.7B):
- FP32 baseline: 14.63
- 4-bit K + Q4 V: 14.57 (-0.4%, slightly better than baseline)
- Delta 3-bit K + Q4 V: 14.82 (+1.3%)
vs llama.cpp Q4 KV: at the same bit budget, llama.cpp's Q4_0 KV cache degrades PPL by +10.6%, while quant.cpp stays at baseline (14.57 vs 14.63). Roughly an order of magnitude less degradation.
Code philosophy: 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header quant.h (15K LOC) you can drop into any C project.
Supported models: Llama 3.2, Qwen 3.5, Gemma 3/4, MoE (128 experts).
./quant model.gguf -p "hello" -k uniform_4b -v q4 # that's it
Feedback welcome. Particularly interested in: (1) what context length you'd need for your use case, (2) which models to prioritize next.