4-bit weight quantization with a log-spaced codebook (PBF4) — bnb + llama.cpp implementations

***Update: added more models + longer runs.***

Built a 4-bit weight quantization format called PBF4. The 16-entry codebook is sampled at every other level of an 8-bit log-polar ("PBF8") spine with irrational base φ+π and log step ln(8)/16; the layout is NF4-style: 7 negatives + 0 + 8 positives. No calibration: the same codebook is used for every tensor.
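For intuition, here's a rough Python sketch of the construction. The exact spine indexing and which levels survive the every-other-level downsampling are simplified assumptions here, not lifted from the repo:

```python
import math

# Sketch of a PBF4-style codebook. Assumes spine level k has magnitude
# (φ+π)^(-k·ln(8)/16) and that "every other level" means even k; the
# real PBF8 spine/selection may differ.
PHI = (1 + math.sqrt(5)) / 2
BASE = PHI + math.pi          # irrational base, ≈ 4.76
STEP = math.log(8) / 16       # log-domain step

def spine_level(k: int) -> float:
    """Magnitude of spine level k (assumed form)."""
    return BASE ** (-k * STEP)

# NF4-style layout: 7 negatives + 0 + 8 positives = 16 entries.
positives = [spine_level(2 * k) for k in range(8)]      # largest entry is 1.0
negatives = [-spine_level(2 * k) for k in range(1, 8)]  # 7 entries
PBF4_CODEBOOK = sorted(negatives + [0.0] + positives)
```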

Implementations in bitsandbytes (Python + CUDA/HIP, mirroring the fp4/nf4 paths) and in llama.cpp (a PBF-MX block format plus a multi-spine PBF-MX-T variant).
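To show the shape of that path, here's a minimal blockwise absmax quantize/dequantize against a fixed 16-entry codebook. Illustrative only: the block size is a guess, and the real bnb kernels pack two 4-bit indices per byte.

```python
import math
import torch

def quantize_blockwise_4bit(w: torch.Tensor, codebook: torch.Tensor,
                            blocksize: int = 64):
    # Per-block absmax scaling, then nearest-entry lookup in the codebook.
    flat = w.flatten().float()
    pad = (-flat.numel()) % blocksize
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scaled = blocks / absmax                                  # now in [-1, 1]
    idx = (scaled.unsqueeze(-1) - codebook).abs().argmin(-1)  # nearest entry
    return idx.to(torch.uint8), absmax

def dequantize_blockwise_4bit(idx, absmax, codebook, shape):
    # Dequant is just scale x table lookup, same structure as fp4/nf4.
    vals = codebook[idx.long()] * absmax
    return vals.flatten()[: math.prod(shape)].view(shape)
```

Pass in the PBF4 codebook as a torch tensor and that's essentially the whole format, modulo bit-packing.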

Per-tensor evaluation: 58 real weight tensors from 7 architectures (Qwen 0.5B, SmolLM-360M, TinyLlama, OLMo-1B, GPT-2, Granite-2B, Mamba-370M). PBF4 beats NF4 on 57/58 tensors on x²-weighted MSE (the metric that tracks matmul-output impact), with 20–28% error reductions. The trade-off: PBF4 is 24–31% worse on plain absolute error, since log spacing sacrifices small-value precision to better preserve the large values that dominate matmul outputs.
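Concretely, the two metrics (sketch; the exact normalization in the eval script may differ):

```python
import numpy as np

def x2_weighted_mse(w: np.ndarray, q: np.ndarray) -> float:
    # Squared error weighted by squared weight magnitude, so the large
    # weights that dominate matmul outputs dominate the score.
    return float(np.mean(w ** 2 * (w - q) ** 2))

def mean_abs_error(w: np.ndarray, q: np.ndarray) -> float:
    return float(np.mean(np.abs(w - q)))
```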

End-to-end perplexity on wikitext-2 (n_ctx=512, 30–80 chunks):

| Model | Scale | PBF-MX-T (bpw / PPL) | Q4_K_M (bpw / PPL) | Δ PPL (PBF - Q4) | Δ BPW (Q4 - PBF) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 4.78 / 29.60 | 5.09 / 23.54 | +6.05 | +0.31 |
| TinyLlama-1.1B | 1.1B | 4.45 / 9.68 | 4.85 / 9.19 | +0.49 | +0.40 |
| Granite-3.3-2B | 2B | 4.40 / 10.20 | 4.87 / 8.63 | +1.57 | +0.47 |
| Qwen2.5-7B | 7B | 4.47 / 6.23 | 4.91 / 5.99 | +0.23 | +0.44 |
| Mistral-7B | 7B | 4.35 / 5.61 | 4.83 / 5.50 | +0.11 | +0.48 |
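For anyone reproducing: PPL here is the standard llama.cpp perplexity run, something like the following (the binary is named llama-perplexity in recent builds; the model filename is a placeholder):

```
llama-perplexity -m model-pbf-mx-t.gguf -f wikitext-2-raw/wiki.test.raw -c 512
```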

Important caveat: Q4_K_M is mixed-precision: it keeps roughly a third of the weights at q6_K (embedding, lm_head, and per-layer attn_v / ffn_down). PBF-MX-T quantizes everything at 4-bit except output.weight. The bpw delta therefore understates how much more aggressive PBF-MX-T's 4-bit coverage is; a like-for-like comparison should narrow the PPL gap, but I haven't run that experiment yet.
