4-bit weight quantization with a log-spaced codebook (PBF4) — bnb + llama.cpp implementations

***Update: added more models + longer runs.***

Built a 4-bit weight quantization format called PBF4. The 16-entry codebook is sampled at every other level of an 8-bit log-polar ("PBF8") spine with irrational base φ+π and log step ln(8)/16; the layout is NF4-style: 7 negatives + 0 + 8 positives. No calibration: the same codebook is used for every tensor.
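For intuition, here's a rough Python sketch of the construction. The exact spine indexing and which levels survive the every-other-level downsampling are simplified assumptions here, not lifted from the repo:

```python
import math

# Sketch of a PBF4-style codebook. Assumes spine level k has magnitude
# (φ+π)^(-k·ln(8)/16) and that "every other level" means even k; the
# real PBF8 spine/selection may differ.
PHI = (1 + math.sqrt(5)) / 2
BASE = PHI + math.pi          # irrational base, ≈ 4.76
STEP = math.log(8) / 16       # log-domain step

def spine_level(k: int) -> float:
    """Magnitude of spine level k (assumed form)."""
    return BASE ** (-k * STEP)

# NF4-style layout: 7 negatives + 0 + 8 positives = 16 entries.
positives = [spine_level(2 * k) for k in range(8)]      # largest entry is 1.0
negatives = [-spine_level(2 * k) for k in range(1, 8)]  # 7 entries
PBF4_CODEBOOK = sorted(negatives + [0.0] + positives)
```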

Implementations in bitsandbytes (Python + CUDA/HIP, mirroring the fp4/nf4 paths) and in llama.cpp (a PBF-MX block format plus a multi-spine PBF-MX-T variant).
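To show the shape of that path, here's a minimal blockwise absmax quantize/dequantize against a fixed 16-entry codebook. Illustrative only: the block size is a guess, and the real bnb kernels pack two 4-bit indices per byte.

```python
import math
import torch

def quantize_blockwise_4bit(w: torch.Tensor, codebook: torch.Tensor,
                            blocksize: int = 64):
    # Per-block absmax scaling, then nearest-entry lookup in the codebook.
    flat = w.flatten().float()
    pad = (-flat.numel()) % blocksize
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scaled = blocks / absmax                                  # now in [-1, 1]
    idx = (scaled.unsqueeze(-1) - codebook).abs().argmin(-1)  # nearest entry
    return idx.to(torch.uint8), absmax

def dequantize_blockwise_4bit(idx, absmax, codebook, shape):
    # Dequant is just scale x table lookup, same structure as fp4/nf4.
    vals = codebook[idx.long()] * absmax
    return vals.flatten()[: math.prod(shape)].view(shape)
```

Pass in the PBF4 codebook as a torch tensor and that's essentially the whole format, modulo bit-packing.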

Per-tensor evaluation: 58 real weight tensors from 7 architectures (Qwen 0.5B, SmolLM-360M, TinyLlama, OLMo-1B, GPT-2, Granite-2B, Mamba-370M). PBF4 beats NF4 on 57/58 tensors on x²-weighted MSE (the metric that tracks matmul-output impact), with 20–28% error reductions. The trade-off: PBF4 is 24–31% worse on plain absolute error, since log spacing sacrifices small-value precision to better preserve the large values that dominate matmul outputs.
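Concretely, the two metrics (sketch; the exact normalization in the eval script may differ):

```python
import numpy as np

def x2_weighted_mse(w: np.ndarray, q: np.ndarray) -> float:
    # Squared error weighted by squared weight magnitude, so the large
    # weights that dominate matmul outputs dominate the score.
    return float(np.mean(w ** 2 * (w - q) ** 2))

def mean_abs_error(w: np.ndarray, q: np.ndarray) -> float:
    return float(np.mean(np.abs(w - q)))
```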

End-to-end perplexity on wikitext-2 (n_ctx=512, 30–80 chunks):

| Model | Scale | PBF-MX-T (bpw / PPL) | Q4_K_M (bpw / PPL) | Δ PPL (PBF - Q4) | Δ BPW (Q4 - PBF) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 4.78 / 29.60 | 5.09 / 23.54 | +6.05 | +0.31 |
| TinyLlama-1.1B | 1.1B | 4.45 / 9.68 | 4.85 / 9.19 | +0.49 | +0.40 |
| Granite-3.3-2B | 2B | 4.40 / 10.20 | 4.87 / 8.63 | +1.57 | +0.47 |
| Qwen2.5-7B | 7B | 4.47 / 6.23 | 4.91 / 5.99 | +0.23 | +0.44 |
| Mistral-7B | 7B | 4.35 / 5.61 | 4.83 / 5.50 | +0.11 | +0.48 |
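For anyone reproducing: PPL here is the standard llama.cpp perplexity run, something like the following (the binary is named llama-perplexity in recent builds; the model filename is a placeholder):

```
llama-perplexity -m model-pbf-mx-t.gguf -f wikitext-2-raw/wiki.test.raw -c 512
```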

Important caveat: Q4_K_M is mixed-precision: it keeps roughly a third of the weights at q6_K (embedding, lm_head, and per-layer attn_v / ffn_down). PBF-MX-T quantizes everything at 4-bit except output.weight. The bpw delta therefore understates how much more aggressive PBF-MX-T's 4-bit coverage is; a like-for-like comparison should narrow the PPL gap, but I haven't run that experiment yet.
