[P] GPU friendly lossless 12-bit BF16 format with 0.03% escape rate and 1 integer ADD decode works for AMD & NVIDIA
Hi everyone, I am from Australia : ) I just released a new research prototype.
It’s a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code.
For 99.97% of weights, decoding is just one integer ADD.
Byte-aligned split storage: a true 12 bits per weight, no 16-bit padding waste, and zero HBM read amplification.
Yes, 12 bits, not 11! The main idea was not just to “compress weights more”, but to make the format GPU-friendly enough to use directly during inference:
- sign + mantissa: exactly 1 byte per element
- group codes: two 4-bit nibbles packed into exactly 1 byte
- 1.33x smaller than BF16
- Fixed-rate 12 bits per weight, no entropy coding
- Zero precision loss: bit-perfect reconstruction
- Fused decode + matmul, so there is effectively no separate decompression stage
- Byte-aligned storage, no LUT, no bitstream parsing
- Works on both NVIDIA and AMD
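To make the layout concrete, here is a minimal NumPy round-trip sketch. The grouping details are my own assumptions for illustration, not taken from the repo: I assume the 4-bit code indexes 16 consecutive exponent values above a per-tensor base (`BASE_EXP` below is a made-up value, and `encode`/`decode` are hypothetical names), so recovering the exponent is literally one integer ADD, and the rare out-of-window weights that would take the escape path are simply asserted away here.

```python
import numpy as np

# Assumed (not from the repo): the 4-bit group code covers 16 consecutive
# BF16 exponents starting at BASE_EXP, chosen per tensor in the real format.
BASE_EXP = 120  # illustrative base; real exponents of weights cluster tightly

def encode(bf16_bits: np.ndarray):
    """Split uint16 BF16 bit patterns into sign+mantissa bytes and 4-bit codes."""
    sign = (bf16_bits >> 15) & 0x1
    exp = (bf16_bits >> 7) & 0xFF
    mant = bf16_bits & 0x7F
    # weights outside the 16-exponent window would take the escape path
    assert np.all((exp >= BASE_EXP) & (exp < BASE_EXP + 16)), "escape needed"
    sm = ((sign << 7) | mant).astype(np.uint8)   # exactly 1 byte per weight
    code = (exp - BASE_EXP).astype(np.uint8)     # 4 bits; two pack into a byte
    return sm, code

def decode(sm: np.ndarray, code: np.ndarray) -> np.ndarray:
    exp = code.astype(np.uint16) + BASE_EXP      # the single integer ADD
    sign = (sm.astype(np.uint16) & 0x80) << 8
    mant = sm.astype(np.uint16) & 0x7F
    return sign | (exp << 7) | mant

w = np.array([0x3F80, 0x3E00, 0xBF00], dtype=np.uint16)  # BF16 bit patterns
assert np.array_equal(decode(*encode(w)), w)             # bit-perfect round trip
```

In the real fused kernel the reconstruction happens in registers right before the matmul, so the decoded BF16 never touches HBM.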
Some results so far:
Single-user (B=1), RTX 5070 Ti
- Llama 2 7B: 64.7 tok/s (1.47x vs vLLM)
- Mistral 7B: 60.0 tok/s (1.10x vs vLLM)
- Llama 3.1 8B: 57.0 tok/s (vLLM OOM on 16 GB)
Multi-user (B=256), total tok/s
- Llama 2 7B: 2931 vs 1086 in vLLM (2.70x)
- Mistral 7B: 2554 vs 872 in vLLM (2.93x)
It also seems surprisingly stable across model types:
- Llama 3.1 405B: 0.034% escape rate
- Mixtral 8x7B: 0.050%
- SDXL UNet: 0.233%
- CogVideoX 2B: 0.128%
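If you want to sanity-check the escape rate on your own checkpoints, here is a rough sketch of the measurement. It assumes a single 16-exponent window chosen per tensor (the real format may choose windows at a finer granularity, so treat this as an upper-bound estimate, not the repo's exact logic):

```python
import numpy as np

def escape_rate(bf16_bits: np.ndarray) -> float:
    """Fraction of weights whose BF16 exponent falls outside the best
    single 16-consecutive-exponent window (these would need escapes)."""
    exp = (bf16_bits >> 7) & 0xFF                 # 8-bit BF16 exponent field
    hist = np.bincount(exp.ravel(), minlength=256)
    # weights covered by each window of 16 consecutive exponent values
    covered = np.convolve(hist, np.ones(16, dtype=np.int64), mode="valid")
    return 1.0 - covered.max() / bf16_bits.size
```

On a real checkpoint you would load each safetensors tensor, view its raw data as `uint16`, and average `escape_rate` over tensors weighted by element count.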
So far this is tested on BF16 safetensors only.
Repo: https://github.com/cenconq25/Turbo-Lossless
Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026).
Happy to hear criticism, edge cases, or reasons this idea won’t scale.
Thanks for your time : )