
Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark
Just got Gemma 4 31B running at full 256K context on a single RTX 5090 using TurboQuant KV cache compression.
System Specs
| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (32GB VRAM) |
| CPU | AMD Ryzen 9 9950X3D (16-core) |
| RAM | 64GB DDR5 |
| OS | Windows 11 |
Setup
- Model: `gemma-4-31B-it-UD-Q4_K_XL` from Unsloth (17.46 GiB)
- Build: TheTom/llama-cpp-turboquant, branch `feature/turboquant-kv-cache`, merged with latest upstream master for Gemma 4 support
- KV cache: `turbo3` (3-bit PolarQuant + Hadamard rotation, ~4.5x compression vs f16)
- Config: `--n-gpu-layers 99 --no-mmap --flash-attn on --cache-type-k turbo3 --cache-type-v turbo3`
Benchmark Results
| Test | Speed (t/s) |
|---|---|
| pp4096 | 3,362.71 |
| pp16384 | 3,047.00 |
| pp65536 | 2,077.96 |
| pp131072 | 1,428.80 |
| pp262144 | 899.55 |
| tg128 | 61.51 |
- VRAM usage at 262K: 27.7 GB / 32 GB (4.3 GB headroom)
- GPU temp: 78-80°C at 575W (some thermal throttling occurred during 262K runs, actual unthrottled speed likely ~950+ t/s... maybe)
Key Takeaways
256K full context fits on a single 5090 — The turbo3 KV cache quantizes K/V to 3 bits per value (~4.5x smaller than f16 once quantization overhead is counted) with near-zero quality loss (based on the TurboQuant paper, arXiv 2504.19874). Without it, the f16 KV cache alone at 256K would overflow 32GB of VRAM.
Prompt processing degrades predictably — each clean 4x context step in the table (16K→64K, 64K→256K) costs roughly 1.5-2.3x in throughput, as the O(n²) attention term grows to dominate the linear per-token cost.
Token generation is constant — 61.5 t/s regardless of context length. Memory bandwidth bound.
Gemma 4 support required fixes — Had to fix an MSVC bug in llama.cpp where `std::transform` with a `(const bool*)` cast fails to correctly read GGUF bool arrays beyond ~48 elements in Release builds. This breaks the SWA (sliding window attention) layer pattern for Gemma 4's hybrid attention architecture. Fix: replace it with a manual `uint8_t*` loop.
Build Notes (Windows/MSVC)
If you're building TheTom's TurboQuant fork on Windows:
- `ggml-turbo-quant.c` — Add `#define _USE_MATH_DEFINES` before `#include <math.h>` (MSVC doesn't define `M_PI` by default)
- `ggml-cpu/ops.cpp` — Add `extern "C" int turbo3_cpu_wht_group_size;` at file scope (C/C++ linkage mismatch)
- `llama-model-loader.cpp` — Replace the `std::transform((const bool*)...)` in `get_arr()` with a manual `uint8_t*` loop (MSVC optimization bug with bool-pointer casting)
- Build with `-DBUILD_SHARED_LIBS=OFF` to avoid DLL symbol-export issues with the turbo globals
- Use `-DCMAKE_CUDA_ARCHITECTURES=120a` for the RTX 5090 (`sm_120a` required for MXFP4 tensor core instructions)