Ollama gemma4:31b on 3090 - FP16/Q8/Q4 KV Cache Benchmark
I was looking for user benchmarks this morning to see what others have been able to do on their 3090s. Nothing seemed to exist anywhere, so I had Claude run them.
In case anyone is interested:
Gemma 4 31B Dense — Flash Attention + Q4 KV Cache on RTX 3090 (24GB)
Two Ollama environment variables completely transformed this model's usability: the dense model went from a practical 16K context ceiling at ~15 tok/s to full speed all the way through 128K.
Before (FP16 KV, no Flash Attention):
| Context | tok/s | VRAM |
|---|---|---|
| 8K | 15.4 | 22,166 MiB |
| 16K | 15.4 | 23,590 MiB |
| 32K | 7.5 ⚠️ | 23,950 MiB |
| 64K | 3.8 ⚠️ | 23,660 MiB |
After (FA + Q4_0 KV Cache):
| Context | tok/s | VRAM |
|---|---|---|
| 8K | 29.8 | 20,960 MiB |
| 16K | 29.8 | 21,136 MiB |
| 32K | 29.6 | 21,528 MiB |
| 64K | 29.6 | 22,312 MiB |
| 100K | 29.5 | 23,246 MiB |
| 128K | 29.6 | 23,930 MiB |
| 200K | 14.1 ⚠️ | 23,630 MiB |
| 256K | 10.0 ⚠️ | 24,110 MiB |
Config (add to the Ollama systemd service):

```
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q4_0
```
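On a systemd install, these are usually set via a drop-in override rather than by editing the unit file directly. A minimal sketch (assumes the service is named `ollama`, as in the standard Linux install):

```shell
# Open an override file for the ollama service
sudo systemctl edit ollama

# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q4_0"

# Then reload and restart so the new environment takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```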
Why it works so well on Gemma 4: only 10 of 60 layers use global attention with a full KV cache. The other 50 use sliding-window attention (512-1024 tokens), so their KV cache barely grows with context. Q4 quantization on an already-small KV cache keeps everything in VRAM through 128K with zero CPU offload.
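The effect of that layer split can be sketched with back-of-envelope arithmetic. The head counts and dimensions below are hypothetical placeholders (the source only gives the 10-global / 50-sliding layer split); the point is that the 50 windowed layers cap out at the window size, so the 10 global layers dominate cache growth:

```python
# Rough KV-cache size estimate for a mixed global/sliding-window model.
# ASSUMED (not from the post): 8 KV heads, head_dim 128, and ~0.5
# bytes/element for a q4_0 cache (4-bit weights, overhead ignored).
def kv_bytes(tokens, n_layers, kv_heads=8, head_dim=128, bytes_per_elem=0.5):
    # Factor of 2 covers both the K and V tensors per layer.
    return 2 * n_layers * kv_heads * head_dim * tokens * bytes_per_elem

ctx = 128_000          # full context window
window = 1024          # sliding-window span (upper end of 512-1024)

global_kv = kv_bytes(ctx, n_layers=10)       # 10 global layers see all 128K tokens
sliding_kv = kv_bytes(window, n_layers=50)   # 50 SWA layers cap at the window size

total_gib = (global_kv + sliding_kv) / 2**30
print(f"global: {global_kv / 2**30:.2f} GiB, "
      f"sliding: {sliding_kv / 2**30:.3f} GiB, total: {total_gib:.2f} GiB")
```

Under these assumptions the 50 sliding-window layers contribute only tens of MiB at 128K context, while the 10 global layers account for essentially all of the roughly 1 GiB cache, which is why the cache stays VRAM-resident.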
The Gemma 3 KV cache speed bug (which dropped tok/s by 80%) does not appear on Gemma 4 with Ollama 0.20.2.
Hardware: RTX 3090 24GB, Ollama 0.20.2, gemma4:31b Q4_K_M