u/---NiKoS---

r/ollama

Ollama Gemma4:31b on 3090 - FP,Q8,Q4 Benchmark

I was looking for user benchmarks this morning to see what others have been able to get out of their 3090s. Nothing seemed to exist anywhere, so I had Claude run them.

In case anyone is interested:

Gemma 4 31B Dense — Flash Attention + Q4 KV Cache on RTX 3090 (24GB)

Two Ollama env vars completely transformed this model's usability. The dense model went from a 16K context ceiling at ~15 tok/s to ~30 tok/s all the way through 128K.

Before (FP16 KV, no Flash Attention):

| Context | tok/s | VRAM |
|---|---|---|
| 8K | 15.4 | 22,166 MiB |
| 16K | 15.4 | 23,590 MiB |
| 32K | 7.5 ⚠️ | 23,950 MiB |
| 64K | 3.8 ⚠️ | 23,660 MiB |

After (FA + Q4_0 KV Cache):

| Context | tok/s | VRAM |
|---|---|---|
| 8K | 29.8 | 20,960 MiB |
| 16K | 29.8 | 21,136 MiB |
| 32K | 29.6 | 21,528 MiB |
| 64K | 29.6 | 22,312 MiB |
| 100K | 29.5 | 23,246 MiB |
| 128K | 29.6 | 23,930 MiB |
| 200K | 14.1 ⚠️ | 23,630 MiB |
| 256K | 10.0 ⚠️ | 24,110 MiB |

Config (add to Ollama systemd service):

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q4_0
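If you're on systemd, the usual way to set these is a drop-in override via `sudo systemctl edit ollama` (a sketch — assumes your service is named `ollama`):

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama` to pick it up.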

Why it works so well on Gemma 4: only 10 of 60 layers use global attention with a full KV cache. The other 50 use a sliding window (512-1024 tokens), so their KV cache barely grows with context. Q4 quantization on an already-small KV cache keeps everything in VRAM through 128K with zero CPU offload.
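A back-of-envelope way to see the effect. The 10-global/50-sliding layer split is from above; the head counts and dimensions below are illustrative assumptions, not Gemma's actual config:

```python
def kv_cache_mib(context: int, n_global: int = 10, n_sliding: int = 50,
                 window: int = 1024, kv_heads: int = 16,
                 head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """Rough KV cache size in MiB for a hybrid global/sliding-window model.
    Head counts and dims are illustrative assumptions."""
    per_token = 2 * kv_heads * head_dim * bytes_per_elem   # K and V
    global_tokens = n_global * context                     # grows with context
    sliding_tokens = n_sliding * min(context, window)      # capped by window
    return (global_tokens + sliding_tokens) * per_token / 2**20

# FP16 (2 bytes/elem) vs q4_0 (~0.5 bytes/elem) at 128K context:
fp16 = kv_cache_mib(128 * 1024)
q4 = kv_cache_mib(128 * 1024, bytes_per_elem=0.5)
print(f"FP16: {fp16:.0f} MiB, Q4: {q4:.0f} MiB")
```

The point isn't the exact numbers — it's that only the 10 global layers scale with context, so doubling the context far less than doubles the cache, and Q4 shrinks what's left by ~4x.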

The Gemma 3 KV cache speed bug (which dropped tok/s by 80%) does not appear on Gemma 4 with Ollama 0.20.2.

Hardware: RTX 3090 24GB, Ollama 0.20.2, gemma4:31b Q4_K_M

u/---NiKoS--- — 23 hours ago