u/---NiKoS---

r/ollama

Ollama Gemma4:31b on 3090 - FP,Q8,Q4 Benchmark

I was looking for user benchmarks this morning to see what others have been able to get out of their 3090s. Nothing seemed to exist anywhere, so I had Claude run them.

In case anyone is interested:

Gemma 4 31B Dense — Flash Attention + Q4 KV Cache on RTX 3090 (24GB)

Two Ollama env vars completely transformed this model's usability. The dense model went from a 16K context ceiling at ~15 tok/s to ~30 tok/s all the way through 128K.

Before (FP16 KV, no Flash Attention):

| Context | tok/s | VRAM |
|---|---|---|
| 8K | 15.4 | 22,166 MiB |
| 16K | 15.4 | 23,590 MiB |
| 32K | 7.5 ⚠️ | 23,950 MiB |
| 64K | 3.8 ⚠️ | 23,660 MiB |

After (FA + Q4_0 KV Cache):

| Context | tok/s | VRAM |
|---|---|---|
| 8K | 29.8 | 20,960 MiB |
| 16K | 29.8 | 21,136 MiB |
| 32K | 29.6 | 21,528 MiB |
| 64K | 29.6 | 22,312 MiB |
| 100K | 29.5 | 23,246 MiB |
| 128K | 29.6 | 23,930 MiB |
| 200K | 14.1 ⚠️ | 23,630 MiB |
| 256K | 10.0 ⚠️ | 24,110 MiB |

Config (add to Ollama systemd service):

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q4_0
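If you're on systemd, the usual way to set these is a drop-in override via `sudo systemctl edit ollama` (a sketch — assumes your service is named `ollama`):

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama` to pick it up.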

Why it works so well on Gemma 4: only 10 of 60 layers use global attention with a full KV cache. The other 50 use a sliding window (512-1024 tokens), so their KV cache barely grows with context. Q4 quantization on an already-small KV cache keeps everything in VRAM through 128K with zero CPU offload.
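A back-of-envelope way to see the effect. The 10-global/50-sliding layer split is from above; the head counts and dimensions below are illustrative assumptions, not Gemma's actual config:

```python
def kv_cache_mib(context: int, n_global: int = 10, n_sliding: int = 50,
                 window: int = 1024, kv_heads: int = 16,
                 head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """Rough KV cache size in MiB for a hybrid global/sliding-window model.
    Head counts and dims are illustrative assumptions."""
    per_token = 2 * kv_heads * head_dim * bytes_per_elem   # K and V
    global_tokens = n_global * context                     # grows with context
    sliding_tokens = n_sliding * min(context, window)      # capped by window
    return (global_tokens + sliding_tokens) * per_token / 2**20

# FP16 (2 bytes/elem) vs q4_0 (~0.5 bytes/elem) at 128K context:
fp16 = kv_cache_mib(128 * 1024)
q4 = kv_cache_mib(128 * 1024, bytes_per_elem=0.5)
print(f"FP16: {fp16:.0f} MiB, Q4: {q4:.0f} MiB")
```

The point isn't the exact numbers — it's that only the 10 global layers scale with context, so doubling the context far less than doubles the cache, and Q4 shrinks what's left by ~4x.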

The Gemma 3 KV cache speed bug (which dropped tok/s by 80%) does not appear on Gemma 4 with Ollama 0.20.2.

Hardware: RTX 3090 24GB, Ollama 0.20.2, gemma4:31b Q4_K_M

u/---NiKoS--- — 23 hours ago