u/Critical-Entry3377

Comparing tokens per second of common models

I benchmarked prompt eval tokens per second on my Ollama models.

Benchmark Results

Prompt:

```text
You are an expert software engineer. Write a comprehensive, production-ready Python implementation of a rate limiter using the sliding window algorithm. Include:

1. A class-based design with proper encapsulation
2. Unit tests using pytest
3. Type hints throughout
4. Comprehensive docstrings
5. Error handling for edge cases
6. Performance considerations for high-throughput scenarios

Make the implementation exactly 500 words in explanation length, with detailed comments explaining each decision.
```

| Model Name | Tokens per Second |
|---|---|
| qwen2.5-coder:1.5b | 373.1 |
| gpt-oss-custom:latest | 145.83 |
| qwen2.5:7b-instruct | 144.42 |
| nemotron-3-nano-custom:latest | 134.58 |
| nemotron-cascade-2-custom:latest | 133.9 |
| gemma4:latest | 128.85 |
| gemma4:26b-custom | 113.48 |
| glm-4.7-flash:latest | 96.62 |
| huihui_ai/qwen3.5-abliterated:35b | 89.68 |
| qwen3.6:35b | 89.4 |
| qwen3.6:latest | 88.34 |
| huihui_ai/Qwen3.6-abliterated:35b | 87.55 |
| glm-4.7-flash-custom:latest | 72.96 |
| qwen3-coder-next:latest | 58.87 |
| qwen3-next:80b-custom | 56.54 |
| qwen3-coder-next-custom:latest | 52.19 |
| devstral-small-2:latest | 51.44 |
| devstral-small-2-custom:latest | 30.39 |
| gemma4:31b-custom | 26.84 |
| deepseek-r1:32b-custom | 24.68 |
| deepseek-r1:70b-custom | 11.04 |
| qwen3.5:latest | N/A |
| qwen3-vl:32b-custom | N/A |
| qwen3.6:27b | N/A |

Notes

  • 5-minute timeout per run.
  • Custom models use the maximum `num_ctx`, or the largest `num_ctx` that fits in 64 GB of VRAM (see the Modelfile sketch below).
  • Results are the prompt eval tokens/sec reported by Ollama (a reproduction sketch follows these notes).
  • I did not check the outputs for correctness.
  • Rig: 1× 5060 (16 GB), 1× 3090 (24 GB), 2× 3060 (12 GB).
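For context on the "custom" tags: in Ollama, context length is set per model through a Modelfile. A minimal sketch, assuming `qwen3-next:80b` as the base tag; the `32768` value is purely illustrative, whereas my custom models use whatever maximum fits in 64 GB:

```text
# Hypothetical Modelfile: raise num_ctx to the largest value that fits in VRAM
FROM qwen3-next:80b
PARAMETER num_ctx 32768
```

Build it with `ollama create qwen3-next:80b-custom -f Modelfile`.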
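And a minimal sketch of the measurement itself. Ollama's `/api/generate` endpoint returns `prompt_eval_count` (tokens) and `prompt_eval_duration` (nanoseconds) in its response metadata, which is where the numbers above come from. The endpoint URL is Ollama's default, and the model list is an illustrative subset, not my exact harness:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # default local endpoint
TIMEOUT_S = 300                                      # the 5-minute timeout from the notes
PROMPT = "You are an expert software engineer. ..."  # the full benchmark prompt above

# Illustrative subset; in practice, iterate over the output of `ollama list`.
MODELS = ["qwen2.5-coder:1.5b", "qwen2.5:7b-instruct", "deepseek-r1:32b-custom"]

for model in MODELS:
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": PROMPT, "stream": False},
            timeout=TIMEOUT_S,
        )
        resp.raise_for_status()
        data = resp.json()
        # prompt_eval_duration is reported in nanoseconds.
        tps = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
        print(f"{model}\t{tps:.2f} tok/s")
    except requests.exceptions.Timeout:
        print(f"{model}\tN/A (timed out after {TIMEOUT_S}s)")
```

For a quick spot check without a script, `ollama run <model> --verbose` prints the same prompt eval rate after each response.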