Comparing tokens per second of common models
I benchmarked prompt eval tokens per second across my Ollama models, all using the same prompt.
**Benchmark Results**
Prompt:

```
You are an expert software engineer. Write a comprehensive, production-ready Python implementation of a rate limiter using the sliding window algorithm. Include:
- A class-based design with proper encapsulation
- Unit tests using pytest
- Type hints throughout
- Comprehensive docstrings
- Error handling for edge cases
- Performance considerations for high-throughput scenarios
Make the implementation exactly 500 words in explanation length, with detailed comments explaining each decision.
```
| Model Name | Tokens per Second | Status |
|---|---|---|
| qwen2.5-coder:1.5b | 373.1 | ✅ |
| gpt-oss-custom:latest | 145.83 | ✅ |
| qwen2.5:7b-instruct | 144.42 | ✅ |
| nemotron-3-nano-custom:latest | 134.58 | ✅ |
| nemotron-cascade-2-custom:latest | 133.9 | ✅ |
| gemma4:latest | 128.85 | ✅ |
| gemma4:26b-custom | 113.48 | ✅ |
| glm-4.7-flash:latest | 96.62 | ✅ |
| huihui_ai/qwen3.5-abliterated:35b | 89.68 | ✅ |
| qwen3.6:35b | 89.4 | ✅ |
| qwen3.6:latest | 88.34 | ✅ |
| huihui_ai/Qwen3.6-abliterated:35b | 87.55 | ✅ |
| glm-4.7-flash-custom:latest | 72.96 | ✅ |
| qwen3-coder-next:latest | 58.87 | ✅ |
| qwen3-next:80b-custom | 56.54 | ✅ |
| qwen3-coder-next-custom:latest | 52.19 | ✅ |
| devstral-small-2:latest | 51.44 | ✅ |
| devstral-small-2-custom:latest | 30.39 | ✅ |
| gemma4:31b-custom | 26.84 | ✅ |
| deepseek-r1:32b-custom | 24.68 | ✅ |
| deepseek-r1:70b-custom | 11.04 | ✅ |
| qwen3.5:latest | N/A | ❌ |
| qwen3-vl:32b-custom | N/A | ❌ |
| qwen3.6:27b | N/A | ❌ |
**Notes**
- 5-minute timeout per run.
- Custom models use the model's maximum `num_ctx`, or the largest `num_ctx` that fits in 64 GB of VRAM (see the Modelfile sketch below).
- Results are the prompt eval tokens/sec figures reported by Ollama (see the benchmark sketch after this list).
- I did not check the outputs for correctness.
- Rig: 1× 5060 (16 GB), 1× 3090 (24 GB), 2× 3060 (12 GB).
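For reference, the `-custom` variants set `num_ctx` through a Modelfile. A minimal sketch of what that looks like; the base model and context size here are just placeholders, since the actual values vary per model:

```
# Illustrative Modelfile: base model and num_ctx are placeholders,
# not the exact values behind the "-custom" tags in the table.
FROM qwen2.5:7b-instruct
PARAMETER num_ctx 32768
```

Build it with `ollama create qwen2.5-custom -f Modelfile` and it shows up like any other local model.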
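If anyone wants to reproduce this, here is a minimal sketch of scripting a run like this against Ollama's REST API (not my exact harness; the endpoint and response fields are from the Ollama API docs, the helper and model list are just examples):

```python
import requests  # third-party; pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
TIMEOUT_S = 300  # the 5-minute timeout from the notes
PROMPT = "You are an expert software engineer. ..."  # full benchmark prompt from above goes here

def prompt_eval_tps(model: str) -> float | None:
    """Return Ollama's reported prompt eval tokens/sec, or None on timeout/error."""
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": PROMPT, "stream": False},
            timeout=TIMEOUT_S,
        )
        resp.raise_for_status()
    except requests.RequestException:
        return None  # shows up as N/A / ❌ in the table

    stats = resp.json()
    # Ollama reports durations in nanoseconds next to the token counts;
    # eval_count / eval_duration would give the generation rate instead.
    return stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9)

for model in ("qwen2.5-coder:1.5b", "deepseek-r1:70b-custom"):
    tps = prompt_eval_tps(model)
    print(f"{model}: {tps:.2f} t/s" if tps is not None else f"{model}: N/A")
```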