u/Critical-Entry3377

Comparing tokens per second of common models

I benchmarked prompt eval tokens per second on my Ollama models.

Benchmark Results

Prompt:

```text
You are an expert software engineer. Write a comprehensive, production-ready Python implementation of a rate limiter using the sliding window algorithm. Include:

1. A class-based design with proper encapsulation
2. Unit tests using pytest
3. Type hints throughout
4. Comprehensive docstrings
5. Error handling for edge cases
6. Performance considerations for high-throughput scenarios

Make the implementation exactly 500 words in explanation length, with detailed comments explaining each decision.
```

| Model Name | Tokens per Second |
|---|---|
| qwen2.5-coder:1.5b | 373.1 |
| gpt-oss-custom:latest | 145.83 |
| qwen2.5:7b-instruct | 144.42 |
| nemotron-3-nano-custom:latest | 134.58 |
| nemotron-cascade-2-custom:latest | 133.9 |
| gemma4:latest | 128.85 |
| gemma4:26b-custom | 113.48 |
| glm-4.7-flash:latest | 96.62 |
| huihui_ai/qwen3.5-abliterated:35b | 89.68 |
| qwen3.6:35b | 89.4 |
| qwen3.6:latest | 88.34 |
| huihui_ai/Qwen3.6-abliterated:35b | 87.55 |
| glm-4.7-flash-custom:latest | 72.96 |
| qwen3-coder-next:latest | 58.87 |
| qwen3-next:80b-custom | 56.54 |
| qwen3-coder-next-custom:latest | 52.19 |
| devstral-small-2:latest | 51.44 |
| devstral-small-2-custom:latest | 30.39 |
| gemma4:31b-custom | 26.84 |
| deepseek-r1:32b-custom | 24.68 |
| deepseek-r1:70b-custom | 11.04 |
| qwen3.5:latest | N/A |
| qwen3-vl:32b-custom | N/A |
| qwen3.6:27b | N/A |

Notes

  • 5-minute timeout per run.
  • Custom models use the maximum `num_ctx`, or the largest `num_ctx` that fits in 64 GB of VRAM (see the Modelfile sketch below).
  • Results are the prompt eval tokens/sec reported by Ollama (a reproduction sketch follows these notes).
  • I did not check the outputs for correctness.
  • Rig: 1× 5060 (16 GB), 1× 3090 (24 GB), 2× 3060 (12 GB).
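For context on the "custom" tags: in Ollama, context length is set per model through a Modelfile. A minimal sketch, assuming `qwen3-next:80b` as the base tag; the `32768` value is purely illustrative, whereas my custom models use whatever maximum fits in 64 GB:

```text
# Hypothetical Modelfile: raise num_ctx to the largest value that fits in VRAM
FROM qwen3-next:80b
PARAMETER num_ctx 32768
```

Build it with `ollama create qwen3-next:80b-custom -f Modelfile`.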
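And a minimal sketch of the measurement itself. Ollama's `/api/generate` endpoint returns `prompt_eval_count` (tokens) and `prompt_eval_duration` (nanoseconds) in its response metadata, which is where the numbers above come from. The endpoint URL is Ollama's default, and the model list is an illustrative subset, not my exact harness:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # default local endpoint
TIMEOUT_S = 300                                      # the 5-minute timeout from the notes
PROMPT = "You are an expert software engineer. ..."  # the full benchmark prompt above

# Illustrative subset; in practice, iterate over the output of `ollama list`.
MODELS = ["qwen2.5-coder:1.5b", "qwen2.5:7b-instruct", "deepseek-r1:32b-custom"]

for model in MODELS:
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": PROMPT, "stream": False},
            timeout=TIMEOUT_S,
        )
        resp.raise_for_status()
        data = resp.json()
        # prompt_eval_duration is reported in nanoseconds.
        tps = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
        print(f"{model}\t{tps:.2f} tok/s")
    except requests.exceptions.Timeout:
        print(f"{model}\tN/A (timed out after {TIMEOUT_S}s)")
```

For a quick spot check without a script, `ollama run <model> --verbose` prints the same prompt eval rate after each response.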