MinusPod LLM benchmark: 32 models tested on podcast ad detection (real transcripts, human-verified)

I maintain MinusPod, a self-hosted podcast server that uses Whisper and an LLM to strip ads. Users kept asking which LLM to use, and I didn't have a real answer. So I built a benchmark.

What was tested

  • 32 models across 12 providers, from frontier (GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, Grok 4.1, o3) down to free OpenRouter models
  • 7 podcast episodes, 6 with ads and 1 no-ad negative control, all with human-verified ad timestamps
  • Each episode split into ~85-second sliding windows. Models judge each window independently.
  • 5 trials per (model, episode) at temperature 0 to catch non-determinism
  • Predictions scored at IoU >= 0.5 against ground truth (see the scoring sketch after this list)
  • Costs recomputed from token counts at a fixed pricing snapshot, so all rows compare at the same prices
  • ~14,400 unique calls per sweep
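
For anyone curious how the windowing and IoU scoring fit together, here is a simplified Python sketch. It is not the benchmark code; the window stride and the greedy matching are assumptions for illustration. The idea: split the episode into overlapping windows, match predicted ad spans to verified spans at IoU >= 0.5, then compute F1.

```python
# Simplified sketch of the scoring idea (not the exact benchmark code).

def sliding_windows(duration_s, window_s=85.0, stride_s=42.5):
    """Yield (start, end) windows covering the episode; the stride is an assumption."""
    t = 0.0
    while t < duration_s:
        yield (t, min(t + window_s, duration_s))
        t += stride_s

def iou(a, b):
    """Intersection-over-union of two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def score(predicted, truth, threshold=0.5):
    """Greedy one-to-one matching at IoU >= threshold; returns precision, recall, F1."""
    unmatched = list(truth)
    tp = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(predicted) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: windows for a 5-minute clip, then scoring two predicted spans
# (one good match, one false positive) against a single verified ad.
print(list(sliding_windows(300))[:3])
print(score(predicted=[(30, 95), (400, 430)], truth=[(28, 90)]))
```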

Top results

Quick definitions for the table columns:

  • F1: combined precision and recall against human-verified ad spans. 0 means the model got nothing right, 1 means it found every ad with the correct boundaries. Higher is better.
  • Cost/episode: average USD per episode at a fixed pricing snapshot. Lower is better.
  • JSON compliance: fraction of responses that parsed as clean JSON matching the requested schema. 1.0 means every response came back well-formed. Higher is better. (A minimal validation sketch follows the results table.)
Rank  Model                      F1     Cost/episode  JSON compliance
1     grok-4.1-fast              0.642  $0.15         0.87
2     qwen3.5-plus (free tier)   0.616  $0.00         1.00
3     gpt-5.5                    0.613  $3.46         0.87
4     claude-opus-4-7            0.593  $4.10         1.00
5     gemini-2.5-pro             0.549  $2.03         0.97
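
The JSON compliance number is just the fraction of responses that survive a strict parse-and-shape check. Roughly like this (field names here are illustrative, not the actual schema):

```python
# Illustrative only: count responses that parse as JSON and match a simple
# expected shape. The "ads"/"start"/"end" field names are assumptions.
import json

def is_compliant(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not isinstance(data.get("ads"), list):
        return False
    return all(
        isinstance(ad, dict)
        and isinstance(ad.get("start"), (int, float))
        and isinstance(ad.get("end"), (int, float))
        for ad in data["ads"]
    )

responses = ['{"ads": [{"start": 30.0, "end": 95.0}]}', "Sure! Here are the ads..."]
print(sum(is_compliant(r) for r in responses) / len(responses))  # 0.5
```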

A few things the data surfaced:

  • Most models are heavily recall-biased: they would rather flag a non-ad as an ad than miss a real one. o3 is the only paid model that leans the other way (precision 0.70, recall 0.48).
  • F1 and boundary accuracy don't track. Some models that score well on F1 are still 15+ seconds off on where the ad starts or ends, which matters if you're actually cutting the audio.
  • JSON schema compliance varies widely. o4-mini parsed cleanly only 5% of the time; combined with its 0.07 F1, that made it the worst of the paid models in the run.
  • Self-reported confidence is poorly calibrated almost everywhere. Several models claim 0.95+ confidence at a true hit rate of 0.20 to 0.45 (see the per-bin sketch below).
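
To make the calibration point concrete, the per-bin table is conceptually this simple (a rough sketch, not the benchmark's exact binning):

```python
# Bucket predictions by the model's self-reported confidence and compare
# against the observed hit rate (whether the predicted span actually matched
# a verified ad at IoU >= 0.5).
from collections import defaultdict

def calibration_table(predictions, n_bins=5):
    """predictions: list of (confidence, hit) pairs, hit is True/False."""
    bins = defaultdict(lambda: [0, 0])  # bin index -> [hits, total]
    for conf, hit in predictions:
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += int(hit)
        bins[b][1] += 1
    for b in sorted(bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        hits, total = bins[b]
        print(f"confidence {lo:.1f}-{hi:.1f}: observed hit rate {hits/total:.2f} ({total} preds)")

# A well-calibrated model's observed rate tracks its claimed confidence;
# the run found 0.95+ claims with observed rates of 0.20 to 0.45.
calibration_table([(0.97, False), (0.96, True), (0.95, False), (0.6, True), (0.55, False)])
```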

Caveats

  • F1 numbers are upper-bounded by transcript quality: the benchmark scores against transcripts produced by faster-whisper large-v3 with an initial_prompt containing sponsor vocabulary. Smaller Whisper models, or no vocabulary prompt, will lower that ceiling. Production results will vary. (A transcription sketch follows this list.)
  • Latency numbers for OpenRouter-routed models include OpenRouter queueing and upstream provider load. Treat them as indicators of availability, not model speed.
  • Data science is not my background. The metric choices (F1 at IoU 0.5, MAE for boundaries, per-bin calibration tables) are what I could defend after reading around. I'd genuinely like a critique. PRs and issues welcome, especially on scoring methodology, additional episodes, or anything I'm computing wrong.
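
For reference, the transcription setup described in the first caveat looks roughly like this with faster-whisper. The sponsor vocabulary string and the vad_filter setting are just examples, not the exact values used:

```python
# Rough sketch of the transcript pipeline: faster-whisper large-v3 with an
# initial_prompt seeded with sponsor vocabulary (prompt text is illustrative).
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")

sponsor_vocab = "Sponsors: Squarespace, BetterHelp, promo code, use code at checkout."
segments, info = model.transcribe(
    "episode.mp3",
    initial_prompt=sponsor_vocab,  # biases decoding toward ad-read vocabulary
    vad_filter=True,               # optional: trims long silences
)

for seg in segments:
    print(f"[{seg.start:8.2f} -> {seg.end:8.2f}] {seg.text}")
```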

Repo and full report: https://github.com/ttlequals0/MinusPod/tree/main/benchmarks/llm


About MinusPod

MinusPod is a self-hosted server that removes ads before you ever hit play. It transcribes episodes with Whisper, uses an LLM to detect and cut ad segments, and gets smarter over time by building cross-episode ad patterns and learning from your corrections. Bring your own LLM: Claude, Ollama, OpenRouter, or any OpenAI-compatible provider.
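
"OpenAI-compatible" just means anything the standard OpenAI client can reach by swapping the base_url. For example (not MinusPod's actual configuration; the model slug and prompt are placeholders, and Ollama exposes the same API at http://localhost:11434/v1):

```python
# Illustration of pointing the OpenAI client at an alternative provider.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # or an Ollama / other compatible endpoint
    api_key="sk-or-...",                      # provider API key
)

resp = client.chat.completions.create(
    model="x-ai/grok-4.1-fast",  # placeholder model slug
    temperature=0,
    messages=[
        {"role": "system", "content": 'Return JSON: {"ads": [{"start": s, "end": s}]}'},
        {"role": "user", "content": "<transcript window>"},
    ],
)
print(resp.choices[0].message.content)
```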

https://github.com/ttlequals0/MinusPod
