
A while back I shipped a desktop app that generates fine-tuning datasets via OpenRouter. Got my Qwen2.5-Coder-7B from 55.5% → 72.3% on HumanEval with it (5 runs, Q4_K_M GGUF).
What's new
- Auto-detect - one click, scans the default local ports (`11434` Ollama, `1234` LM Studio, `8080` llama.cpp), adds whatever answers
- Mixed mode - gen on local Qwen3-14B, judge on cloud GPT-4o-mini (or any combo per category). Routes each call to the right backend automatically.
- Custom endpoints - vLLM, TGI, your own gateway: paste base URL + optional bearer token
- Instant cancel - `task.cancel()` propagates straight into the in-flight httpx request, so cancel feels like ~1s instead of waiting 8 minutes for a 14B chat call to time out (sketch after this list)
- Reasoning model handling - when Qwen3 / DeepSeek-R1 burns the whole token budget on `<think>` blocks, the call now auto-retries with 4× the budget instead of skipping the example (also sketched below)
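
The instant-cancel path, as a minimal sketch (asyncio assumed; `generate`, the URL, and the payload are placeholders, not the app's actual code):

```python
import asyncio
import httpx

async def generate(client: httpx.AsyncClient, payload: dict) -> dict:
    # The await sits inside httpx, so task.cancel() raises CancelledError
    # right here and tears down the connection, instead of waiting out
    # the request timeout.
    resp = await client.post(
        "http://localhost:11434/v1/chat/completions",
        json=payload,
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()

async def main() -> None:
    async with httpx.AsyncClient() as client:
        payload = {"model": "qwen3:14b",
                   "messages": [{"role": "user", "content": "write a haiku"}]}
        task = asyncio.create_task(generate(client, payload))
        await asyncio.sleep(1.0)   # ...user hits Cancel in the UI
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            print("cancelled in ~1s instead of minutes")

asyncio.run(main())
```

When the awaiting task is cancelled, httpx closes the underlying connection, and local servers generally stop generating once the client disconnects.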
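And the reasoning retry, sketched under the assumption that the failure signal is a hit token cap with nothing left after stripping `<think>`; `call` is a hypothetical stand-in for whatever issues the chat completion:

```python
import re

THINK_RE = re.compile(r"<think>.*?(?:</think>|$)", re.DOTALL)

async def generate_with_retry(call, payload: dict, budget: int) -> str:
    # `call` is a hypothetical async fn: call(payload) -> (content, finish_reason)
    for max_tokens in (budget, budget * 4):      # second attempt gets 4x budget
        payload["max_tokens"] = max_tokens
        content, finish_reason = await call(payload)
        visible = THINK_RE.sub("", content).strip()
        # Budget exhausted entirely inside <think> => no visible answer: retry.
        if visible or finish_reason != "length":
            return visible
    raise ValueError("still all <think> at 4x budget; skip the example")
```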
Annoying stuff I had to figure out
- Token accounting differs across providers. OpenRouter breaks out `reasoning_tokens` cleanly; Ollama doesn't, so `usage.completion_tokens` is the whole think+content figure. An 80-token reply after 800 tokens of `<think>` reports as 880, which breaks the budget check and blows up Quality Report stats by 10×. Fix: detect `<think>` blocks or the `message.reasoning` field, recount the kept content with tiktoken, and write it back into `usage` (sketch after this list).
- LM Studio uses `message.reasoning_content` instead of `message.reasoning`. Same idea, different field name. Discovered with curl. Sigh.
- Capability flags, not provider-kind switches. First draft had `if provider.kind == "ollama"` everywhere; doesn't scale. Refactored to `ProviderCapabilities` (supports_reasoning / requires_api_key / has_pricing / etc.). Adding a new backend is now one class + one registry entry (sketch below).
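
The recount fix, roughly (a sketch: `normalize_usage` is my name for it, and `cl100k_base` is only an approximation of Qwen's tokenizer, but any consistent encoder fixes the 10× stats skew):

```python
import re
import tiktoken

THINK_RE = re.compile(r"<think>.*?(?:</think>|$)", re.DOTALL)
ENC = tiktoken.get_encoding("cl100k_base")  # approximation, not Qwen's tokenizer

def normalize_usage(message: dict, usage: dict) -> dict:
    """Split a lumped completion_tokens figure into reasoning vs. content."""
    content = message.get("content") or ""
    # Reasoning arrives either inline as <think>...</think> or in a separate
    # field (message.reasoning on Ollama, message.reasoning_content on LM Studio).
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    if not reasoning:
        reasoning = "".join(THINK_RE.findall(content))
    kept = THINK_RE.sub("", content)
    # Only recount when the provider didn't already break out reasoning tokens.
    if reasoning and "reasoning_tokens" not in usage:
        usage["completion_tokens"] = len(ENC.encode(kept))      # content only
        usage["reasoning_tokens"] = len(ENC.encode(reasoning))  # think budget
    message["content"] = kept
    return usage
```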
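The capability refactor looks something like this; the flag names are from the list above, the registry shape is my guess:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_reasoning: bool = False   # emits <think> / reasoning fields
    requires_api_key: bool = False
    has_pricing: bool = False          # usage can be converted to cost

# Registry shape is an assumption; entries mirror the providers in the post.
REGISTRY: dict[str, ProviderCapabilities] = {
    "openrouter": ProviderCapabilities(supports_reasoning=True,
                                       requires_api_key=True,
                                       has_pricing=True),
    "ollama":     ProviderCapabilities(supports_reasoning=True),
    "lmstudio":   ProviderCapabilities(supports_reasoning=True),
}

# Call sites branch on capabilities, never on provider kind:
#   if REGISTRY[provider].supports_reasoning:
#       usage = normalize_usage(message, usage)
```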
What I learned
- Sub-14B local models aren't worth it for dataset gen. Tested 7B/9B: output drifts off-topic, repeats patterns, misunderstands category descriptions. Whatever you save on cloud tokens you spend 5× over on rejected examples. 14B is the floor, 32B is comfortable.
- Mixed mode is the actual killer feature. I expected "fully offline" to be the win. Turns out the workflow most people want is cheap local for volume gen (5000+ examples) plus a strong cloud model as judge (because rubber-stamp judges silently kill dataset quality). Since v1.0.3-beta it's one config change (sketch below).
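
A hypothetical shape of that config, just to show the idea (the real keys in the app will differ):

```python
# Hypothetical per-role routing: cheap local volume gen, strong cloud judge.
ROUTES = {
    "generator": {"provider": "ollama",     "model": "qwen3:14b"},
    "judge":     {"provider": "openrouter", "model": "openai/gpt-4o-mini"},
}

def backend_for(role: str) -> dict:
    # Each call is dispatched to whatever its role's route says.
    return ROUTES[role]
```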
What didn't make the cut
- Per-provider concurrency limits. Prototyped, cut. Enterprise complexity for ~zero real benefit on single-GPU setups.
- Provider badge in model picker. Two providers with the same model name show as identical entries. Punted.
Links
- Repo: github.com/AronDaron/dataset-generator (AGPL-3.0)
- Dataset (2,248 examples): huggingface.co/datasets/AronDaron/OctoBench-2.2k