u/AronSan

A while back I shipped a desktop app that generates fine-tuning datasets via OpenRouter. Got my Qwen2.5-Coder-7B from 55.5% → 72.3% on HumanEval with it (5 runs, Q4_K_M GGUF).

What's new

- Auto-detect - one click, scans `localhost:11434/1234/8080` (Ollama, LM Studio, llama.cpp defaults), adds whatever answers
- Mixed mode - gen on local Qwen3-14B, judge on cloud GPT-4-mini (or any combo per category). Routes each call to the right backend automatically.
- Custom endpoints - vLLM, TGI, or your own gateway; paste a base URL + optional bearer token
- Instant cancel - `task.cancel()` propagates straight into the in-flight httpx request, so cancelling feels like ~1s instead of waiting 8 minutes for a 14B chat call to time out (first sketch after this list)
- Reasoning model handling - when Qwen3 / DeepSeek-R1 burn the whole token budget on `<think>` blocks, the call now auto-retries with 4× the budget instead of skipping the example (second sketch below)
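
Roughly how the cancel path works, as a minimal sketch (the helper name `generate_one`, the endpoint, and the client wiring are placeholders, not the app's actual code): cancelling the asyncio task raises `CancelledError` at the awaited httpx call, which aborts the request instead of letting it run to the read timeout.

```python
import asyncio
import httpx

async def generate_one(client: httpx.AsyncClient, payload: dict) -> dict:
    # The await point sits inside httpx; cancellation lands here and aborts the request.
    resp = await client.post("/v1/chat/completions", json=payload, timeout=480)
    resp.raise_for_status()
    return resp.json()

async def main() -> None:
    async with httpx.AsyncClient(base_url="http://localhost:11434") as client:
        payload = {"model": "qwen3:14b", "messages": [{"role": "user", "content": "..."}]}
        task = asyncio.create_task(generate_one(client, payload))
        await asyncio.sleep(1.0)   # user hits Cancel
        task.cancel()              # CancelledError is raised at the await, closing the connection
        try:
            await task
        except asyncio.CancelledError:
            print("cancelled in ~1s instead of waiting out the timeout")

asyncio.run(main())
```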
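
And the budget-retry behaviour from the last bullet, again just a sketch with a hypothetical `call_model` helper: if the reply is effectively all `<think>` with no usable content left, re-issue the request once with 4× the token budget.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
OPEN_THINK_RE = re.compile(r"<think>.*\Z", re.DOTALL)  # unterminated block when the budget ran out

def strip_think(text: str) -> str:
    return OPEN_THINK_RE.sub("", THINK_RE.sub("", text)).strip()

async def generate_with_retry(call_model, prompt: str, max_tokens: int = 1024) -> str:
    # call_model(prompt, max_tokens) -> raw completion text (hypothetical helper)
    raw = await call_model(prompt, max_tokens)
    if not strip_think(raw):
        # The model spent the whole budget inside <think>: retry once with 4x the budget
        raw = await call_model(prompt, max_tokens * 4)
    return strip_think(raw)
```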

https://preview.redd.it/lz1sry13iyyg1.png?width=658&format=png&auto=webp&s=8502576438ff619fbdf5d13b641e7f9244f51222

Annoying stuff I had to figure out

- Token accounting differs across providers. OpenRouter breaks out `reasoning_tokens` cleanly. Ollama doesn't — `usage.completion_tokens` is the whole think+content figure. So an 80-token reply after 800 tokens of `<think>` reports as 880, which breaks the budget check and blows up the Quality Report stats by ~10×. Fix: detect `<think>` blocks or the `message.reasoning` field, recount the kept content with tiktoken, and write it back into `usage`.
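
A minimal sketch of that normalization, assuming an OpenAI-style response dict and `cl100k_base` as the recount encoding (the app's real field handling and encoding choice may differ):

```python
import re
import tiktoken

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
ENC = tiktoken.get_encoding("cl100k_base")

def normalize_usage(response: dict) -> dict:
    """Strip <think>/reasoning text and recount completion_tokens on the kept content only."""
    msg = response["choices"][0]["message"]
    content = msg.get("content") or ""
    reasoning = msg.get("reasoning")  # some backends return reasoning as a separate field

    if reasoning or THINK_RE.search(content):
        kept = THINK_RE.sub("", content).strip()
        msg["content"] = kept
        # Write the recount back so budget checks and Quality Report stats see content-only tokens
        response["usage"]["completion_tokens"] = len(ENC.encode(kept))
    return response
```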

- LM Studio uses `message.reasoning_content` instead of `message.reasoning`. Same idea, different field name. Discovered with curl. Sigh.
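
So the extraction ends up as a small fallback chain rather than a single field read; something like this (field names as described above, the rest is a sketch):

```python
def extract_reasoning(message: dict) -> str | None:
    # OpenRouter/Ollama-style field first, then LM Studio's variant
    return message.get("reasoning") or message.get("reasoning_content")
```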

- Capability flags, not provider-kind switches. First draft had `if provider.kind == "ollama"` everywhere. Doesn't scale. Refactored to `ProviderCapabilities` (supports_reasoning / requires_api_key / has_pricing / etc). Adding a new backend is now one class + one registry entry.
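
A rough shape of that refactor, as a sketch (the real class almost certainly carries more fields):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_reasoning: bool = False
    requires_api_key: bool = False
    has_pricing: bool = False
    reasoning_field: str | None = None  # e.g. "reasoning" vs "reasoning_content"

PROVIDER_REGISTRY: dict[str, ProviderCapabilities] = {
    "openrouter": ProviderCapabilities(supports_reasoning=True, requires_api_key=True,
                                       has_pricing=True, reasoning_field="reasoning"),
    "ollama":     ProviderCapabilities(supports_reasoning=True),
    "lmstudio":   ProviderCapabilities(supports_reasoning=True, reasoning_field="reasoning_content"),
}

def caps(provider_id: str) -> ProviderCapabilities:
    return PROVIDER_REGISTRY[provider_id]

# Call sites branch on flags, not on provider kind:
# if caps(provider_id).supports_reasoning: ...
```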

What I learned

- Sub-14B local models aren't worth it for dataset gen. Tested 7B/9B — the output drifts off-topic, repeats patterns, and misunderstands category descriptions. Whatever you save on cloud tokens you spend 5× over on rejected examples. 14B is the floor, 32B is comfortable.

- Mixed mode is the actual killer feature. Expected "fully offline" to be the win. Turns out the workflow most people want is: cheap local for volume gen (5000+ examples), strong cloud as judge (because rubber-stamp judges silently kill dataset quality). Since v1.0.3-beta that's a single config change.
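
To make that concrete, a hypothetical shape of the per-role routing (not the app's actual config format; model strings are placeholders):

```python
# Hypothetical mixed-mode routing: cheap local model for volume generation,
# stronger cloud model as the judge.
MIXED_MODE = {
    "generator": {"provider": "ollama",     "model": "qwen3:14b"},       # local, volume gen
    "judge":     {"provider": "openrouter", "model": "<cloud-judge>"},   # stronger cloud judge
}
```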

What didn't make the cut

- Per-provider concurrency limits. Prototyped, cut. Enterprise complexity for ~zero real benefit on single-GPU setups.

- Provider badge in model picker. Two providers serving the same model name currently show up as identical entries. Punted.

Links

- Repo: github.com/AronDaron/dataset-generator (AGPL-3.0)

- Dataset (2,248 examples): huggingface.co/datasets/AronDaron/OctoBench-2.2k

u/AronSan — 11 days ago

Hey,

Quick update on the dataset generator app I posted about a few days ago.

I gave it a real try. Generated a bigger dataset (2,248 examples across 8 categories), fine-tuned Qwen2.5-Coder-7B-Instruct again, and ran four benchmarks this time. Here's how it went:

https://preview.redd.it/r1zp3ohv76yg1.png?width=2550&format=png&auto=webp&s=992571e3cd91bfaabd7fc184e81eb56876cc3db6

HumanEval / HumanEval+ jumped much harder than last time. BigCodeBench barely moved. LiveCodeBench actually regressed. The last two are the more interesting part.

I dug into the LCB regression — turned out the model produced correct logic but was missing the `input()`/`print()` wrappers. My training data was framed as "return only the function" and LCB tests need full programs with stdin/stdout. Format mismatch, not a knowledge gap. Already generating a category that fixes this.
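
To illustrate the mismatch with a made-up toy problem: the training data taught the first shape, while LCB's tests execute the second as a standalone program.

```python
# Shape the training data taught: "return only the function"
def solve(nums: list[int]) -> int:
    return max(nums) - min(nums)

# Shape LiveCodeBench actually executes: a full program over stdin/stdout
if __name__ == "__main__":
    nums = list(map(int, input().split()))
    print(max(nums) - min(nums))
```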

BCB barely moving was honestly my fault. My "data libraries" category was way too generic ("any 2+ libs from this list") and BCB tests precise API usage with concrete kwargs. Working on a follow-up category seeded with BCB's actual taxonomy.

A few other things I learned along the way:

- Judge model matters more than generator model. Some flash-tier judges rubber-stamp everything; smaller ones skip half of what they don't understand.

- Shorter category descriptions beat longer ones. I overengineered the prompts at first, and the accept rate dropped from ~85% to ~10% once I piled on too many filters.

Resources:

- Dataset: https://huggingface.co/datasets/AronDaron/OctoBench-2.2k

- Fine-tuned model: https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune

- Code (AGPL-3.0): https://github.com/AronDaron/dataset-generator

Happy to hear feedback, especially around judge model selection — that surprised me the most. Also if anyone has tried fine-tuning specifically targeting BCB or LCB, would love to hear what worked.

u/AronSan — 15 days ago