
[Tool] Dataset Factory for ComfyUI: a workflow pack for video dataset curation and creation
Hey everyone. I've been working on a full finetune of LTX 2.3 for 2D animation video generation, and the biggest bottleneck wasn't training; it was building and curating the dataset. So I started building a proper toolset for it inside ComfyUI.
What it is: A pack of workflows that cover the full dataset pipeline, from raw footage to training-ready clips.
What's working now:
Workflow 1 — Slicer — drop in a long video and it automatically detects scene cuts and saves each clip, numbered, into a folder. It remembers progress: if you stopped at clip 47, the next video starts at 48.
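For anyone curious what the slicing amounts to, here's a rough standalone sketch outside ComfyUI. PySceneDetect, the ffmpeg flags, the file names, and the counter file are illustrative choices, not necessarily what the pack uses internally:

```python
import subprocess
from pathlib import Path

from scenedetect import ContentDetector, detect

OUT_DIR = Path("clips")             # illustrative output folder
COUNTER = OUT_DIR / "_counter.txt"  # persists the last clip number used

def slice_video(video_path: str) -> None:
    OUT_DIR.mkdir(exist_ok=True)
    # resume numbering where the previous video left off
    start = int(COUNTER.read_text()) + 1 if COUNTER.exists() else 1
    # ContentDetector flags hard cuts from frame-to-frame content changes
    scenes = detect(video_path, ContentDetector())
    last = start - 1
    for last, (begin, end) in enumerate(scenes, start=start):
        out = OUT_DIR / f"clip_{last:05d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-ss", begin.get_timecode(), "-to", end.get_timecode(),
             "-an", str(out)],
            check=True)
    COUNTER.write_text(str(last))  # next video resumes from here
```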
Workflow 2 — Captioner — point it at a folder of clips and it sends each one to a vision API (any VL or omni model), which generates a detailed text description of what happens in the clip: camera angle, motion, characters, environment, lighting. Saves a .txt per clip.
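The basic pattern is: sample a few frames, send them with a prompt, write the answer next to the clip. A minimal sketch assuming an OpenAI-compatible endpoint; the model name, prompt, and frame count here are placeholders:

```python
import base64
from pathlib import Path

import cv2
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY; base_url can point at a local server

PROMPT = ("Describe this video clip in detail: camera angle, motion, "
          "characters, environment, lighting.")

def sample_frames(clip: str, n: int = 4) -> list[str]:
    """Grab n evenly spaced frames as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(clip)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n))
        ok, frame = cap.read()
        if ok:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames

def caption(clip: str) -> None:
    content = [{"type": "text", "text": PROMPT}] + [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
        for f in sample_frames(clip)
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any VL-capable model works
        messages=[{"role": "user", "content": content}],
    )
    # caption saved next to the clip, same name, .txt extension
    Path(clip).with_suffix(".txt").write_text(resp.choices[0].message.content)
```

Pointing base_url at a locally hosted VL model makes the same pattern work offline.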
Workflow 3 — Adapter — if you need exact-second durations for training (2s, 3s, 4s...), it speeds up or slows down each clip by the smallest amount needed to match the target length.
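The retiming math is just "snap to the nearest allowed length, then scale the timestamps." A minimal sketch, assuming ffmpeg and ffprobe are on PATH; the target list is an example:

```python
import subprocess

TARGETS = [2.0, 3.0, 4.0, 5.0]  # allowed clip lengths in seconds (example)

def duration(path: str) -> float:
    """Read a clip's duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout)

def retime(src: str, dst: str) -> None:
    d = duration(src)
    target = min(TARGETS, key=lambda t: abs(t - d))  # nearest allowed length
    # setpts scales presentation timestamps: factor < 1 speeds up, > 1 slows down
    factor = target / d
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"setpts={factor:.6f}*PTS", "-an", dst],
        check=True)
```

Audio is dropped here (-an) since training clips usually don't need it; keeping it would also mean chaining an atempo filter.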
Workflow 4 — Curator — type a natural-language query like "water", "fight", or "character running". It reads all captions, compares them semantically using a local embedding model, and copies the matching clips into a separate folder. No need to read captions one by one when trying to find videos in a massive dataset.
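Under the hood this is plain embedding search over the caption files. A sketch with sentence-transformers; the model choice, similarity threshold, and .mp4 assumption are mine, not the pack's actual settings:

```python
import shutil
from pathlib import Path

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

def curate(clip_dir: str, query: str, out_dir: str, threshold: float = 0.35):
    captions = sorted(Path(clip_dir).glob("*.txt"))
    texts = [c.read_text() for c in captions]
    cap_emb = model.encode(texts, convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, cap_emb)[0]  # cosine similarity per caption
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cap, score in zip(captions, scores):
        if score >= threshold:
            clip = cap.with_suffix(".mp4")  # assumes clip sits next to caption
            if clip.exists():
                shutil.copy(clip, out_dir)

curate("clips", "character running", "curated/running")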
Workflow 5 — Analysis — analyzes technical quality for every clip: sharpness, motion score, resolution, black bars, near-duplicate detection — and automatically sorts them into good, medium, and discard folders.
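The sharpness and motion probes can be done cheaply with OpenCV; here's the general idea. The thresholds are made-up examples, and near-duplicate detection is omitted for brevity:

```python
import cv2
import numpy as np

def analyze(path: str, step: int = 5) -> dict:
    cap = cv2.VideoCapture(path)
    sharpness, motion, prev = [], [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:  # sample every Nth frame to stay fast
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # variance of the Laplacian: low values indicate blur
            sharpness.append(cv2.Laplacian(gray, cv2.CV_64F).var())
            if prev is not None:
                # mean absolute frame difference as a cheap motion score
                motion.append(np.mean(cv2.absdiff(gray, prev)))
            prev = gray
        idx += 1
    cap.release()
    if not sharpness:
        return {"sharpness": 0.0, "motion": 0.0}
    return {"sharpness": float(np.mean(sharpness)),
            "motion": float(np.mean(motion)) if motion else 0.0}

def bucket(stats: dict) -> str:
    """Example thresholds only; the real sorting would be tuned per dataset."""
    if stats["sharpness"] < 50 or stats["motion"] < 0.5:
        return "discard"
    return "good" if stats["sharpness"] > 150 else "medium"
```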
Workflow 6 — Profiler — reads the whole dataset and generates a plain-text report with clip count, duration distribution, motion distribution, dominant camera angles, and automatic imbalance warnings like "74% of clips are close-ups; consider adding more wide shots" or "30% of clips are fight scenes; consider adding more everyday interactions."
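A toy version of the report logic, just to show its shape; the keyword buckets and the 60% warning threshold are placeholders:

```python
from collections import Counter
from pathlib import Path

ANGLES = ["close-up", "wide shot", "medium shot"]  # assumed caption vocabulary

def profile(clip_dir: str) -> str:
    captions = list(Path(clip_dir).glob("*.txt"))
    angle_counts = Counter()
    for cap in captions:
        text = cap.read_text().lower()
        for angle in ANGLES:
            if angle in text:
                angle_counts[angle] += 1
    lines = [f"clips: {len(captions)}"]
    for angle, n in angle_counts.most_common():
        share = n / max(len(captions), 1)
        lines.append(f"{angle}: {share:.0%}")
        if share > 0.6:  # crude imbalance warning
            lines.append(f"WARNING: {share:.0%} of clips are {angle}; "
                         "consider adding more variety")
    return "\n".join(lines)

print(profile("clips"))
```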
I'm still building:
- Local captioning using omni models (no remote API needed; local API endpoints work too)
- Caption refinement — a second-pass critic that checks if the caption actually matches the video
- A proper quality scoring system — this is the part I care most about. I don't want a score that's just an LLM saying "this clip looks good." I want something closer to human curation: optical flow for motion quality, per-frame blur detection, composition analysis, and temporal-consistency metrics that reflect what actually makes a clip good for training, not just what looks aesthetically pleasing. (Rough sketch of the direction below.)
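To make that concrete, here's the kind of scoring I'm experimenting with: dense optical flow for motion quality plus Laplacian variance for blur, combined into one number. The weights and normalization constants are arbitrary placeholders, not a finished metric:

```python
import cv2
import numpy as np

def score_clip(path: str) -> float:
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        return 0.0
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flow_mags, blur_scores = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow between consecutive frames
        flow = cv2.calcOpticalFlowFarneback(
            prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flow_mags.append(np.linalg.norm(flow, axis=2).mean())
        blur_scores.append(cv2.Laplacian(gray, cv2.CV_64F).var())
        prev = gray
    cap.release()
    if not flow_mags:
        return 0.0
    motion = np.mean(flow_mags)
    # penalize jittery motion: high flow variance relative to the mean
    smoothness = 1.0 / (1.0 + np.std(flow_mags) / (motion + 1e-6))
    sharp = np.mean(blur_scores)
    # placeholder weights and blur normalization constant
    return float(0.5 * smoothness + 0.5 * min(sharp / 200.0, 1.0))
```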
My goal right now is to make building datasets for LoRA and DoRA training as fast and reliable as possible, with the minimum human effort required. I'll release everything when the scoring system and local captioning are solid. (And I'm open to suggestions.)