u/MerlingDSal

▲ 3 r/comfyui+1 crossposts

[Tool] Dataset Factory for ComfyUI: a workflow pack for video dataset curation and creation

Hey everyone. I've been working on a full finetune of LTX 2.3 for 2D animation video generation, and the biggest bottleneck wasn't training; it was building and curating the dataset. So I started building a proper toolset for it inside ComfyUI.

What it is: A pack of workflows that cover the full dataset pipeline, from raw footage to training-ready clips.

What's working now:

Workflow 1 — Slicer — drop in a long video and it automatically detects scene cuts and saves each clip, numbered, into a folder. It remembers progress — if you stopped at clip 47, the next video starts at 48.
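For anyone curious how cut detection works under the hood: libraries like PySceneDetect do this properly, but the core idea is just thresholding the difference between consecutive frames. A minimal numpy sketch (not the actual node code, function name is mine):

```python
import numpy as np

def find_cuts(frames, threshold=30.0):
    """Return frame indices where a hard cut likely occurs.

    frames: list of grayscale frames as 2D numpy arrays (same shape).
    threshold: mean absolute pixel difference (0-255 scale) above which
    two consecutive frames are treated as a scene cut.
    """
    cuts = []
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32) - frames[i - 1].astype(np.float32))
        if diff.mean() > threshold:
            cuts.append(i)
    return cuts
```

A real slicer also needs a minimum scene length and handles fades/dissolves, which a single-frame difference misses.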

Workflow 2 — Captioner — point it at a folder of clips and it sends each one to a vision API (any VL or omni model). It generates a detailed text description of what happens in the clip: camera angle, motion, characters, environment, lighting. Saves a .txt per clip.
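Since most vision APIs take a handful of images rather than raw video, the usual trick is to sample a few evenly spaced frames from each clip and send those. A small sketch of the sampling step (hypothetical helper, not the workflow's actual node):

```python
def sample_indices(total_frames, n=8):
    """Pick n evenly spaced frame indices to send to a vision model.

    Always includes the first and last frame so the caption can
    describe how the shot starts and ends.
    """
    if n >= total_frames:
        return list(range(total_frames))
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]
```

The sampled frames then get base64-encoded and posted to whatever VL endpoint you use, and the response is written next to the clip as `clipname.txt`.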

Workflow 3 — Adapter — if you need exact-second durations for training (2s, 3s, 4s...), it speeds up or slows down each clip by the smallest amount needed to match the target length.
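The "smallest amount needed" part boils down to picking the target duration whose speed factor is closest to 1.0. A quick sketch of that selection logic (function name is mine):

```python
def best_speed_factor(duration, targets=(2.0, 3.0, 4.0, 5.0)):
    """Pick the target length requiring the smallest speed change.

    Returns (target, factor), where factor is the playback-speed
    multiplier: factor > 1 speeds the clip up, factor < 1 slows it down.
    """
    # factor = duration / target; the closer to 1.0, the less distortion
    target = min(targets, key=lambda t: abs(duration / t - 1.0))
    return target, duration / target
```

In ffmpeg terms the resulting factor maps to something like `setpts=PTS/factor` for video (plus `atempo` if the clips carry audio).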

Workflow 4 — Curator — type a natural-language query like "water", "fight", or "character running". It reads all captions, compares them semantically using a local embedding model, and copies the matching clips into a separate folder. No need to read captions one by one when trying to find videos in a massive dataset.
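The semantic matching is just cosine similarity between the query embedding and each caption's embedding, both produced by whatever local embedding model you load (e.g. a sentence-transformers model). A minimal sketch assuming the vectors are already computed:

```python
import numpy as np

def rank_clips(query_vec, caption_vecs, top_k=5, min_sim=0.3):
    """Rank caption embeddings by cosine similarity to the query.

    query_vec: 1D array from a local embedding model.
    caption_vecs: dict of clip_name -> 1D array (same dimension).
    min_sim: drop clips below this similarity so unrelated
    captions never get copied.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [(name, cos(query_vec, v)) for name, v in caption_vecs.items()]
    scored = [s for s in scored if s[1] >= min_sim]
    scored.sort(key=lambda s: -s[1])
    return scored[:top_k]
```

The surviving clip names are then copied into the query's output folder.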

Workflow 5 — Analysis — analyzes technical quality for every clip: sharpness, motion score, resolution, black bars, near-duplicate detection — and automatically sorts them into good, medium, and discard folders.
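Two of those checks are cheap enough to sketch inline: sharpness is commonly scored as the variance of the Laplacian (blurry frames give low values), and black bars show up as rows with near-zero mean brightness. A numpy-only sketch of both (helper names are mine):

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian: a standard sharpness proxy.

    gray: 2D float array. Low values suggest a blurry frame.
    """
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
           + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
    lap = lap[1:-1, 1:-1]  # drop wrap-around borders introduced by np.roll
    return float(lap.var())

def black_bar_rows(gray, thresh=8.0):
    """Count letterbox rows: dark rows at the top and bottom of the frame."""
    row_means = gray.mean(axis=1)
    top = next((i for i, m in enumerate(row_means) if m >= thresh), len(row_means))
    bot = next((i for i, m in enumerate(row_means[::-1]) if m >= thresh), len(row_means))
    return top, bot
```

Motion scoring and near-duplicate detection need frame pairs (optical flow, perceptual hashes), so they don't fit in a snippet this small.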

Workflow 6 — Profiler — reads the whole dataset and generates a plain-text report with clip count, duration distribution, motion distribution, dominant camera angles, and automatic imbalance warnings like "74% of clips are close-up or 30% of clips are about characters fighting — consider adding more wide shots and normal interactions."
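The imbalance warnings are just category shares against a threshold. A sketch of that check, assuming the profiler has already extracted a tag per clip from the captions (function name and threshold are mine):

```python
from collections import Counter

def imbalance_warnings(labels, max_share=0.5):
    """Flag any category that exceeds max_share of the dataset.

    labels: one tag per clip, e.g. camera angles parsed from captions.
    Returns human-readable warning strings for the report.
    """
    counts = Counter(labels)
    total = len(labels)
    warnings = []
    for label, n in counts.most_common():
        share = n / total
        if share > max_share:
            warnings.append(f"{share:.0%} of clips are '{label}' - consider balancing")
    return warnings
```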

I'm still building:

  • Local captioning using omni models (no cloud API needed)
  • Caption refinement — a second-pass critic that checks if the caption actually matches the video
  • A proper quality scoring system — this is the part I care most about. I don't want a score that's just an LLM saying "this clip looks good." I want something closer to human curation: optical flow for motion quality, per-frame blur detection, composition analysis, and temporal consistency metrics that reflect what actually makes a clip good for training, not just aesthetically pleasing.
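Once each of those metrics is normalized to 0-1, combining them into one score and sorting into the good/medium/discard folders can be as simple as a weighted average. A sketch of the shape I have in mind (names, weights, and cutoffs are placeholders, not a finished design):

```python
def quality_score(metrics, weights=None):
    """Combine per-clip metrics (each already normalized to 0-1).

    metrics: dict like {"sharpness": 0.8, "motion": 0.6, "temporal": 0.9}
    weights: dict with the same keys; defaults to equal weighting.
    """
    if weights is None:
        weights = {k: 1.0 for k in metrics}
    total_w = sum(weights[k] for k in metrics)
    return sum(metrics[k] * weights[k] for k in metrics) / total_w

def bucket(score, good=0.7, medium=0.4):
    """Map a combined score to a destination folder."""
    return "good" if score >= good else "medium" if score >= medium else "discard"
```

The hard part isn't this arithmetic, of course; it's making each input metric actually correlate with training value.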

My goal right now is to make building datasets for LoRA and DoRA training as fast and reliable as possible, with the minimum human effort required. I will release everything once the scoring system and local captioning are solid. (And I'm open to suggestions.)

u/MerlingDSal — 13 days ago
▲ 69 r/comfyui+1 crossposts

This is a follow-up to my previous post:

Previous post for context: https://www.reddit.com/r/StableDiffusion/comments/1svrzzt/is_anyone_else_interested_in_buildingfinetuning/

Hi people of Reddit.

A few days ago I decided to try a full fine-tuning run of LTX 2.3. In a previous post, I talked about the problems LTX 2.3 has with 2D animation, and recently I had the chance to talk with people from the LTX team. They basically confirmed what I was already suspecting.

LTX did not see much 2D animation during training, mainly because licensing this kind of data is difficult.

So after struggling with LoRA training, I decided that I wanted to do a full finetune of the model, with the goal of adding more 2D animation data into it. More specifically, I want to focus on high quality eastern 2D animation, since that is usually where the motion, acting, timing, compositing, and detail are strongest.

But while studying the architecture and trying to figure out the best way to do this full finetuning run, I realized that LTX is kind of a monster, and building a good and big dataset is much harder than it sounds.

So I'm making this post to ask if anyone wants to help with this process.

The main goal is to create a curated, high-quality dataset for a full finetune of LTX 2.3. From what I'm seeing, the minimum target for this kind of run should be around 5k clips. If the dataset is too small, the learning rate has to be lowered to avoid catastrophic forgetting and damaging the model, but then the model will not learn enough and the full finetune will probably not be very useful.

My current plan is to collect clips from some of the best animated works and build a dataset of around 5k clips, separated into three groups.

1 - Less curated clips
These are clips that are probably good enough, but still need to be reviewed or filtered better.

2 - Highly curated clips
These are the best clips: strong motion, clean composition, useful character acting, good animation timing, good effects, good line consistency, and generally high training value.

3 - Filtered or augmented clips
These would either be clips that pass some kind of quality filter, or high-quality clips modified with AI tools to make them slightly different while still helping the model learn useful motion and animation patterns.

The goal is not just to make the model “look anime.” That is not enough. The real goal is to improve its understanding of 2D animation in general.

Things like timing, spacing, pose changes, limited animation, smear frames, hair and clothing movement, water, smoke, impact effects, character acting, mouth shapes, and stylized camera movement.

With or without help, I'm planning to do this full fine-tuning run and release the result to the open-source community.

But if more people help, whether with GPUs, dataset curation, clip selection, captioning, or testing, the final result will probably be much better for everyone.

Right now, the most useful help would be dataset curation. Finding clips is easy. Finding clips that are actually useful for training is the hard part. (And I was also thinking about adding 2D "sexual" animation, but I haven't decided yet.)

I already have some clips collected (2k), and I also trained an experimental LoRA recently. I still need to organize the files and check which checkpoint is the best before posting it on Civitai.

If anyone is interested in helping build a serious 2D animation fine-tune for LTX 2.3, you can join this Discord: https://discord.gg/MG2yUntvh

u/MerlingDSal — 16 days ago