https://preview.redd.it/7tdi4fa3k52h1.png?width=1828&format=png&auto=webp&s=9b35d7acf7b376c4171e33e0eafdb91b5ed5e1fe

I've been working on this for a few months and it's finally in a state where I think it might be useful to someone other than me. Sharing it here in case you're trying to train character LoRAs on FLUX-2 and you're tired of guessing.

The premise: every time I train a character LoRA, I end up stuck on two questions.

Is my dataset actually balanced and identity-consistent, or am I just hoping?
Once trained, which step actually holds likeness across the whole prompt sweep — not just the one flattering close-up?

GridLoraTester answers both with numbers from face-recognition scores. It's split in two surfaces; you can use either independently.

Dataset curation

Face recognition (ArcFace via InsightFace buffalo_l) gives every photo a similarity score against a per-dataset centroid (mean of all detected faces). Off-identity photos surface immediately.
Pose × framing classifier (front / ¾ / profile × close-up / medium / wide / extreme). A dataset-health checklist tells you what's balanced and what's under-represented vs published portrait-dataset targets.
Prune candidates when you're over a max size — most-redundant photos within over-represented buckets, ranked by k=3 nearest in-bucket cosine. Soft delete, fully reversible.
External-photo suggestions — link Immich / Google Photos / a local folder, and the engine mines that library for photos that fit the dataset's identity AND fill an under-rep bucket. Pose-tempered scoring so profile shots aren't penalised. Dedup runs both vs the existing dataset AND across the suggestions themselves, so the same photo on Immich + Google Photos collapses to one suggestion.
BlockHash 256-bit near-duplicate detection (10-bit Hamming threshold) underneath all of the above.

Grid testing

One row per checkpoint × one column per prompt, same seed across the grid for fair comparison.
Every cell scored against the dataset centroid: green ≥ 0.50 / amber ≥ 0.35 / red < 0.35.
Per-prompt aspect ratio via [3:4] / [16:9] prefixes; resolution comes from a single MP budget. [trigger] placeholder substituted automatically.
Run history per test — flip between runs to compare quant changes, training continuation, or rescore a past run against an updated centroid without regenerating anything.
Score-vs-step graph (median / p20 / max). Useful for picking the checkpoint where p20 (consistency) catches up with median (peak) instead of just chasing the spikes.

Tech bits, in case you care

FLUX-2 Klein via diffusers; FP8 / FP8 dynamic / bf16 / INT8 ConvRot quant paths. INT8 ConvRot uses Hadamard rotation + torch._int_mm cuBLASLt → ~2× faster denoise than FP8 weight-only on Ampere (3090/3080), same VRAM (~9 GB transformer for Klein 9B). LoRA bake-in via Tensor.data.copy_() preserves Parameter identity so torch.compile survives swaps.
Prompt-embedding cache in SQLite. After encoding, Qwen3 text encoder is fully unloaded (del + gc + empty_cache()) so it doesn't squat VRAM during the denoise + VAE.
Per-shape batching in the grid loop — mixed AR rows don't crash batched inference; prompts grouped by (w, h) before each pipe() call.
Dashboard is SvelteKit + better-sqlite3 in WAL mode. Python writes back to the same DB the dashboard reads — no IPC marshalling, just shared SQLite.
Idle-TTL on the face worker frees the ORT BFC arena (~5–6 GB) when not in use; lazy-respawn on next request.

What it isn't

Not a trainer. It eats the LoRA folder your trainer (ai-toolkit, etc.) already produces.
FLUX-2 only right now. The pipeline-load code is reasonably isolated; FLUX-1 / SD3 / Wan2.2 aren't out of the question if there's demand.
NVIDIA + ≥ 24 GB VRAM. Linux is the tested path; the dashboard runs on macOS/Windows but the inference side wants Linux + CUDA.

License

Source-available under PolyForm Noncommercial 1.0.0 — free for personal / hobby / research / education. Commercial use is a separate paid license (details in LICENSE). MIT was too permissive for the niche; PolyForm cleanly splits "free for everyone learning" from "paid if you're shipping a product on top".

Repo

→ https://github.com/Mandrakia/GridLoraTester

Bug reports and PRs welcome. Particularly interested in feedback on the suggestion engine's bucket-targeting heuristic and the grid-test sort UX — those are the two surfaces where my own preferences leak into the defaults most.

Screenshots

Dataset list Dataset details Dataset stats Dataset edit : Prune Dataset edit : Suggestions Test setup Test grid result Test graphi result

u/Mandrakia

Character lora tool : GridLoraTester