r/StableDiffusion

🔥 Hot ▲ 275 r/StableDiffusion

I had fun testing out LTX's lipsync ability. Full open source Z-Image -> LTX-2.3 -> WanAnimate semi-automated workflow. [explicit music]

u/luckyyirish — 6 hours ago
[15 images — Joy-Image-Edit released]
🔥 Hot ▲ 164 r/StableDiffusion

Joy-Image-Edit released

Model: https://huggingface.co/jdopensource/JoyAI-Image-Edit
Paper: https://joyai-image.s3.cn-north-1.jdcloud-oss.com/JoyAI-Image.pdf
GitHub: https://github.com/jd-opensource/JoyAI-Image

JoyAI-Image-Edit is a multimodal foundation model specialized in instruction-guided image editing. It enables precise and controllable edits by leveraging strong spatial understanding, including scene parsing, relational grounding, and instruction decomposition, allowing complex modifications to be applied accurately to specified regions.

JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing. Stronger spatial understanding improves grounded generation and controllable editing through better scene parsing, relational grounding, and instruction decomposition, while generative transformations such as viewpoint changes provide complementary evidence for spatial reasoning.

u/AgeNo5351 — 9 hours ago
🔥 Hot ▲ 111 r/StableDiffusion+1 crossposts

ComfyUI-OmniVoice-TTS

>OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting more than 600 languages. Built on a novel diffusion language model architecture, it generates high-quality speech with superior inference speed, supporting voice cloning and voice design.

GitHub: https://github.com/k2-fsa/OmniVoice

HuggingFace: https://huggingface.co/k2-fsa/OmniVoice

ComfyUI: https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS

u/fruesome — 8 hours ago
[11 images — Z Image Base vs Z Image Turbo T2I Comparison with Prompts]

Z Image Base vs Z Image Turbo T2I Comparison with Prompts

I generated images using both models with the same prompts, using the ComfyUI template workflows. I hope this helps you choose the right model for your needs.

Base Model Settings:

  • width/height: 1024x1024
  • steps: 30
  • cfg: 3.5
  • denoise: 1
  • seed: randomize

Turbo Model Settings:

  • width/height: 1024x1024
  • steps: 8
  • seed: randomize
u/AssociateDry2412 — 2 hours ago
Created ComfyUI nodes to work with new Netflix Void model [beta]

Hello

When I heard that Netflix released the new Void model for outpainting, I decided to create some basic Comfy nodes to support it. The nodes are already available in Comfy Manager ("AP Netflix VOID").

I didn't have enough time to play with more frames. This is a first working beta version, so feel free to play with it, but don't expect much!

The example workflow did erase the cup, but the effect is not really satisfying...

Repo: https://github.com/adampolczynski/AP_Netflix_VOID

Workflow and examples: https://github.com/adampolczynski/AP_Netflix_VOID/tree/main/examples

Comfy Registry: https://registry.comfy.org/publishers/adampolczynski/nodes/ap-netflix-void

u/Huge-Refuse-2135 — 3 hours ago

Is there an AI model that can fully isolate clean speech from noisy recordings?

Hey everyone,

I’ve been exploring different open-source AI audio tools and was curious if there’s an open-source model or workflow that can isolate a voice and make it sound professional.

Like:

  1. Remove background noise from almost any audio
  2. Clean up ambient sounds (street noise, room tone, etc.)
  3. Eliminate mic feedback or hiss
  4. Output crisp, clear speech suitable for film, podcasts, or interviews

Also curious: what are people using these days?

u/QikoG35 — 5 hours ago

Synesthesia AI Video Director — Vocal Shot Chain update.

This week I've been working on adding long takes to Synesthesia by passing the last frame of one vocal shot in as the first frame of the next. This was quite a bit more complicated than it seemed at first. The example video posted here, from my song "Settle for Clay", has two issues that are now fixed in the most recent version of Synesthesia. The first issue was that Claude decided not to grab the actual last frame, instead using "-sseof -0.5", causing the skip you see here. After that was fixed, we had a duplicate frame, which caused a pause instead of a skip. To fix that, we had to render a full extra second for the vocal shot (an LTX-desktop limitation), roll back to 1 frame AFTER the last frame, and pass that into the next shot to avoid the duplicate frame.

https://github.com/RowanUnderwood/Synesthesia-AI-Video-Director

First post:

First Update:

u/jacobpederson — 8 hours ago

SDXL Node Merger - A new method for merging models. OPEN SOURCE

Hey everyone! It's been a while.

I'm excited to share a tool I've been working on — SDXL Node Merger.

It's a free, open-source, node-based model merging tool designed specifically for SDXL. Think ComfyUI, but for merging models instead of generating images.

Why another merger?

Most merging tools are either CLI-based or have very basic UIs. I wanted something that lets me visually design complex merge recipes — and more importantly, batch multiple merges at once. Set up 10 different merge configs, hit Execute, grab a coffee, come back to 10 finished models. No more babysitting each merge one by one.

Key Features

🔗 Visual Node Editor — Drag, drop, and connect nodes with beautiful animated Bezier curves. Build anything from simple A+B merges to complex multi-model chains.

🧠 11 Merge Algorithms — Weighted Sum, Add Difference, TIES, DARE, SLERP, Similarity Merge, and more. All with Merge Block Weighted (MBW) support for per-block control.

⚡ Low VRAM Mode — Streams tensors one by one, so you can merge on GPUs with as little as 4GB VRAM.
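
The low-VRAM idea is easy to sketch: merge one tensor at a time so only a single pair of weights is ever resident. Here's a toy in-memory version (the real tool presumably streams from model files on disk; `weighted_sum_stream` is a hypothetical name, not the tool's API):

```python
import numpy as np

def weighted_sum_stream(model_a, model_b, alpha=0.5):
    """Weighted-sum merge, one tensor at a time -- only one pair of
    tensors is 'live' at any moment, which is what keeps VRAM low."""
    merged = {}
    for key in model_a:
        merged[key] = alpha * model_a[key] + (1 - alpha) * model_b[key]
    return merged

a = {"w": np.array([0.0, 2.0])}
b = {"w": np.array([2.0, 0.0])}
print(weighted_sum_stream(a, b, alpha=0.5)["w"])  # [1. 1.]
```

The same loop structure extends naturally to per-block (MBW) weighting: pick `alpha` per key based on which UNet block the key belongs to.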

🎨 4 Stunning Themes — Midnight, Aurora, Ember, Frost. Because merging should look good too.

📦 Batch Processing — Multiple Save nodes = multiple output models in one run. This is a game changer for testing merge ratios.

🚀 RTX 50-series ready — Built with CUDA 12.x and the latest PyTorch.

Setup

Just clone the repo, run start.bat, and it handles everything — venv, PyTorch, dependencies. Opens right in your browser.

Would love to hear your feedback and feature requests. Happy merging! 🎉

This isn't a paid service or tool, so I hope I haven't broken any rules. 🤔😅

u/anonimgeronimo — 22 hours ago
[3 images — Walkthrough: Training a Keep/Trash Classifier on CLIP & DINOv2 Embeddings for SD Coloring Pages]

Walkthrough: Training a Keep/Trash Classifier on CLIP & DINOv2 Embeddings for SD Coloring Pages

TL;DR: I run a pipeline that generates coloring-page line art with Stable Diffusion. Manually rating thousands of images was becoming a bottleneck, so I trained a simple logistic-regression classifier on CLIP and DINOv2 embeddings to auto-trash the obvious failures. Tested six classifiers across three embedding models and two feature sets. Result: CLIP-based semantic embeddings beat DINOv2's structural embeddings for quality classification, and a dead-simple linear model gets the job done. In the first real deployment, 55% of images were safely auto-trashed with a conservative threshold.


The Problem: Curation at Scale

I generate coloring-page line art using Stable Diffusion. Black outlines on white background, the kind you'd find in an adult coloring book. The pipeline produces hundreds of images per batch across different models and prompts. Some come out great. Many don't: wrong anatomy, broken lines, weird artifacts, subjects that don't match the prompt at all.

Every image goes through a two-stage curation process. First, a binary keep/trash decision: does this image meet a minimum quality bar? Then the keepers enter Elo-style duels against each other to surface the best work. The first stage is the bottleneck. It's not hard, but it's tedious: you're looking at hundreds of images and most of them are clearly trash.
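
For reference, an Elo-style duel update looks like this (the standard formula with the usual chess defaults; the post doesn't specify its exact K-factor or scale):

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo rating update after one duel between images A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated images; A wins the duel.
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
```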

After rating about 3,400 coloring-page images by hand (roughly 18% kept, 82% trashed), I figured there was enough labeled data to let a classifier handle the obvious cases. The goal wasn't to replace human judgment, it was to skip the images that no human would keep.

Why Embeddings?

Instead of training a CNN from scratch or fine-tuning a large model, I went with a much simpler approach: extract embeddings from pretrained vision models, then train a linear classifier on top.

Embeddings are fixed-size vector representations that capture what a model "understands" about an image. A 1024-dimensional vector might sound abstract, but it encodes rich information (semantic content, composition, texture, style) depending on which model produced it. The key insight is that if two images are "similar" according to the model, their embeddings will be close together in vector space.

This means you can take a pretrained model that has never seen a coloring page in its life, extract embeddings for your dataset, and train a simple classifier on top. No fine-tuning, no GPU-intensive training loop, just scikit-learn.
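
A minimal sketch of that recipe, with random vectors standing in for real CLIP/DINOv2 embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for precomputed embeddings: 200 images x 1024 dims.
# In the real pipeline these would come from OpenCLIP or DINOv2.
X = rng.normal(size=(200, 1024))
w = rng.normal(size=1024)
y = (X @ w > 0).astype(int)  # synthetic keep/trash labels

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

The expensive part is the one-time embedding pass; the classifier itself fits in seconds on CPU.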

I tested two families of embedding models:

OpenCLIP ViT-H/14, trained on image-text pairs, so it understands images in terms of semantic meaning. It knows "what this image is about." When it looks at a coloring page of a cat, it encodes the concept of cat, the style of line art, the composition. This is the same architecture behind CLIP-based prompt engineering, the model that connects text and images in Stable Diffusion.

DINOv2 (ViT-L/14 and ViT-g/14), a self-supervised vision model from Meta, trained purely on images with no text. It captures visual structure: poses, shapes, textures, spatial layout. It knows "what this image looks like" but has no concept of what the subject is called. I tested two variants: ViT-L/14 (300M parameters, 1024-dim) and ViT-g/14 (1.1B parameters, 1536-dim).

The question was: for separating good coloring pages from bad ones, does "what it's about" (CLIP) or "what it looks like" (DINOv2) matter more?

The Dataset

The training cohort consisted of 3,441 coloring-page images from my pipeline:

  • 625 kept (18.2%)
  • 2,816 trashed (81.8%)

All images were black-and-white line art at 1024x1024, generated across multiple SD models and prompt configurations. The keep/trash labels come from my own manual ratings over several months, same person, same quality bar throughout.

The class imbalance is real but expected. Most SD generations don't meet a quality bar, especially for something as specific as clean line art. All classifiers were trained with balanced class weights to account for this.

One note on cross-validation: in an SD pipeline, images can derive from one another through img2img and create families of siblings that look very similar. I used grouped cross-validation to make sure siblings never appear in both the training and test folds. Without this, metrics would be inflated because the model could "recognize" a family it already saw during training.
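
Scikit-learn's `GroupKFold` handles this directly; here's a tiny demonstration that siblings never straddle a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 images in 4 img2img "families" (two siblings each).
X = np.arange(16).reshape(8, 2)
y = np.array([1, 1, 0, 0, 1, 0, 0, 0])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # family id per image

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # No family ever appears on both sides of a split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```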

Method

The approach is deliberately simple: logistic regression on embeddings. No neural network training, no hyperparameter sweeps, no ensemble methods. I wanted to see how far a linear decision boundary could go before adding complexity.

I embedded the full corpus (17K images across all types) with each of the three models, then trained classifiers on two feature sets:

  • Raw: Just the embedding vector (1024-dim for CLIP and DINOv2-L, 1536-dim for DINOv2-g). Feed the vector directly to logistic regression.
  • Hybrid: The raw embedding concatenated with a handful of engineered features. For instance, the cosine distance between a generated image and the original image it was derived from (how far did it "drift"?), plus some global image statistics. The idea is that raw embeddings capture "what the image is" while the engineered features capture "how it relates to other images in the pipeline."
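
The drift feature is just cosine distance in embedding space; here's a sketch of building one hybrid vector (toy 3-dim embeddings for readability):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: how far an img2img result 'drifted'
    from its source image in embedding space."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb = np.array([1.0, 0.0, 0.0])   # generated image embedding
src = np.array([0.0, 1.0, 0.0])   # its img2img source
hybrid = np.concatenate([emb, [cosine_distance(emb, src)]])
print(hybrid)  # raw embedding plus one engineered scalar
```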

That gives six classifiers total: three models x two feature sets. All trained with scikit-learn's LogisticRegression with balanced class weights and 5-fold grouped cross-validation.
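
Putting the protocol together (balanced logistic regression, 5-fold grouped CV, average precision) takes only a few lines. Synthetic data again; only the scaffolding matches the post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 32))             # stand-in embeddings
y = (rng.random(300) < 0.18).astype(int)   # ~18% "keep", as in the post
groups = rng.integers(0, 60, size=300)     # img2img family ids

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
scores = cross_val_score(clf, X, y, groups=groups,
                         cv=GroupKFold(n_splits=5),
                         scoring="average_precision")
print(f"mean AP: {scores.mean():.2f}")
```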

Results

I used average precision as the primary metric (better than accuracy for imbalanced binary classification). The best classifier, OpenCLIP hybrid, scored 0.47 average precision with 0.74 balanced accuracy. The weakest, DINOv2 ViT-L/14 raw, scored 0.40. For reference, random baseline average precision for this class distribution is 0.18, so even the weakest model is more than 2x above chance.
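
That random baseline is just the positive rate, which is easy to verify with `average_precision_score`:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.18).astype(int)  # 18% positives
random_ranker = rng.random(10_000)           # uninformative scores

ap = average_precision_score(y, random_ranker)
print(f"random-baseline AP: {ap:.2f}")  # close to the 0.18 positive rate
```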

A few things stand out:

Semantic beats structural. OpenCLIP wins outright, both in raw and hybrid configurations. For quality classification, "what the image is about" matters more than "what the image looks like." This makes intuitive sense: trash images often look structurally valid (clean lines, good composition) but have semantic defects. Wrong anatomy, extra limbs, a subject that doesn't match the prompt. CLIP catches those; DINOv2 doesn't.

Hybrid always beats raw. For every model, adding the engineered features on top of raw embeddings improved both metrics. The extra signal from "how this image relates to its neighbors" is real and consistent, regardless of which embedding space you're in.

Bigger DINOv2 helps, but not enough. The ViT-g/14 variant (1.1B params, 1536-dim) beats ViT-L/14 (300M params, 1024-dim) by about 2-3 percentage points. But it's 3.7x larger, 50% more embedding computation, and still loses to CLIP. Diminishing returns.

DINOv2-g raw ≈ CLIP raw. Interestingly, the largest DINOv2 model with raw features (0.4346) nearly matches CLIP raw (0.4363). The structural space at 1536 dimensions approaches semantic-space quality for this task, but only when you throw 1.1B parameters at it.

What This Means in Practice

The numbers above are cross-validation metrics on the training cohort. But the actual question is: can this save time in production?

I ran the first real deployment on 616 unseen coloring pages from 35 new series. Using a conservative threshold, tuned so that fewer than 5 keepers would be lost on the training set, the OpenCLIP classifier auto-trashed 338 out of 616 images (55%). That's more than half the corpus handled without any human review.
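
One way to tune such a threshold is to take the score of the (N+1)-th worst true keeper on the training set. This is a sketch under that assumption; `conservative_threshold` is a hypothetical helper, not the post's actual code:

```python
import numpy as np

def conservative_threshold(scores, labels, max_keepers_lost=5):
    """Highest cutoff that auto-trashes at most `max_keepers_lost`
    true keepers (label 1) on the training set."""
    keeper_scores = np.sort(scores[labels == 1])
    # Everything scoring below the cutoff is auto-trashed; with
    # distinct scores, exactly `max_keepers_lost` keepers fall below.
    return keeper_scores[max_keepers_lost]

rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.2).astype(int)        # ~20% keepers
scores = np.where(labels == 1,
                  0.2 + 0.8 * rng.random(1000),      # keepers score higher
                  0.5 * rng.random(1000))            # trash scores lower
t = conservative_threshold(scores, labels, max_keepers_lost=5)
```

At deploy time, everything scoring below `t` is auto-trashed and the rest goes to human review.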

The score separation was clean: auto-trashed images averaged a score of 0.07 (on a 0-1 scale), while surviving images averaged 0.48. There's a wide gap between the worst survivor and the best trashed image, which means the threshold isn't sitting on a knife edge.

I also ran DINOv2 classifiers on the same batch for comparison. DINOv2 ViT-L/14 caught only 4 additional images that CLIP missed, all borderline cases. DINOv2 ViT-g/14 added zero on top of that. In production, OpenCLIP alone is sufficient.

One interesting finding: the training cohort was all standard coloring pages, but this test batch included a completely different content style (furry themed art) that the classifier had never seen. It handled it fine, every auto-trashed image clearly deserved trashing. The classifier appears to have learned quality signals (line clarity, composition, anatomical errors) rather than content-specific features.

The classifier doesn't replace curation. It handles the obvious bottom of the barrel so I can spend my rating time on the images that actually need human judgment.

Takeaways

If you're running any kind of SD generation pipeline at scale and doing manual QA, here are the practical lessons:

Your labeled data is your moat. I had 3,400 labeled images from months of manual rating, and that's what made this work. The classifier itself is trivial, logistic regression, a few lines of scikit-learn. The hard part was the consistent labeling. If you're already doing manual curation, you're sitting on training data.

Start simple. A linear classifier on pretrained embeddings is hard to beat for the effort involved. No training loop, no GPU for inference (just for the initial embedding pass), no hyperparameter tuning. I didn't try random forests or neural networks because the linear model already solves the problem. Add complexity when simple stops working.

CLIP embeddings are surprisingly good at quality classification. Even though CLIP was designed for image-text matching, its semantic space captures quality signals that a structural model like DINOv2 misses. If you're only going to embed with one model, make it CLIP.

Don't skip grouped cross-validation. If your pipeline produces families of related images, random train/test splits will give you misleading metrics. Group by source image to get honest numbers.

There are existing tools for SD QA and filtering, and some of them are quite good. But building your own classifier on your own labels means it learns your quality bar, not someone else's. And honestly, it was more fun to build it myself.

What's Next

This is the first post in a short series:

  • Post 2: Using the same embeddings for near-duplicate detection, finding images that are "too similar" and cleaning up redundancy in the pipeline.
  • Post 3: The prompt compiler, a tool that takes a prose description like "a serene Japanese garden at sunset" and decomposes it into optimized, weighted tokens directly in the model's embedding space. This is the ambitious one.

If you have questions about the methodology or want to try this on your own pipeline, happy to discuss in the comments.

u/PerformanceNo1730 — 13 hours ago

GitHub - jd-opensource/JoyAI-Image: JoyAI-Image is the unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing.

Haven't tested it myself because I lack the brainpower to run it. Seems interesting enough and would be cool to see in comfyui

github.com
u/More-Technician-8406 — 16 hours ago

Your Opinion on Zimage - loss of interest or bar too high?

Just curious what your opinion is on the state of Zimage Turbo or Base. A year ago, when a new AI model dropped, people would flock to it and the content on places like Civit or Tensor would blast off. Looking back on models like Flux, Pony, and SDXL, things escalated quickly in terms of new checkpoints and LoRAs; it seemed like every day you went online you could find new releases.

When I see polls here or in other discussions, Zimage usually ranks number one as people's favorite image generator, and yet there seems to be very little coming out. So I was curious, from your perspective, why that may be. People moving on to video? Losing interest in image gens? Or is the bar for training too high, cutting out a lot more people than, say, SDXL or Flux did?

Keep in mind this is just a question. I don't have experience training checkpoints, only LoRAs, so I'm not as skilled as many of you; I'm just curious how people far smarter than I am feel about the slowdown.

u/GRCphotography — 23 hours ago
▲ 2 r/StableDiffusion+1 crossposts

Which Version of LTX2.3 are You Using?

Hi,

I'd like to use LTX2.3, but I am not sure which models to use. I'd prefer a base LTX2 model + LTX2.3 LoRA, as that gives me more flexibility to control LoRA strength, but I am not sure if that's possible.

What are your recommendations? Any tips? Could you please provide the links to the models you are actually using?

Thanks.

u/Iory1998 — 7 hours ago