u/Famous-Sport7862

LTX 2.3 audio as standalone speech model.
▲ 45 r/comfyui+1 crossposts

@wildmindai posted about this new model on X. Has anyone here tried it yet?

LTX 2.3 audio as standalone speech model.

Emotional TTS with Scenema Audio.

- Zero-shot expressive voice cloning, speech gen

- 8-step distilled with Gemma 3 12B text encoding

- stage directions via <action> tags

- runs at 1.5x real-time on RTX 4090

- fits in 16GB VRAM

- 13 languages, 48kHz stereo output

it also gens matching environment sounds
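
For a rough sense of what the speed and output numbers above would mean in practice, here is a back-of-envelope sketch. It assumes "1.5x real-time" means 1.5 seconds of audio generated per second of compute (the post doesn't spell this out), and it assumes 16-bit PCM for the size estimate, which is also not stated. The function names are made up for illustration and are not part of the model's API.

```python
def generation_seconds(audio_seconds: float, realtime_factor: float = 1.5) -> float:
    """Estimated wall-clock time to generate a clip.

    Assumption: realtime_factor = seconds of audio produced per second of compute.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def pcm_bytes(
    audio_seconds: float,
    sample_rate: int = 48_000,   # 48kHz, from the post
    channels: int = 2,           # stereo, from the post
    bytes_per_sample: int = 2,   # assumption: 16-bit PCM
) -> int:
    """Uncompressed size of the output audio in bytes."""
    return int(audio_seconds * sample_rate * channels * bytes_per_sample)


if __name__ == "__main__":
    clip = 60.0  # one minute of speech
    print(f"generation time: {generation_seconds(clip):.1f} s")
    print(f"raw PCM size: {pcm_bytes(clip) / 1_000_000:.2f} MB")
```

Under those assumptions, a one-minute clip would take about 40 seconds to generate on the 4090 and come out at roughly 11.5 MB uncompressed.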

https://huggingface.co/ScenemaAI/scenema-audio

u/Famous-Sport7862 — 3 days ago