r/TextToSpeech

Four months in on my local TTS Mac app: what I got wrong about what people actually wanted

Shipped Murmur here around New Year's: a local text-to-speech Mac app that runs fully on-device on Apple Silicon. It's been four months. I wanted to share what I got wrong about what users actually wanted, because it's different from what I built for.

What I thought the app was for:

I built it because I wanted to listen to long articles and drafts while doing other things. The core job in my head was "paste text, get audio, listen while walking." Privacy was a nice-to-have. Voice quality was the product. I shipped with one model (Kokoro), three or four voices, and called it done.

What people actually bought it for:

The first surprise was audiobook production. Not casual listening, but actual indie authors converting full novels into audiobooks. One user sent me a 90,000-word manuscript and asked how to batch the whole thing in one go. At the time that wasn't a feature, it was a hack. I built proper batch queue processing because of that email.

Second surprise was voice cloning. I almost didn't ship it. It felt like a gimmicky demo feature next to the "serious" TTS models. It's now probably the single most-used feature by paying users. Podcasters cloning their own voice for intro/outro consistency. Course creators cloning their voice for lesson narration so they don't have to re-record when they edit scripts. YouTube creators cloning their voice for sponsor reads. I completely underestimated how much people wanted this.

Third surprise was languages. I launched in English and half-heartedly added Spanish in week two. The Spanish launch post outperformed the English one. People were buying for German, Japanese, Korean, and Portuguese: languages where ElevenLabs pricing gets especially brutal because the voice selection is smaller and per-character costs are the same. I now support 25+ languages across six different models because every language request was a real customer.

Fourth surprise was batch processing specifically. The original app was one-script-at-a-time. A course creator told me she was manually re-running the app 40 times a week for her lessons. That changed my priorities for a month. Now you drop a folder of scripts or an ePub and come back to the finished audio.

What I got right:

The no-subscription decision. I thought long and hard about it and came close to launching with a $10/mo tier. I'm glad I didn't. Every buyer email that mentions price says some version of "thank you for not making it a subscription." It's not just a pricing preference; it filters for a different kind of user who actually uses the product and sticks around.

The fully-local architecture. Some people do buy for privacy. But the real win of fully-local wasn't privacy, it was unlimited use. When generation doesn't cost me anything on the backend, it doesn't cost the user anything either, which means they can actually do things like process a whole novel or a 300-lesson course without thinking about caps.

What's in the app now vs. launch:

6 TTS engines (Kokoro, Fish Speech S2 Pro, Qwen3-TTS, and others), picked per content type
860+ community voices plus 10-second voice cloning
25+ languages
Batch processing for folders, ePubs, and queued scripts
Expression and emotion control on supported models ([whisper], [excited], [chuckling] style tags work)
Still one-time purchase ($49), still fully offline, still no telemetry

Requirements haven't changed: macOS 14+, Apple Silicon.

https://www.murmurtts.com

Happy to answer anything about what the last four months of shipping updates have looked like, or which of the six models is right for which use case. Also curious what people here have shipped and what the biggest "I built it for X but users wanted Y" surprise has been. Those stories are my favorite part of following indie Mac dev.

u/tarunyadav9761 — 19 hours ago

▲ 6 r/TextToSpeech

Is there a free online text-to-speech tool that is unlimited, or maybe an Android app?

I don't care about anything really good, I just need something simple to make some audiobooks to listen to while running and exercising outside.

u/Atlandios000 — 6 days ago

▲ 2 r/TextToSpeech

In need of some help

Hi, so I've been looking around for a while for a decent TTS system for long-term use.
I'm currently running Izabela with an API key from ElevenLabs; however, after one evening playing with some friends, half of the month's credits are already used up.

I know there are plenty of free options, but they all get rather dull or annoying to listen to, and I don't want to put my friends through that.

So I'm not looking for some insane voice-actor-level TTS, just something human: a relaxed voice that doesn't get on anyone's nerves. I don't have the budget to upgrade ElevenLabs, and a credit system doesn't seem to be the way to go for me.

As a mute, it's super nice to be able to communicate, as unfortunately a lot of games have either no chat system or a very bad one. Tabbing in and out of Discord is slightly stressful, and a lot of my messages don't even get through because people simply don't hear them.

I hope to find some help here, as I'm really lost looking around for a solution.

u/Shiya_Angel — 6 days ago

▲ 2 r/TextToSpeech

OmniVoice Audio Studio

Hey everyone, I wanted to share a project I've been working on — a fully self-hosted, browser-based audio production tool built on top of the k2-fsa/OmniVoice diffusion model.

https://preview.redd.it/qcjrpgxvkxvg1.png?width=713&format=png&auto=webp&s=46fd5a44efed966e764d748a015dfa3f61c3db87

What it does:

It lets you turn a script into a finished, multi-speaker audio production — think podcast episodes, audiobook chapters, narrated videos — entirely on your own machine. No cloud, no subscriptions, no data leaving your computer.

Key features:

Voice cloning from a 3–10 second reference clip. Up to 4 independent speakers per project
Voice Designer — no reference audio? Describe a voice using attributes (gender, age, accent, pitch, style) and it generates one consistently across all your paragraphs
Timeline editor with waveform display, drag-to-reposition, trim handles, cut tool, ripple editing, and undo/redo
Media track for dropping in music, SFX or ambience alongside your voice content
Smart text parser: paste your script and it splits into paragraphs automatically (and can split further into additional paragraphs if required). Use [Speaker 2]: to switch voices and [pause 2s] to insert timed silences. Drag and drop paragraphs to re-order them automatically. Regenerate a single paragraph or several at once. Fixed or adjustable seed options for each paragraph
Episode save/load — saves everything: text, audio, timeline layout, voice settings, generation params
Pronunciation dictionary — fix proper nouns and technical terms once, applies to all generations
600+ language support out of the box, zero-shot
Statistics - Generation demographics
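
For anyone curious how script tags like `[Speaker 2]:` and `[pause 2s]` could be handled, here is a rough Python sketch. It is not the tool's actual parser. The tag grammar is inferred only from the two examples in the feature list, and `Segment`, `parse_script`, and the rule that a speaker stays active until the next speaker tag are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed tag grammar, inferred from the two examples in the post.
SPEAKER_RE = re.compile(r"^\[Speaker (\d+)\]:\s*")
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


@dataclass
class Segment:
    """Hypothetical unit of work: either text to speak or a timed silence."""
    kind: str            # "speech" or "pause"
    speaker: int = 1
    text: str = ""
    seconds: float = 0.0


def parse_script(script: str) -> List[Segment]:
    """Split a script into paragraphs, then into speech and pause segments."""
    segments: List[Segment] = []
    speaker = 1  # assumption: scripts start with speaker 1
    for paragraph in re.split(r"\n\s*\n", script.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        match = SPEAKER_RE.match(paragraph)
        if match:
            # Assumption: the speaker stays active until the next tag.
            speaker = int(match.group(1))
            paragraph = paragraph[match.end():]
        # With one capture group, re.split alternates: text, seconds, text, ...
        for i, part in enumerate(PAUSE_RE.split(paragraph)):
            if i % 2 == 1:
                segments.append(Segment("pause", speaker, seconds=float(part)))
            elif part.strip():
                segments.append(Segment("speech", speaker, text=part.strip()))
    return segments


if __name__ == "__main__":
    demo = "Welcome to the show.\n\n[Speaker 2]: Thanks. [pause 2s] Let's begin."
    for seg in parse_script(demo):
        print(seg)
```

Each speech segment would then go to the model with that speaker's voice, and each pause segment would become silence of the given length on the timeline.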

Hardware: Runs on NVIDIA GPU, Apple Silicon (MPS), or CPU. Output is 24kHz WAV.

Tech stack: Python/Flask backend, pure HTML/JS frontend (single file, no framework), OmniVoice diffusion model.

The whole thing runs locally — you just open the HTML file in a browser pointed at the Flask server. No install beyond pip install and pulling the model weights.

Happy to answer questions about this implementation, which will be released soon.

u/Eastern_Rock7947 — 5 days ago

▲ 0 r/TextToSpeech

Local TTS on Mac just got a lot more interesting: 600+ languages, voice cloning, no cloud

I've been building OpenVox, a local TTS app for Mac that lets you switch between multiple SOTA models depending on what you need. Just launched v1.4 with a new model called OmniVoice and wanted to get feedback from people who actually know TTS.

Model lineup:

OmniVoice (new) → 600+ languages, expressive, context-aware, voice cloning
Qwen3 → best quality for English, great for cloning
Kokoro → fast, handles long-form well
Chatterbox → more expressive, good for character voices

The multi-model approach has been the most useful thing for me personally. No single model wins everything, so being able to switch per use case without juggling different tools or APIs is nice.

OmniVoice language coverage

This is the part I think this sub will appreciate. Most local TTS solutions are effectively English-first with a few extras. OmniVoice covers Hindi, Arabic, Japanese, French, German, Spanish, Portuguese, Korean, Turkish, Ukrainian, Hebrew, Swahili, Tamil, Polish, Dutch, Greek, Swedish, Indonesian, Czech, Bengali and a lot more, 600+ total. Expressive and context-aware across all of them, not just English.

Other features

Voice cloning, voice design (text description to voice), PDF and EPUB to audio, voice conversion on existing files. Everything runs locally on Apple Silicon, no API calls, no usage limits beyond the free tier.

Pricing

Free tier: 5,000 chars/day, 10 Voice Designs, 3 Voice Clones
Pro: $19.99 one-time, no subscription

App Store: https://apps.apple.com/in/app/openvox-local-voice-ai/id6758789314?mt=12
More Information: https://openvoxai.com/

Curious what this community thinks about the model choices and whether there are gaps you'd want to see filled.

u/ritzynitz — 5 days ago