Four months in on my local TTS Mac app: what I got wrong about what people actually wanted
Shipped Murmur here around New Year's. It's a local text-to-speech Mac app that runs fully on-device on Apple Silicon. Been four months. Wanted to share what I got wrong about what users actually wanted, because it's different from what I built for.
What I thought the app was for:
I built it because I wanted to listen to long articles and drafts while doing other things. The core job in my head was "paste text, get audio, listen while walking." Privacy was a nice-to-have. Voice quality was the product. I shipped with one model (Kokoro), three or four voices, and called it done.
What people actually bought it for:
The first surprise was audiobook production. Not casual listening, but actual indie authors converting full novels into audiobooks. One user sent me a 90,000-word manuscript and asked how to batch the whole thing in one go. At the time that wasn't a feature, it was a hack. I built proper batch queue processing because of that email.
Second surprise was voice cloning. I almost didn't ship it. It felt like a gimmicky demo feature next to the "serious" TTS models. It's now probably the single most-used feature by paying users. Podcasters cloning their own voice for intro/outro consistency. Course creators cloning their voice for lesson narration so they don't have to re-record when they edit scripts. YouTube creators cloning their voice for sponsor reads. I completely underestimated how much people wanted this.
Third surprise was languages. I launched in English and half-heartedly added Spanish in week two. The Spanish launch post outperformed the English one. People are buying for German, Japanese, Korean, Portuguese: languages where ElevenLabs pricing gets especially brutal because the voice selection is smaller and per-character costs are the same. I now support 25+ languages across six different models because every language request was a real customer.
Fourth surprise was batch processing specifically. The original app was one-script-at-a-time. A course creator told me she was manually re-running the app 40 times a week for her lessons. That changed my priorities for a month. Now you drop a folder of scripts or an ePub and come back to the finished audio.
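For anyone curious what "drop a folder and come back later" looks like in shape, here is a minimal Python sketch. It is not Murmur's actual code: `batch_synthesize` is a name made up for illustration, and `synthesize` is a hypothetical stand-in for whatever TTS call you have, assumed to take text and return audio bytes. The `.txt` input and `.wav` output extensions are assumptions too.

```python
from pathlib import Path
from typing import Callable, List


def batch_synthesize(
    folder: Path,
    out_dir: Path,
    synthesize: Callable[[str], bytes],  # hypothetical TTS callable: text in, audio bytes out
) -> List[Path]:
    """Render every .txt script in `folder` to an audio file in `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    # Sort by filename so output order is predictable (lesson-01, lesson-02, ...).
    for script in sorted(folder.glob("*.txt")):
        target = out_dir / (script.stem + ".wav")
        if target.exists():
            # Already rendered on an earlier run, so re-running after a crash is cheap.
            continue
        audio = synthesize(script.read_text(encoding="utf-8"))
        target.write_bytes(audio)
        written.append(target)
    return written
```

The skip-if-exists check is the part that matters for long jobs: a 40-lesson folder that dies on lesson 31 picks up where it stopped instead of starting over.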
What I got right:
The no-subscription decision. I thought long and hard about it and came close to launching with a $10/mo tier. I'm glad I didn't. Every buyer email that mentions price says some version of "thank you for not making it a subscription." It's not just a pricing preference. It filters for a different kind of user who actually uses the product and sticks around.
The fully-local architecture. Some people do buy for privacy. But the real win of fully-local wasn't privacy, it was unlimited use. When generation doesn't cost me anything on the backend, it doesn't cost the user anything either, which means they can actually do things like process a whole novel or a 300-lesson course without thinking about caps.
What's in the app now vs. launch:
- 6 TTS engines (Kokoro, Fish Speech S2 Pro, Qwen3-TTS, and others), picked per content type
- 860+ community voices plus 10-second voice cloning
- 25+ languages
- Batch processing for folders, ePubs, and queued scripts
- Expression and emotion control on supported models ([whisper], [excited], [chuckling] style tags work)
- Still one-time purchase ($49), still fully offline, still no telemetry
Requirements haven't changed: macOS 14+, Apple Silicon.
Happy to answer anything about what the last four months of shipping updates have looked like, or which of the six models is right for which use case. Also curious what people here have shipped and what the biggest "I built it for X but users wanted Y" surprise has been. Those stories are my favorite part of following indie Mac dev.