I’ve seen a lot of discussion about local TTS, mostly driven by privacy and cost savings. With recent advances in open-source models, offline TTS with excellent sound quality is now a reality.
I decided to build a native Apple Silicon audiobook generator that can turn text into a 10-hour audio file in a single run. Turning an open-source script into a production-ready Swift app took a lot of effort, so I’d love to share my experience and the technical hurdles I hit, and get some feedback from this community.
Here are my main takeaways from building it:
- Choosing the Model (Kokoro TTS)
I chose Kokoro TTS (82M parameters) because of its sound quality in resource-constrained environments. There are more expressive models out there, but they generally need a GPU. They can run on a CPU, but painfully slowly. Apple Silicon GPUs are not as powerful as NVIDIA GPUs. In my experience, for this kind of workload they land somewhere between a standard CPU and an NVIDIA GPU.
- Resource Management: Why I used ONNX instead of MLX
Even a lightweight model like Kokoro demands serious resources. After generating audio for 10-30 minutes, a MacBook’s fans will start to kick in.
I initially looked at MLX (Apple's ML framework for Apple Silicon), but I found it uses memory very aggressively. My goal was to make audiobook generation a background job, meaning you can do your regular work on your Mac while a 10-hour book generates in the background.
Instead of MLX, I opted for ONNX Runtime targeting Apple Silicon. I deliberately capped the CPU threads it can use, so audio still generates at 7x to 12x real time (one hour of audio in roughly 5 to 9 minutes) while using only a small portion of the total CPU and memory. If you are busy using your Mac for heavy tasks, the background audio generation simply slows down to get out of your way.
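For anyone curious what capping threads looks like in ONNX Runtime, here is a minimal sketch using its Python API. The app itself is Swift, so this is an illustration and not the shipped code. The 25% core fraction, the `pick_thread_count` helper, and the `kokoro.onnx` path are all assumptions made up for the example.

```python
import os


def pick_thread_count(total_cores: int, fraction: float = 0.25) -> int:
    """Return how many threads to give inference: a fraction of the cores, at least 1."""
    if total_cores < 1:
        raise ValueError("total_cores must be >= 1")
    return max(1, int(total_cores * fraction))


def make_session(model_path: str, threads: int):
    """Build an ONNX Runtime CPU session limited to `threads` intra-op threads."""
    import onnxruntime as ort  # imported lazily so the helper above works without it

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = threads  # threads used inside a single operator
    opts.inter_op_num_threads = 1        # do not run operators in parallel
    opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )


if __name__ == "__main__":
    threads = pick_thread_count(os.cpu_count() or 1)
    print(f"capping inference at {threads} thread(s)")
    # session = make_session("kokoro.onnx", threads)  # hypothetical model path
```

A small, fixed thread budget trades peak speed for predictability: the rest of the cores stay free for whatever else you are doing on the machine.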
- Robustness for Long-Running Tasks
Even at that speed, generating 10 hours of audio takes about 1 to 2 hours of compute time, so I needed a system robust enough to run that long. I built a queuing and checkpoint system that allows for pausing and resuming. You can literally close your laptop halfway through generating a book, open it the next day, and it will pick up where it left off.
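The general idea behind chunk-level checkpointing can be sketched in a few lines of Python. This is not the app's actual implementation, just one simple way to do it, assuming the book is already split into text chunks and each chunk is rendered to its own audio file. The names `run_job`, `render`, and `should_pause` are hypothetical.

```python
import json
import os
import tempfile


def load_next_index(checkpoint_path: str) -> int:
    """Return the index of the first chunk that still needs rendering (0 if no checkpoint)."""
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            return int(json.load(f)["next_index"])
    except FileNotFoundError:
        return 0


def save_next_index(checkpoint_path: str, next_index: int) -> None:
    """Write the checkpoint via a temp file so a crash never leaves it half-written."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, checkpoint_path)  # atomic rename within one directory


def run_job(chunks, render, checkpoint_path, should_pause=lambda: False) -> bool:
    """Render the remaining chunks in order; return True once the whole book is done."""
    start = load_next_index(checkpoint_path)
    for i in range(start, len(chunks)):
        if should_pause():
            return False      # everything before chunk i is already checkpointed
        render(i, chunks[i])  # e.g. synthesize chunk i and write it to its own file
        save_next_index(checkpoint_path, i + 1)
    return True
```

Because the checkpoint only advances after a chunk is fully rendered, an interruption costs at most one chunk of rework when the job resumes.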
- Quality Control & The "AAA" Problem
Open-source libraries are a great help, but when you sit down and actually listen to long-form generated audio, you notice some very distracting pronunciation mistakes. I had to fix a lot of bugs to make the code production-ready.
A major issue with TTS engines is acronyms. For example, the engine will read "AAA" as "Ay-Ay-Ay", which sounds ridiculous in an audiobook. To fix this, I built a custom pronunciation editor. You can tell the engine to read "AAA" as "Triple A". I also implemented multi-speaker support and filters for unwanted text.
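At its core, a pronunciation override is a text substitution applied before the text reaches the TTS engine. Here is a minimal Python sketch of that idea, assuming a plain dictionary of overrides and whole-word, case-sensitive matching. The function name `apply_pronunciations` is made up for illustration, and the real editor may well work differently.

```python
import re
from typing import Dict


def apply_pronunciations(text: str, overrides: Dict[str, str]) -> str:
    """Replace whole-word matches of each key with its spoken form."""
    if not overrides:
        return text
    # Longest keys first so "AAA" wins over a shorter key such as "AA".
    keys = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: overrides[m.group(0)], text)


if __name__ == "__main__":
    print(apply_pronunciations("The AAA title shipped.", {"AAA": "Triple A"}))
    # prints "The Triple A title shipped."
```

The word-boundary checks matter: without them, "AAAA batteries" would be partially rewritten as well.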
- The Swift Struggle
I chose to write all the code in native Swift for performance, but the Swift ecosystem lacks the ML libraries that Python has. Python’s massive library ecosystem gives cloud TTS an edge. To get everything working locally and natively, I ended up manually porting some Python libraries to Swift.
- Why not an iOS App? (The Thermal Bottleneck)
My friends asked why I didn't build this for the iPhone. I actually started there, but I hit thermal and battery drain bottlenecks.
iPhones have three compute units: the CPU, the GPU, and the Neural Engine (NPU). The Neural Engine runs "cold" and is incredibly battery efficient. I've heard Apple's built-in iOS voices run exclusively on the NPU. But the NPU supports only a limited set of ML operations, and my guess is that this is why the built-in voice sounds a bit robotic.
Getting a model like Kokoro to run entirely on the NPU is probably not doable. Running Kokoro with MLX on iOS is possible, but I found that MLX also uses the CPU, so inference ended up split between the GPU and the CPU. That mix seemed to perform worse.
Running it purely on the CPU generated a lot of heat and drained the battery, which makes it a poor fit for a 10-hour generation task. This matches my previous experience with edge devices: the battery is the bottleneck.
If we can get a lightweight model running on the NPU, unlimited offline TTS might work on the iPhone.
If anyone wants to poke around and try it, the app is called Aura Reader. I put it on the Mac App Store and at www.gushilabs.com. There’s a free version available, so it’s easy to experiment with. This is my first time releasing an app, so I’m especially interested in whether this is actually useful in real workflows. I would love any honest feedback.