u/tarunyadav9761

My full faceless YouTube pipeline, $0 in monthly subscriptions. Including how I handle music (the part that was killing my margins).
▲ 3 r/aitubers+2 crossposts

Been running a faceless YouTube workflow for about 6 months and finally got my monthly spend on voice and music tools to zero. Not because I cheaped out, but because I moved every recurring audio cost onto local tools that run on my Mac. Sharing the actual pipeline because the "here is my AI workflow" posts I read a year ago saved me a lot of learning time, and I want to pay that forward.

Context on the channel. Mid-five-figure subscriber faceless channel, explainer-style videos in the 8-15 minute range. AI scripting, AI voiceover, AI visuals, heavy editing. The kind of thing that eats through tool subscriptions if you are not careful.

The pipeline, step by step, with tools and costs:

  1. Script generation. Claude and ChatGPT, alternating based on what the video needs. Paying for both at Pro tier. This is the one cost I still pay because it is where the actual quality ceiling is.
  2. Script polish and fact-check. Manual, no tool. The AI first drafts are never publishable without heavy editing. This takes me 2 to 4 hours per script.
  3. Voiceover generation. I used to pay ElevenLabs at $99 a month for the Creator tier. At 8-12 videos a month with long scripts, I was blowing through character limits constantly and upgrading.

I moved this to Murmur, a local TTS app that runs on Apple Silicon. Fully on-device, no monthly cost, no per-character pricing. Voice quality is behind ElevenLabs v3 for character voice work, but for faceless-channel narration with a single consistent voice it is more than good enough. Saved me about $1200 a year. Disclosure: I also built Murmur, which is how I ended up down this local-first path in the first place.

  4. Visual generation. Mix of stock (Pexels, free), Midjourney for specific scenes, and Runway for motion. Midjourney at $30 a month is the one visual cost I cannot replace with anything local yet.

  5. B-roll. AI-generated plus stock plus screen recordings. No subscription, all free or one-time.

  6. Background music. This was the part that was killing my margins and where the pipeline got most interesting.

The music problem specifically. A 10-minute video needs maybe 3-6 different music cues: intro, main sections, transitions, outro, occasional dramatic moments. For a while I was on Epidemic Sound at $15 a month. It was fine, but the tracks were recognizable across channels in my niche and I kept running into the same cuts other creators were using.

Tried Suno. Great quality but at 10 tracks per video times 10 videos per month, I was burning through credits in 2 weeks. Their pricing does not fit a high-volume background music workflow.

What I moved to: LoopMaker, another local app that generates music on the Mac. One-time $49 purchase, unlimited generations, fully offline. Built on ACE-Step 1.5, an open-source music model that benchmarks between Suno v4.5 and v5 in quality. I generate 3 to 5 variations of each cue I need, pick the best one, drop it in the edit. Done.

Also my app, same disclosure applies. I built both Murmur and LoopMaker because the subscription economics for AI tools stop making sense past a certain volume and I wanted tools where the unit economics were different.
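
For anyone who would rather script this against the open-source model directly than use an app, the cue workflow is just a loop. A minimal Python sketch under stated assumptions: generate_track and my_acestep_wrapper are hypothetical stand-ins for whatever entry point your local ACE-Step setup exposes, not the model's actual API.

    import random
    from pathlib import Path

    # Hypothetical wrapper around a local ACE-Step 1.5 text-to-music call.
    # Swap in the real entry point from your own setup.
    from my_acestep_wrapper import generate_track

    CUES = {
        "intro": "upbeat electronic, bright synths, 120 BPM",
        "main_bed": "lo-fi chill beat with vinyl crackle, 80 BPM",
        "dramatic": "cinematic orchestral, 90 BPM, D minor",
        "outro": "warm ambient pads, slow fade",
    }
    VARIATIONS = 4  # a few takes per cue, keep the best one

    out_dir = Path("music_cues")
    out_dir.mkdir(exist_ok=True)

    for cue_name, prompt in CUES.items():
        for i in range(VARIATIONS):
            seed = random.randrange(2**32)  # output quality is seed-sensitive
            generate_track(
                prompt=prompt,
                duration_s=45,
                seed=seed,
                out_path=out_dir / f"{cue_name}_v{i}_{seed}.wav",
            )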

What LoopMaker handles well for video work:

  • Cinematic and dramatic backgrounds for explainer content
  • Lo-fi and chill beds under voiceover
  • Upbeat electronic for intros and outros
  • Ambient texture for mood transitions
  • Genre-matched tracks for themed videos (retro synthwave for 80s content, orchestral for history content, etc.)

Where I still use other tools for music:

  • If a video is specifically about a song or genre, I still use Suno because its vocal quality on polished tracks is higher
  • For the rare video that needs something specific I cannot prompt well, Epidemic Sound has one-off per-track pricing that I use maybe once a month

  7. Editing. Final Cut Pro, one-time purchase. Not touching Premiere's subscription.
  8. Thumbnails. Photoshop plus Midjourney (already paid), manual arrangement.
  9. Upload and scheduling. YouTube Studio, free.

The only subscriptions left after the switch: Claude Pro, ChatGPT Plus, and Midjourney Standard. Roughly $80 a month total, down from over $250 a month before I consolidated.

The one-time purchases: Final Cut Pro, Murmur, LoopMaker. Roughly $400 total, spread across the year as I bought them. Paid back within a few months of saved subscription costs.

The honest caveat. Moving to local tools trades monthly cost for upfront effort. You need an Apple Silicon Mac (M1 or newer). The learning curve is real, and some workflows are less polished than paid cloud tools. If you are making your first 10 videos and figuring things out, the subscriptions are worth it for the lower friction. At volume, the math flips.

Links for the less obvious tools I mentioned:

Murmur: https://www.murmurtts.com
LoopMaker: https://tarun-yadav.com/loopmaker

Happy to go deeper on any specific part of the pipeline. Also curious what others here are doing to keep costs sane at volume, and what the current state of local AI visual gen looks like. That is the one piece I have not been able to move off cloud yet.

u/tarunyadav9761 — 3 hours ago
▲ 7 r/audiobooks+2 crossposts

I finally got off Google Cloud TTS and other hosted speech services. Here is my fully local setup 4 months in.

About 4 months ago I fully moved off hosted speech services. Google Cloud TTS was the last piece I cut, but I had also been bouncing between ElevenLabs and Speechify at various points. Wanted to share the setup I landed on because nothing I read before doing this covered the practical parts well.

What I was using before and why it bothered me:

  • Google Cloud TTS for long form audio conversion. Fine voices, but every article, PDF, and personal note I converted was being sent to Google. My own writing. My notes. Client documents. All of it.
  • ElevenLabs for higher quality narration work. Better voices but same privacy problem, plus subscription pressure and per character pricing that made me hesitate before generating.
  • Apple's built-in speech synthesis for short stuff. Robotic, no one wants to listen to that for more than 30 seconds.

The gap was always the same. I wanted natural-sounding voices for hours of content, without any of it leaving my machine. Until about a year ago that combination did not really exist. Open-source TTS models were either robotic or required a research-level Python setup. Now that has changed.

The setup I use now:

Everything runs locally on an M2 MacBook Air via Apple's MLX framework. The models I actually use day to day are Kokoro, Fish Speech S2 Pro, and Qwen3-TTS. Each does something different well. Kokoro is fast and clean for long form narration. Fish Speech handles expressive delivery with emotion tags (whisper, excited, laughing). Qwen3-TTS is the multilingual one, 25+ languages at quality that genuinely surprised me.

Nothing goes to any server. No telemetry, no accounts, no internet required after the initial model download. I verified this with Little Snitch in the first few weeks just to be sure.

What I actually use this for:

  • Long form article listening. I dump articles from RSS and read-later apps, generate audio, listen while walking or doing chores. Zero cloud exposure.
  • PDF conversion for research papers and technical docs. Same thing but for longer content.
  • Sensitive document review. This is the one where local matters most. Legal docs, client NDAs, personal financial records. I can listen to them without any copy ever existing outside my machine.
  • Personal journal and note review. I write long notes in Obsidian and sometimes listen to them as a way of re-reading. These have never touched anyone else's infrastructure.
  • EPUB to audiobook conversion for books not on Audible.
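
On the EPUB point: the extraction half is the only fiddly part, and it is a few lines of Python. A rough sketch, assuming ebooklib and BeautifulSoup for parsing; synthesize and my_tts are hypothetical stand-ins for whatever local TTS call you actually use.

    # pip install ebooklib beautifulsoup4
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    from my_tts import synthesize  # hypothetical stand-in for your local TTS call

    book = epub.read_epub("book.epub")

    # Walk the book's HTML documents and narrate each one
    # (chapter ordering may need a check against the spine).
    for n, item in enumerate(book.get_items_of_type(ebooklib.ITEM_DOCUMENT)):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        text = soup.get_text(separator="\n").strip()
        if text:
            synthesize(text, out_path=f"chapter_{n:02d}.wav")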

The honest tradeoffs:

Local TTS is still behind the best cloud options (ElevenLabs v3) on character voices and emotional range. For anything where voice acting is the product, cloud wins. For the 80 percent use case of "I want natural narration of this text," local is now more than good enough.

Battery life on the MacBook Air when generating is not great. An hour of continuous generation drops about 15 to 20 percent. Less of an issue on desktop Macs, which stay plugged in.

Storage. The models are 1 to 3 GB each. If you run multiple you are looking at 5 to 10 GB of disk, one time.

If anyone wants the direct path without setting this up yourself, the app I ended up building for my own use is Murmur at murmurtts.com. One-time purchase, and the privacy design is the whole point of it. But the models themselves are all open source and you can run them directly from their repos if you prefer a DIY setup. I would genuinely recommend going direct if you are comfortable with Python; the whole point of this sub is self-sovereignty over tools.
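
If you do go the DIY route, Kokoro is the easiest starting point. A minimal sketch based on the usage shown in the model's own docs (this is the plain PyTorch path; running it through MLX needs a different runner):

    # pip install kokoro soundfile
    import soundfile as sf
    from kokoro import KPipeline

    pipeline = KPipeline(lang_code="a")  # "a" = American English

    text = open("article.txt").read()

    # The pipeline chunks long text and yields audio per chunk.
    for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
        sf.write(f"article_{i:03d}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio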

Curious what others here are using for speech synthesis. Is anyone running Piper, Coqui, Tortoise, or XTTS in production? I am especially interested in Linux setups since I have only done this on Mac.

u/tarunyadav9761 — 19 hours ago
▲ 41 r/TextToSpeech+3 crossposts

Four months in on my local TTS Mac app: what I got wrong about what people actually wanted

Shipped Murmur here around New Year's, a local text-to-speech Mac app that runs fully on-device on Apple Silicon. It's been four months. Wanted to share what I got wrong about what users actually wanted, because it's different from what I built for.

What I thought the app was for:

I built it because I wanted to listen to long articles and drafts while doing other things. The core job in my head was "paste text, get audio, listen while walking." Privacy was a nice-to-have. Voice quality was the product. I shipped with one model (Kokoro), three or four voices, and called it done.

What people actually bought it for:

The first surprise was audiobook production. Not casual listening: actual indie authors converting full novels into audiobooks. One user sent me a 90,000-word manuscript and asked how to batch the whole thing in one go. That wasn't a feature at the time, it was a hack. I built proper batch queue processing because of that email.

Second surprise was voice cloning. I almost didn't ship it. It felt like a gimmicky demo feature next to the "serious" TTS models. It's now probably the single most-used feature by paying users. Podcasters cloning their own voice for intro/outro consistency. Course creators cloning their voice for lesson narration so they don't have to re-record when they edit scripts. YouTube creators cloning their voice for sponsor reads. I completely underestimated how much people wanted this.

Third surprise was languages. I launched in English and half-heartedly added Spanish in week two. The Spanish launch post outperformed the English one. People were buying for German, Japanese, Korean, and Portuguese, languages where ElevenLabs pricing gets especially brutal because the voice selection is smaller and per-character costs are the same. I now support 25+ languages across six different models, because every language request was a real customer.

Fourth surprise was batch processing specifically. The original app was one-script-at-a-time. A course creator told me she was manually re-running the app 40 times a week for her lessons. That changed my priorities for a month. Now you drop a folder of scripts or an ePub and come back to the finished audio.
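
Conceptually the folder-drop version is nothing more than a queue over a directory. A toy sketch of the idea, with synthesize again as a hypothetical stand-in for the actual generation call:

    from pathlib import Path

    from my_tts import synthesize  # hypothetical stand-in for the TTS call

    in_dir = Path("scripts")   # drop .txt lesson scripts here
    out_dir = Path("audio")
    out_dir.mkdir(exist_ok=True)

    # One unattended pass over every script in the folder.
    for script in sorted(in_dir.glob("*.txt")):
        target = out_dir / (script.stem + ".wav")
        if target.exists():
            continue  # already rendered; re-runs only pick up new scripts
        synthesize(script.read_text(), out_path=target)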

What I got right:

The no-subscription decision. I thought long and hard about it and came close to launching with a $10/mo tier. I'm glad I didn't. Every buyer email that mentions price says some version of "thank you for not making it a subscription." It's not just a pricing preference; it filters for a different kind of user, one who actually uses the product and sticks around.

The fully-local architecture. Some people do buy for privacy. But the real win of fully-local wasn't privacy, it was unlimited use. When generation doesn't cost me anything on the backend, it doesn't cost the user anything either, which means they can actually do things like process a whole novel or a 300-lesson course without thinking about caps.

What's in the app now vs. launch:

  • 6 TTS engines (Kokoro, Fish Speech S2 Pro, Qwen3-TTS, and others), picked per content type
  • 860+ community voices plus 10-second voice cloning
  • 25+ languages
  • Batch processing for folders, ePubs, and queued scripts
  • Expression and emotion control on supported models ([whisper], [excited], [chuckling] style tags work; see the example after this list)
  • Still one-time purchase ($49), still fully offline, still no telemetry
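
For anyone wondering what tagged input actually looks like, here is a short example script. The tag style matches what the list above describes; the exact tag vocabulary depends on which model you pick.

    [excited] Welcome back! Today we are trying something I did not expect to work.
    [whisper] Here is the part nobody tells you about.
    [chuckling] And yes, I found that out the hard way.
    Untagged lines are read as normal narration.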

Requirements haven't changed: macOS 14+, Apple Silicon.

https://www.murmurtts.com

Happy to answer anything about what the last four months of shipping updates have looked like, or which of the six models is right for which use case. Also curious what people here have shipped and what the biggest "I built it for X but users wanted Y" surprise has been. Those stories are my favorite part of following indie Mac dev.

u/tarunyadav9761 — 19 hours ago
▲ 6 r/aicuriosity+2 crossposts

Open-source AI music generation just hit commercial quality and it runs on a MacBook Air. Here's what that actually means.

Something wild happened in the AI music space that I don't think got enough attention here.

A model called ACE-Step 1.5 dropped in January: open-source, MIT-licensed, and it benchmarks above most commercial music AI on SongEval. We're talking quality between Suno v4.5 and Suno v5. It generates full songs with vocals, instrumentals, and lyrics in 50+ languages. And it needs less than 4GB of VRAM.

Let that sink in. The open-source music model now beats most of the paid ones.

Why this matters (the Stable Diffusion parallel):

Remember when image generation was locked behind DALL-E and Midjourney? Then Stable Diffusion came out open-source and suddenly anyone could generate images locally. It completely changed the landscape.

ACE-Step 1.5 is that moment for music. The model quality is there. The licensing is there (MIT + trained on licensed/royalty-free data). The hardware requirements are reasonable.

What I did with it:

I wrapped ACE-Step 1.5 into a native Mac app called LoopMaker. You type a prompt like "cinematic orchestral, 90 BPM, D minor" or "lo-fi chill beats with vinyl crackle" and it generates the full track locally on your Mac.

No Python setup. No terminal. No Gradio. Just a .app you open and use.

It runs through Apple's MLX framework on Apple Silicon; it even works on a fanless MacBook Air. Everything stays on your machine. No cloud, no API calls, no credits.

How ACE-Step 1.5 works under the hood (simplified):

The architecture is a two-stage system:

  1. Language Model (the planner) takes your text prompt and uses Chain-of-Thought reasoning to create a full song blueprint: tempo, key, structure, arrangement, lyrics, style descriptors. It basically turns "make me a chill beat" into a detailed production plan.
  2. Diffusion Transformer (the renderer) takes that blueprint and synthesizes the actual audio. Similar concept to how Stable Diffusion generates images from latent space, but for audio.

This separation is clever because the LM handles all the "understanding what you want" complexity, and the DiT focuses purely on making it sound good. Neither has to compromise for the other.
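
To make the separation concrete, here is the shape of the handoff in sketch-level Python. Every name here is illustrative, not ACE-Step's real internals; it only shows the interface between the two stages.

    from dataclasses import dataclass

    @dataclass
    class SongBlueprint:
        # The planner's output: a concrete production plan.
        tempo_bpm: int
        key: str
        structure: list[str]      # e.g. ["intro", "verse", "chorus", "outro"]
        style_tags: list[str]
        lyrics: str = ""

    def plan_song(prompt: str) -> SongBlueprint:
        """Stage 1 (LM): reason a vague prompt into a full blueprint.
        In the real model this is Chain-of-Thought generation;
        it is hardcoded here just to show the interface."""
        return SongBlueprint(
            tempo_bpm=80,
            key="D minor",
            structure=["intro", "main", "outro"],
            style_tags=["lo-fi", "vinyl crackle", "chill"],
        )

    def render_audio(blueprint: SongBlueprint) -> bytes:
        """Stage 2 (DiT): synthesize audio from the blueprint,
        the way Stable Diffusion renders an image from a latent."""
        raise NotImplementedError("diffusion sampling happens here")

    blueprint = plan_song("make me a chill beat")

The point of the split is visible in the types: the renderer never sees your prompt, only the structured plan.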

What blew my mind:

  • It handles genre shifts within a single track
  • Vocals in multiple languages actually sound natural, not machine-translated
  • 1000+ instruments and styles with fine-grained timbre control
  • You can train a LoRA from just a few songs to capture a specific style (not in my app yet, but the model supports it)

Where it still falls short:

  • Output quality varies with random seeds; it's "gacha-style," like early SD was
  • Some genres (especially Chinese rap) underperform
  • Vocal synthesis quality is good but not ElevenLabs-tier
  • Fine-grained musical parameter control is still coarse

The bigger picture:

We're watching the same open-source pattern play out across every AI modality:

  • Text: GPT locked behind API → LLaMA/Mistral run locally
  • Images: DALL-E/Midjourney → Stable Diffusion/Flux locally
  • Code: Copilot → DeepSeek/Codestral locally
  • Music: Suno/Udio → ACE-Step 1.5 locally ← we are here

Every time it happens, the same thing follows: someone wraps the model into a usable app, and suddenly millions of people who'd never touch a terminal can use it. That's what LoopMaker is trying to be.

🔗 tarun-yadav.com/loopmaker

u/tarunyadav9761 — 3 days ago