r/speechtech

Vibration and Distortion in a Fine-Tuned CosyVoice3 Model

I fine-tuned Fun-CosyVoice3-0.5B, but after training, during inference I observe significant distortion, noise, and vibration in the generated audio.

To isolate the issue, I performed the following tests:

1. HiFiGAN-only test

  • Regenerated audio directly from an input audio chunk using HiFiGAN (no tokenizer or Flow)
  • The regenerated output is essentially identical to the original clean audio
  • This suggests HiFiGAN is not the source of the issue

2. Full pipeline test (tokenizer → Flow → HiFiGAN)

  • Passed clean audio samples from my dataset through the full pipeline
  • The regenerated output contains noticeable vibration and distortion, despite the clean input

3. Base vs fine-tuned Flow

Tested with both:

  • Base Flow model
  • Fine-tuned Flow model

Both produce similar vibration artifacts.

Additional observation:

  • A clicking sound, like a mouse click, appears at the start and end of the generated audio
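For the boundary clicks specifically: they often come from the waveform not starting or ending at zero amplitude. A short fade at each end is a quick way to confirm that. A minimal sketch in plain Python, assuming mono float samples; the 24 kHz default is just a placeholder, only the fade length matters:

```python
def apply_fades(samples, sample_rate=24000, fade_ms=10.0):
    """Apply short linear fade-in/fade-out to suppress boundary clicks."""
    n = int(sample_rate * fade_ms / 1000)
    n = min(n, len(samples) // 2)   # never overlap the two fades
    out = list(samples)
    for i in range(n):
        g = i / n                   # gain ramps 0 -> 1
        out[i] *= g                 # fade-in at the start
        out[-1 - i] *= g            # mirrored fade-out at the end
    return out
```

If the clicks disappear with a 5–10 ms fade but the vibration remains, the two artifacts likely have separate causes.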

What I’ve tried:

  • Multiple audio normalization techniques (LUFS) before feeding data to the tokenizer
  • Also tried de-clipping
  • No improvement

I have been stuck on this for weeks and cannot figure it out. It would be really helpful if someone with past experience working with CosyVoice could weigh in.

Questions:

  • Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
  • Could this be related to tokenizer encoding/decoding mismatch or preprocessing?
  • Any suggestions on debugging?
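On debugging: since HiFiGAN alone reconstructs cleanly, one way to localize the artifact is to compare the full-pipeline output against the clean input frame by frame. A rough sketch, assuming the two signals are the same length and approximately time-aligned (token-based resynthesis can shift timing, so treat the result qualitatively):

```python
import math

def frame_rms(samples, frame=512, hop=256):
    """Per-frame RMS energy of a mono float signal."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame]) / frame)
        for i in range(0, max(len(samples) - frame, 0) + 1, hop)
    ]

def energy_delta(original, resynth, frame=512, hop=256):
    """Frame-wise absolute RMS difference between clean input and
    pipeline output; large values flag where the two diverge."""
    a = frame_rms(original, frame, hop)
    b = frame_rms(resynth, frame, hop)
    return [abs(x - y) for x, y in zip(a, b)]
```

A roughly flat, elevated delta might point at broadband degradation from token quantization, while spikes only at the ends would point back at the boundary clicks.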
reddit.com
u/NoTransition8017 — 3 days ago

Looking for help with a specific use case: speaker diarization between two individuals in a noisy environment. Have tried a Seeed Studio microphone and Raspberry Pi, but the audio isn't clear enough. Need help.

I have been trying to capture voices in a noisy environment with a Seeed Studio ReSpeaker XVF3800 and a Raspberry Pi, but I can't get the audio clear enough to do speaker diarization at a high enough level to accomplish what I need. Looking for someone to help me solve this problem. I think I need a sound engineer and someone who also knows how to leverage AI to enhance the captured audio at scale. Anyone interested, or know someone who might be able to help?
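Not a substitute for a sound engineer, but before changing hardware it can help to measure how much of the capture is usable speech. A crude frame-energy gate in plain Python, offered only as a sketch (real pipelines would use a trained VAD such as Silero plus a denoiser; the threshold below is an arbitrary placeholder):

```python
import math

def energy_gate(samples, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Zero out frames whose RMS falls below a dBFS threshold.
    A crude stand-in for a real VAD/denoiser, just to gauge how much
    of a noisy capture is usable speech."""
    frame = int(sample_rate * frame_ms / 1000)
    out = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        out.extend(chunk if db >= threshold_db else [0.0] * len(chunk))
    return out
```

If most frames fall below the gate even while people are speaking, the problem is capture (mic placement, gain, distance) rather than anything a diarization model can fix downstream.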

u/FitStatistician2661 — 3 days ago

I was recently trying to transcribe an interview for my dad, and he was very cautious about uploading anything to a cloud service, which made sense. When I looked for local options, everything required complex self-hosted setups that would have taken an hour to configure. So instead of doing the one-hour setup, I spent the next four making an in-browser, zero-setup tool anyone can use to transcribe audio locally. Your audio never leaves your device; you can even turn off your Wi-Fi to prove it (after the models load, of course). Give it a try and let me know what you think. I would love feedback from this community especially.

u/Gizmo_4Life — 6 days ago

Best voice API

Hello, I'm building an app via vibe coding, and it really needs audio in and audio out for the AI questions and answers. What are people's experiences with the best way of achieving ultra-clear audio in and audio out? #audioai #vibecoding #ai #helpneeded

u/ofah1974 — 5 days ago

Best APIs for speech to text?

Hi colleagues, I have a SaaS that transcribes 10 million minutes of audio per month, and I've tried many different processing methods. Currently, I'm using orchardrun.com because it offers the best performance and price (0.025 per hour) and allows me to handle fairly large audio files. But do you know of any other, more economical options?

u/SmoothConnection1670 — 6 days ago

Building a Voice Assistant for Medication Reminders — Wake Word Detection Was Harder Than Expected

We’ve been building a voice-first medication assistant at https://www.wiserx.health/, where patients can talk to a voice assistant focused on helping them manage medications at home without apps or caregivers.

One of the hardest parts for us was wake word detection. We tested a few public/open solutions, but accuracy in real-world home environments wasn’t great, especially with elderly users, background TV noise, accents, etc. We also looked at Picovoice, but it was pretty expensive for our stage as a startup.

We ended up working with https://davoice.io/ for custom wake word models and speaker identification, and honestly it’s been solid so far. Detection accuracy has been much better for our use case, and we’ve seen far fewer false positives compared to what we tested earlier. Importantly, we were trying to optimize CPU usage, and the team at DaVoice helped us tweak the model into an efficient one. They also offer functionality beyond wake words, including speaker identification and isolation.

Curious what others here are using for wake word detection on embedded/edge devices and how you’re handling noisy environments.

u/FinishHot5984 — 6 days ago

Anyone using speech-to-text for Indian languages in production? What's actually working and what's not?

Marketing pages claim 90%+ accuracy on Hinglish. Reality from the teams I've talked to looks very different.

If you're using or have evaluated Indian-language STT for any use case (voicebots, call analytics, video KYC, transcription, voice search, etc.), I would love to hear what you picked, why, and where it falls short.

Happy to share my learnings. Drop a comment or DM for a 30 min chat.

u/Spare-Ad2520 — 6 days ago

Need help with Faster-Whisper Transcription

Using the Large V3 model, but facing issues transcribing Sinhala, the Sri Lankan language. Has anyone tried transcribing this language and gotten a good result?
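One thing worth checking, offered as a sketch rather than a known fix: Whisper's language auto-detection is unreliable for low-resource languages, so forcing `language="si"` (the ISO 639-1 code for Sinhala) can help. The model and file names below are placeholders, and the faster-whisper import is deferred so the options themselves are plain data:

```python
SINHALA_OPTS = dict(
    language="si",    # ISO 639-1 code for Sinhala; skips auto-detection
    beam_size=5,      # beam search is usually more stable than greedy decoding
    vad_filter=True,  # trims long silences that often trigger hallucinated text
)

def transcribe_sinhala(path):
    # Requires the faster-whisper package and a downloaded model.
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3")
    segments, _info = model.transcribe(path, **SINHALA_OPTS)
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Even with the language forced, accuracy on Sinhala may stay well below English-level results, since it is underrepresented in Whisper's training data.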

u/THOThunterforever — 8 days ago