r/speechtech
Vibration and Distortion in CosyVoice3 Fine Tuned Model
I fine-tuned Fun-CosyVoice3-0.5B, but at inference time the generated audio has significant distortion, noise, and vibration.
To isolate the issue, I performed the following tests:
1. HiFiGAN-only test
- Regenerated audio directly from an input audio chunk using HiFiGAN (no tokenizer or Flow)
- The regenerated output matches the original clean audio
- This suggests HiFiGAN is not the source of the issue
2. Full pipeline test (tokenizer → Flow → HiFiGAN)
- Passed clean audio samples from my dataset through the full pipeline
- The regenerated output contains noticeable vibration and distortion, despite the clean input
3. Base vs fine-tuned Flow
- Tested with both the base Flow model and the fine-tuned Flow model
- Both produce similar vibration artifacts
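To put a rough number on the "vibration" instead of judging by ear, one option is to compare the short-time loudness envelope of the input against the regenerated output. The sketch below is only an illustration: the function names, the 480-sample frame, and the 240-sample hop are my own choices, and it assumes both signals are mono lists of float samples at the same sample rate and roughly time-aligned. It is not part of CosyVoice.

```python
import math


def frame_rms_db(samples, frame_len=480, hop=240, floor=1e-8):
    """Return the RMS level in dB of each frame of a mono signal."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp to a small floor so silent frames do not hit log10(0).
        levels.append(20.0 * math.log10(max(rms, floor)))
    return levels


def envelope_gap_db(reference, regenerated, frame_len=480, hop=240):
    """Mean absolute difference between the two RMS envelopes, in dB."""
    ref = frame_rms_db(reference, frame_len, hop)
    gen = frame_rms_db(regenerated, frame_len, hop)
    n = min(len(ref), len(gen))  # compare only the overlapping frames
    if n == 0:
        raise ValueError("signals are shorter than one frame")
    return sum(abs(r - g) for r, g in zip(ref[:n], gen[:n])) / n
```

A value near 0 dB for the HiFiGAN-only output and a clearly larger value for the full pipeline would be consistent with what I hear, and the same number could be used to compare the base and fine-tuned Flow models.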
Additional observation:
- A clicking sound (like a mouse click) appears at the start and end of the generated audio
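One quick check for the clicks is to apply a short fade at both ends of the generated waveform and listen again. This is only a sketch under assumptions: `apply_edge_fades` is a hypothetical helper name, the 10 ms default is an arbitrary choice, and the input is assumed to be a mono list of float samples.

```python
def apply_edge_fades(samples, sample_rate, fade_ms=10.0):
    """Return a copy with a linear fade-in and fade-out of fade_ms each."""
    # Never let the two fades overlap on very short clips.
    n = min(int(sample_rate * fade_ms / 1000.0), len(samples) // 2)
    out = list(samples)  # leave the caller's data untouched
    for i in range(n):
        gain = i / n  # ramps from 0.0 up towards 1.0
        out[i] *= gain        # fade-in at the start
        out[-1 - i] *= gain   # mirrored fade-out at the end
    return out
```

If the clicks are still audible after fading, they are not confined to the first and last few milliseconds, which would point away from a simple boundary discontinuity.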
What I’ve tried:
- Multiple audio normalization techniques (LUFS) before feeding data to the tokenizer
- De-clipping
- Neither made any improvement
I have been stuck on this for weeks now and cannot figure out a way forward. It would be really helpful if someone with experience working with CosyVoice could help out.
Questions:
- Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
- Could this be related to tokenizer encoding/decoding mismatch or preprocessing?
- Any suggestions on debugging?
u/NoTransition8017 — 3 days ago