u/NoTransition8017

I fine-tuned Fun-CosyVoice3-0.5B, but at inference time the generated audio has significant distortion, noise, and vibration.

To isolate the issue, I performed the following tests:

1. HiFiGAN-only test

Regenerated audio directly from an input audio chunk using HiFiGAN only (no tokenizer or Flow)
The regenerated output matches the original clean audio
This suggests HiFiGAN is not the source of the issue

2. Full pipeline test (tokenizer → Flow → HiFiGAN)

Passed clean audio samples from my dataset through the full pipeline
The regenerated output contains noticeable vibration and distortion, despite the clean input

3. Base vs fine-tuned Flow

Tested with both:

Base Flow model
Fine-tuned Flow model
Both produce similar vibration artifacts
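
To compare the tests above with a number instead of only by ear, here is a minimal sketch of one way to measure the "vibration": the average frame-to-frame jump of the RMS envelope in dB, computed separately for the input clip and the regenerated clip. It does not need the two signals to be sample-aligned. The function names and the 240-sample frame (10 ms, assuming 24 kHz audio) are my own choices for illustration, not part of CosyVoice, and the audio is assumed to be already loaded as a list of float samples.

```python
import math

def frame_rms_db(samples, frame_len=240, floor_db=-80.0):
    """RMS level in dB of each non-overlapping frame (frame_len is a hypothetical default)."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp silent frames to a floor so log10(0) is never evaluated.
        levels.append(max(20.0 * math.log10(rms), floor_db) if rms > 0 else floor_db)
    return levels

def envelope_roughness(samples, frame_len=240):
    """Mean absolute frame-to-frame change of the RMS envelope, in dB."""
    env = frame_rms_db(samples, frame_len)
    if len(env) < 2:
        return 0.0
    return sum(abs(b - a) for a, b in zip(env, env[1:])) / (len(env) - 1)

# Synthetic check: a steady tone versus the same tone with a 30 Hz tremolo.
sr = 24000  # assumed sample rate
clean = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
modulated = [(1.0 + 0.5 * math.sin(2 * math.pi * 30 * n / sr)) * x
             for n, x in enumerate(clean)]

rough_clean = envelope_roughness(clean)
rough_mod = envelope_roughness(modulated)
print(f"clean: {rough_clean:.2f} dB/frame, modulated: {rough_mod:.2f} dB/frame")
```

On real clips the idea is to compute `envelope_roughness` for the dataset input and for the pipeline output of the same utterance. Speech has natural envelope movement, so only the difference between the two values is meaningful, not the absolute number.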

Additional observation:

A clicking sound, like a mouse click, appears at the start and end of the generated audio
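
A rough way to check whether the click is only a boundary discontinuity is to compare the peak level in the first and last few milliseconds against the rest of the clip, then see whether a short fade removes it. The sketch below is an assumption-laden illustration: the helper names, the 20 ms edge window, the 10 ms fade, and the 24 kHz sample rate are all hypothetical, and the audio is assumed to be a list of float samples.

```python
import math

def edge_peak_ratio(samples, sample_rate, edge_ms=20.0):
    """Peak of the first/last edge_ms of the clip relative to the peak of the middle."""
    n_edge = max(1, int(sample_rate * edge_ms / 1000.0))
    if len(samples) <= 2 * n_edge:
        raise ValueError("clip is too short for the chosen edge window")
    body_peak = max(abs(x) for x in samples[n_edge:-n_edge]) or 1e-12
    head_peak = max(abs(x) for x in samples[:n_edge])
    tail_peak = max(abs(x) for x in samples[-n_edge:])
    return head_peak / body_peak, tail_peak / body_peak

def apply_fade(samples, sample_rate, fade_ms=10.0):
    """Return a copy with a linear fade-in and fade-out applied."""
    n = min(len(samples) // 2, int(sample_rate * fade_ms / 1000.0))
    out = list(samples)
    for i in range(n):
        gain = i / n  # 0.0 at the very edge, rising towards 1.0
        out[i] *= gain
        out[-1 - i] *= gain
    return out

# Synthetic check: a tone surrounded by silence, with a spike on the first and last sample.
sr = 24000  # assumed sample rate
signal = [0.0] * sr
for n in range(2400, sr - 2400):
    signal[n] = 0.5 * math.sin(2 * math.pi * 220 * n / sr)
signal[0], signal[-1] = 0.9, -0.9

head_ratio, tail_ratio = edge_peak_ratio(signal, sr)
faded_head, faded_tail = edge_peak_ratio(apply_fade(signal, sr), sr)
print(head_ratio, tail_ratio, faded_head, faded_tail)
```

If the ratios drop after the fade and the click is no longer audible, the artifact sits at the waveform edges; if the click survives, it is more likely produced further up the pipeline.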

What I’ve tried:

Several audio normalization techniques (LUFS-based) before feeding data to the tokenizer
De-clipping
Neither made any improvement

I have been stuck on this for weeks and cannot figure out a way forward. It would be really helpful if someone with experience working with CosyVoice could help out.

Questions:

Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
Could this be related to tokenizer encoding/decoding mismatch or preprocessing?
Any suggestions on debugging?

Vibration and Distortion in CosyVoice3 Fine Tuned Model