u/NoTransition8017

Vibration and Distortion in Fine-Tuned CosyVoice3 Model

I fine-tuned Fun-CosyVoice3-0.5B, but at inference time the generated audio has significant distortion, noise, and vibration.

To isolate the issue, I performed the following tests:

1. HiFiGAN-only test

  • Regenerated audio directly from an input audio chunk using HiFiGAN (no tokenizer or Flow)
  • Regenerated output sounds identical to the original clean audio
  • Suggests HiFiGAN is not the source of the issue
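To back up the listening test with a number, the original and regenerated audio can be compared with a log-spectral distance. The following is a minimal sketch that uses only NumPy. The function names (`stft_mag`, `log_spectral_distance`), the FFT size, the hop length, and the 24 kHz rate in the demo are illustrative assumptions and not part of CosyVoice. The synthetic tone stands in for real audio, and both inputs are assumed to be mono float arrays at the same sample rate.

```python
import numpy as np

def stft_mag(x, n_fft=1024, hop=256):
    """Magnitude spectrogram (frames x bins) from a Hann-windowed STFT."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < n_fft:
        x = np.pad(x, (0, n_fft - len(x)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def log_spectral_distance(ref, test, eps=1e-8):
    """Mean per-frame log-spectral distance in dB (0.0 means identical spectra)."""
    n = min(len(ref), len(test))  # compare only the overlapping part
    a = 20.0 * np.log10(stft_mag(ref[:n]) + eps)
    b = 20.0 * np.log10(stft_mag(test[:n]) + eps)
    return float(np.mean(np.sqrt(np.mean((a - b) ** 2, axis=1))))

# Synthetic demo: a clean tone versus the same tone with an 8 Hz amplitude wobble.
sr = 24000  # illustrative sample rate, not a CosyVoice requirement
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)
wobbly = clean * (1.0 + 0.3 * np.sin(2 * np.pi * 8.0 * t))

lsd_same = log_spectral_distance(clean, clean)
lsd_wobbly = log_spectral_distance(clean, wobbly)
print(f"identical: {lsd_same:.3f} dB, with wobble: {lsd_wobbly:.3f} dB")
```

Running the same comparison on the HiFiGAN-only output and on the full-pipeline output would show how much each stage deviates from the input, assuming the outputs are roughly time-aligned with the original.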

2. Full pipeline test (tokenizer → Flow → HiFiGAN)

  • Passed clean audio samples from my dataset through the full pipeline
  • Regenerated output contains noticeable vibration and distortion, despite the clean input

3. Base vs fine-tuned Flow

Tested with both:

  • Base Flow model
  • Fine-tuned Flow model
  • Result: both produce similar vibration artifacts, so the fine-tuning itself does not appear to be the cause

Additional observation:

  • A clicking sound (like a mouse click) appears at the start and end of the generated audio
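One possible workaround for the edge clicks is a short fade-in and fade-out on the generated waveform. This is only a sketch that masks the symptom and does not explain where the click comes from. The helper name `apply_edge_fades`, the 10 ms fade length, and the 24 kHz rate in the demo are assumptions for illustration.

```python
import numpy as np

def apply_edge_fades(wav, sample_rate, fade_ms=10.0):
    """Return a copy of wav with a raised-cosine fade-in and fade-out."""
    wav = np.asarray(wav, dtype=np.float32).copy()
    # Never let the two fades overlap on very short clips.
    n = min(int(sample_rate * fade_ms / 1000.0), len(wav) // 2)
    if n == 0:
        return wav
    # Ramp rises smoothly from 0.0 to 1.0 over n samples.
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, n)))
    ramp = ramp.astype(np.float32)
    wav[:n] *= ramp
    wav[-n:] *= ramp[::-1]
    return wav

# Demo on a constant signal, which would click hard at both edges if played raw.
sr = 24000  # illustrative sample rate
raw = np.ones(sr, dtype=np.float32)
faded = apply_edge_fades(raw, sr)
print(faded[0], faded[sr // 2], faded[-1])
```

If the click survives the fade, it is probably not sitting exactly at the first and last samples, which would itself be a useful clue.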

What I’ve tried:

  • Multiple audio normalization techniques (LUFS) before feeding data to the tokenizer
  • De-clipping
  • Neither made any improvement
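A quick way to confirm that the preprocessing above leaves the input in good shape is to print a few basic statistics per file before it reaches the tokenizer. The sketch below assumes mono float audio scaled to [-1, 1]. The function name `audio_health` and the 0.999 clipping threshold are hypothetical choices, and the synthetic signals stand in for real dataset files.

```python
import numpy as np

def audio_health(wav, clip_threshold=0.999):
    """Basic sanity statistics for a mono float waveform in [-1, 1]."""
    wav = np.asarray(wav, dtype=np.float64)
    rms = np.sqrt(np.mean(wav ** 2))
    return {
        "peak": float(np.max(np.abs(wav))),
        "rms_dbfs": float(20.0 * np.log10(rms + 1e-12)),
        "dc_offset": float(np.mean(wav)),
        # Fraction of samples at or above the threshold, a rough clipping indicator.
        "clipped_fraction": float(np.mean(np.abs(wav) >= clip_threshold)),
    }

# Demo: a well-behaved tone versus the same tone driven into hard clipping.
sr = 24000  # illustrative sample rate
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
clipped = np.clip(4.0 * tone, -1.0, 1.0)

tone_stats = audio_health(tone)
clipped_stats = audio_health(clipped)
print(tone_stats)
print(clipped_stats)
```

A non-zero `clipped_fraction` or a large `dc_offset` after normalization would point at the preprocessing rather than the model.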

I have been stuck on this for weeks now and cannot figure out a way forward. It would be really helpful if someone with experience working with CosyVoice could help out.

Questions:

  • Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
  • Could this be related to tokenizer encoding/decoding mismatch or preprocessing?
  • Any suggestions on debugging?
u/NoTransition8017 — 3 days ago