u/Wooden_Leek_7258

tl;dr Feature extracted human and synthetic speech data sets free for research and non commercial use.

Hello,

I am building a pair of datasets, first the Human Speech Atlas has prosody and voice telemetry extracted from Mozilla Data Collective datasets, currently 90+ languages and 500k samples of normalized data. All PII scrubbed. Current plans to expand to 200+ languages.

Second the Synthetic Speech Atlas has synthetic voice feature extraction demonstrating a wide variety of vocoders, codecs, deep fake attack types etc. Passed 1 million samples a little while ago, should top 2 million by completion.

Data dictionary and methods up on Hugging Face.

https://huggingface.co/moonscape-software

First real foray into dataset construction so Id love some feedback.

[Self Promotion] Feature Extracted Human and Synthetic Voice datasets - free research use, legally clean, no audio.