[Self Promotion] Feature Extracted Human and Synthetic Voice datasets - free research use, legally clean, no audio.
tl;dr: Feature-extracted human and synthetic speech datasets, free for research and non-commercial use.
Hello,
I am building a pair of datasets. The first, the Human Speech Atlas, contains prosody and voice telemetry extracted from Mozilla Data Collective datasets: currently 90+ languages and 500k samples of normalized data, with all PII scrubbed. The current plan is to expand to 200+ languages.
The second, the Synthetic Speech Atlas, contains features extracted from synthetic voices, covering a wide variety of vocoders, codecs, deepfake attack types, etc. It passed 1 million samples a little while ago and should top 2 million by completion.
The data dictionary and methods are up on Hugging Face:
https://huggingface.co/moonscape-software
This is my first real foray into dataset construction, so I'd love some feedback.