u/SweatyCheetah6825 — reddlx

Thanks for the invite to post here!

Together with language communities, we're curating the most linguistically diverse collection of datasets in the world, and I thought I'd share a few of the latest:

Well-known ones first: Common Voice. The latest release, 25.0, has massive speech corpora for Spanish (48GB!), Kinyarwanda (57GB, bigger than Spanish, which is so interesting), German, French, Bengali, Esperanto, Belarusian, Chinese, Swahili... If you're doing ASR work you really have no excuse not to be using these. All CC0 licensed too, so they can be used for anything (ethical) you can imagine.

https://datacollective.mozillafoundation.org/datasets

But less well known is the INEL stuff from the University of Hamburg, where they're doing genuinely important work. They've got supervised speech-to-text datasets for languages like:

Nganasan (38.5 hours!! for an endangered Samoyedic language spoken by like a few hundred people)
Dolgan — endangered Turkic language, 13 hours of data
Kamas — this one hit me hard, it's listed as an extinct language that they're hoping to revitalize. Someone recorded 14 hours of audio for a language with no living native speakers.
Evenki, Selkup, Enets, Nenets too

The effort that went into preserving these is something else.

https://datacollective.mozillafoundation.org/datasets

Other cool stuff:

Bamun-French parallel corpus (4,444 lines, useful for MT work on an African language that doesn't get nearly enough attention)
English-Hausa parallel corpus — 5k sentence pairs, great for MT
A Persian literary corpus of 1.26 MILLION tokens spanning poetry and literature
Afaan Oromoo word-level speech data for TTS work
A Catalan offensive language dataset
Even a corpus in Aranese, which is a variety of Occitan spoken in the Pyrenees. Again, CC0 licensed.

Basically if you're working on low-resource languages, doing academic NLP, or just want to contribute to something that actually matters for language preservation — go explore what we're doing together. Anyone here already been working with any of these? Curious what people have actually built with the lower-resource ones especially!

https://datacollective.mozillafoundation.org/datasets