
u/blueroses200

"This dissertation serves as the first full comprehensive grammar of the Phrygian language, which was spoken in central Anatolia from the beginning of the 1st millennium BCE to the middle of the 1st millennium CE and is attested in a total of about 500 inscriptions. The language as attested is divided into two stages: Old Phrygian, which was written in a native alphabet, spans from the earliest Phrygian inscriptions to about 300 BCE, whereas New Phrygian, which was written in the Greek alphabet, encompasses about 120 inscriptions from the beginning of the first millennium CE. Previous scholarship has for the most part focused on interpreting Phrygian inscriptions, the lexicon of the language, or tackled individual issues of grammar; this work aims to produce a full synchronic and diachronic grammar of the language, focusing prominently on the dialectal position of Phrygian within the Indo-European group of languages."
Thanks for the invite to post here!
We're working with communities to curate the most linguistically diverse collection of datasets in the world, and I thought I'd share a few of the latest:
Well-known ones first: Common Voice's latest release, 25.0, has massive speech corpora for Spanish (48GB!), Kinyarwanda (57GB, bigger than Spanish, which is so interesting), German, French, Bengali, Esperanto, Belarusian, Chinese, Swahili... if you're doing ASR work, you really have no excuse not to be using these. All CC0 licensed too, so they can be used for anything (ethical) you can imagine.
https://datacollective.mozillafoundation.org/datasets
But less well known is the INEL material from the University of Hamburg, which is doing genuinely important work. They've got supervised speech-to-text datasets for languages like:
- Nganasan (38.5 hours!! for an endangered Samoyedic language spoken by like a few hundred people)
- Dolgan — endangered Turkic language, 13 hours of data
- Kamas — this one hit me hard, it's listed as an extinct language that they're hoping to revitalize. Someone recorded 14 hours of audio for a language with no living native speakers.
- Evenki, Selkup, Enets, Nenets too
The effort that went into preserving these is something else.
Other cool stuff:
- Bamun-French parallel corpus (4,444 lines, useful for MT work on an African language that doesn't get nearly enough attention)
- English-Hausa parallel corpus — 5k sentence pairs, great for MT
- A Persian literary corpus of 1.26 MILLION tokens spanning poetry and literature
- Afaan Oromoo word-level speech data for TTS work
- A Catalan offensive language dataset
- Even a corpus in Aranese, which is a variety of Occitan spoken in the Pyrenees. Again, CC0 licensed.
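If you want to kick the tires on one of the parallel corpora for MT work, here's a minimal sketch of reading sentence pairs from a tab-separated file. The column layout (source sentence, tab, target sentence) is an assumption on my part; check each dataset's actual format before relying on it:

```python
import csv
import io

# Hypothetical sample in a one-pair-per-line TSV layout; the real files
# from the Data Collective may use different columns or delimiters.
sample = "Good morning\tBarka da safiya\nThank you\tNa gode\n"

def read_pairs(handle):
    """Yield (source, target) sentence pairs from a TSV stream,
    skipping malformed or empty rows."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if len(row) == 2 and row[0].strip() and row[1].strip():
            yield row[0].strip(), row[1].strip()

# For a real corpus you'd pass open("en-ha.tsv", encoding="utf-8") instead.
pairs = list(read_pairs(io.StringIO(sample)))
print(len(pairs))   # usable pairs found
print(pairs[0])
```

The filtering step matters more than it looks: with only ~5k pairs, a handful of blank or misaligned lines can noticeably skew a small MT model, so it's worth validating every row rather than assuming the file is clean.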
Basically if you're working on low-resource languages, doing academic NLP, or just want to contribute to something that actually matters for language preservation — go explore what we're doing together. Anyone here already been working with any of these? Curious what people have actually built with the lower-resource ones especially!
This could be useful for revivalists who don’t know where to start.
If you know of more resources, please share them here on the sub!