Building an open-core Romanian morphological analysis API — looking for feedback
Romanian NLP tooling sits at roughly 15% of what exists for English. The academic resources exist (DEXonline, RoLEX, UD Romanian Treebank) but there's no production-ready REST API for morphological analysis, verb conjugation, or noun declension.
I'm building LexicRo to fill that gap. Pre-development stage, looking for honest feedback on the approach.
Planned endpoints:
- POST /analyze: token-level morphological analysis (lemma, POS, case, gender, number, person, tense)
- GET /conjugate/{verb}: full conjugation table across all moods and tenses
- GET /inflect/{word}: all inflected forms of a noun or adjective
- GET /lookup/{word}: lexical data from DEXonline
- POST /difficulty: CEFR level scoring calibrated to Romanian B1/B2 exams
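To make the /analyze output more concrete, here is a rough sketch of what one response could look like. Nothing is implemented yet, so the field names, the `TokenAnalysis` class, and the `to_response` helper are all placeholders, and the example values are hand-written rather than model output.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TokenAnalysis:
    # Hypothetical response fields, mirroring the attributes listed for /analyze.
    form: str
    lemma: str
    pos: str
    case: Optional[str] = None
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None


def to_response(tokens):
    """Serialise analyses to a JSON body, dropping attributes that do not apply."""
    payload = {
        "tokens": [
            {key: value for key, value in asdict(tok).items() if value is not None}
            for tok in tokens
        ]
    }
    return json.dumps(payload, ensure_ascii=False)


# Hand-written illustration for "Fetele cântă" ("The girls sing").
example = [
    TokenAnalysis(form="Fetele", lemma="fată", pos="NOUN",
                  case="Nom", gender="Fem", number="Plur"),
    TokenAnalysis(form="cântă", lemma="cânta", pos="VERB",
                  number="Plur", person="3", tense="Pres"),
]
print(to_response(example))
```

Dropping the attributes that don't apply keeps noun tokens free of person/tense keys, but returning explicit nulls would be just as defensible. Happy to hear which one API consumers would prefer.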
Technical approach:
- Fine-tuning bert-base-romanian-cased-v1 for morphological tagging
- verbecc Romanian XML templates for conjugation (extended)
- Training data: UD Romanian Treebank + RoLEX + DEXonline dump
- FastAPI service, Docker, OpenAPI spec
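For the treebank side of the training data, this is roughly how I'd expect to pull tagging labels out of the UD files. It's a minimal sketch that assumes the standard 10-column CoNLL-U layout. `read_conllu` and `parse_feats` are my own helper names, and the sample sentence is hand-written, not copied from the treebank.

```python
def parse_feats(feats):
    """Turn 'Case=Nom|Number=Plur' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(lines):
    """Yield sentences as lists of (form, lemma, upos, feats) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "3-4") and empty nodes (e.g. "5.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], cols[2], cols[3], parse_feats(cols[5])))
    if sentence:
        yield sentence


# Hand-written sample in CoNLL-U layout (illustrative only).
sample = [
    "# text = Fetele cântă.",
    "1\tFetele\tfată\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Plur\t2\tnsubj\t_\t_",
    "2\tcântă\tcânta\tVERB\t_\tNumber=Plur|Person=3|Tense=Pres\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
]
sentences = list(read_conllu(sample))
print(sentences[0][0])
```

Each tuple gives the surface form plus the lemma, UPOS, and feature dict that would serve as tagging targets. How RoLEX and DEXonline entries map onto the same label set is still an open question on my end.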
Licence: MIT code, CC BY-NC model weights (free for research). Free tier: 1,000 req/day.
Phase 1 (conjugation + lexical lookup) ships in ~3 months. Morphological analyser follows in phase 2.
Questions I'm genuinely trying to answer:
- Is fine-tuning Romanian BERT on the UD treebank (~9k sentences) going to give reliable enough morphological tagging for production use, or do I need more data?
- Anyone worked with the RoLEX dataset — is the morphosyntactic annotation consistent enough to use as training data directly?
- Are there Romanian NLP resources I'm missing that would be worth incorporating?
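On the first question, one way I could pin down "reliable enough" is to score a held-out split both on full-tag exact match and per feature. The sketch below is only an assumption about how I'd measure it. `tagging_accuracy` is a hypothetical helper, and the gold/pred values are toy data, not results.

```python
def tagging_accuracy(gold, pred):
    """Compare parallel lists of per-token feature dicts.

    Returns (exact_match, per_feature), where exact_match is the share of
    tokens whose whole feature dict is right, and per_feature maps each
    gold feature name to the share of its gold occurrences predicted correctly.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    totals, hits = {}, {}
    for g, p in zip(gold, pred):
        for feat, value in g.items():
            totals[feat] = totals.get(feat, 0) + 1
            if p.get(feat) == value:
                hits[feat] = hits.get(feat, 0) + 1
    per_feature = {feat: hits.get(feat, 0) / n for feat, n in totals.items()}
    return exact, per_feature


# Toy data: the first token has the wrong case, everything else matches.
gold = [{"Case": "Nom", "Number": "Plur"}, {"Number": "Plur", "Person": "3"}]
pred = [{"Case": "Acc", "Number": "Plur"}, {"Number": "Plur", "Person": "3"}]
exact, per_feature = tagging_accuracy(gold, pred)
print(exact, per_feature)
```

Splitting it this way would show whether errors concentrate in one feature, such as case, or are spread evenly. That seems more useful than a single number when deciding whether more data is needed.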
Site: lexicro.com | GitHub: github.com/LexicRo
u/gofractal — 3 days ago