u/gofractal

By my rough estimate, Romanian NLP tooling covers about 15% of what exists for English. Academic resources exist (DEXonline, RoLEX, UD Romanian Treebank), but there's no production-ready REST API for morphological analysis, verb conjugation, or noun declension.

I'm building LexicRo to fill that gap. It's at the pre-development stage, and I'm looking for honest feedback on the approach.

Planned endpoints:

POST /analyze — token-level morphological analysis (lemma, POS, case, gender, number, person, tense)
GET /conjugate/{verb} — full conjugation table across all moods and tenses
GET /inflect/{word} — all inflected forms of a noun or adjective
GET /lookup/{word} — lexical data from DEXonline
POST /difficulty — CEFR level scoring calibrated to Romanian B1/B2 exams
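
To make the /analyze contract concrete, here is a rough sketch of what a per-token response could look like. Nothing here is final: the field names, the `TokenAnalysis` class and the `to_response` helper are placeholders I made up for illustration, the tag values follow Universal Dependencies conventions, and the example values are hand-written rather than model output.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class TokenAnalysis:
    """One token in a hypothetical POST /analyze response."""
    form: str
    lemma: str
    pos: str
    # Features that do not apply to a token (e.g. tense on a noun) stay None.
    case: Optional[str] = None
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None


def to_response(tokens: List[TokenAnalysis]) -> str:
    """Serialise the analyses as a JSON body, keeping diacritics readable."""
    return json.dumps({"tokens": [asdict(t) for t in tokens]}, ensure_ascii=False)


# Hand-written analysis of "Copiii citesc" ("The children read").
example = [
    TokenAnalysis(form="Copiii", lemma="copil", pos="NOUN",
                  case="Nom", gender="Masc", number="Plur"),
    TokenAnalysis(form="citesc", lemma="citi", pos="VERB",
                  number="Plur", person="3", tense="Pres"),
]
body = to_response(example)
print(body)
```

Keeping inapplicable features as explicit nulls rather than dropping the keys would give clients a fixed schema, but that is a design choice I'm still open on.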

Technical approach:

Fine-tuning bert-base-romanian-cased-v1 for morphological tagging
verbecc Romanian XML templates for conjugation (extended)
Training data: UD Romanian Treebank + RoLEX + DEXonline dump
FastAPI service, Docker, OpenAPI spec
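
For the tagging data, the UD treebank is distributed in CoNLL-U format (ten tab-separated columns per token, with morphological features in the sixth). A minimal sketch of turning that into (form, lemma, POS, features) records might look like the code below. The function names are my own, the sample sentence is hand-written for illustration and not copied from the treebank, and a real pipeline would probably use an existing CoNLL-U library instead.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Acc,Nom|Number=Plur' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(text: str) -> List[List[dict]]:
    """Read CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed rows in this sketch
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        current.append({
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": parse_feats(cols[5]),
        })
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in CoNLL-U layout (illustrative only).
SAMPLE = "\n".join([
    "# text = Copiii citesc.",
    "\t".join(["1", "Copiii", "copil", "NOUN", "_",
               "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Plur",
               "2", "nsubj", "_", "_"]),
    "\t".join(["2", "citesc", "citi", "VERB", "_",
               "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin",
               "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = read_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["upos"], token["feats"])
```

Each token's feature dict would then be flattened into the label set for the tagging head, either as one joint label or as one classifier per feature.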

Licence: MIT code, CC BY-NC model weights (free for research). Free tier: 1,000 req/day.

Phase 1 (conjugation + lexical lookup) ships in ~3 months. Morphological analyser follows in phase 2.

Questions I'm genuinely trying to answer:

Will fine-tuning Romanian BERT on the UD treebank (~9k sentences) give morphological tagging that is reliable enough for production use, or do I need more data?
Has anyone worked with the RoLEX dataset? Is the morphosyntactic annotation consistent enough to use directly as training data?
Are there Romanian NLP resources I'm missing that would be worth incorporating?

Site: lexicro.com | GitHub: github.com/LexicRo

Building an open-core Romanian morphological analysis API — looking for feedback