u/gofractal

Building an open-core Romanian morphological analysis API — looking for feedback

By my rough estimate, Romanian NLP tooling covers about 15% of what exists for English. The academic resources exist (DEXonline, RoLEX, UD Romanian Treebank), but there's no production-ready REST API for morphological analysis, verb conjugation, or noun declension.

I'm building LexicRo to fill that gap. It's at the pre-development stage, and I'm looking for honest feedback on the approach.

Planned endpoints:

  • POST /analyze — token-level morphological analysis (lemma, POS, case, gender, number, person, tense)
  • GET /conjugate/{verb} — full conjugation table across all moods and tenses
  • GET /inflect/{word} — all inflected forms of a noun or adjective
  • GET /lookup/{word} — lexical data from DEXonline
  • POST /difficulty — CEFR level scoring calibrated to Romanian B1/B2 exams
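
To make the /analyze output concrete, here is a minimal sketch of what a response could look like. Nothing here is final: the `TokenAnalysis` class, the `to_response` helper, the JSON field names, and the feature values are all hypothetical, and the two example tokens are hand-annotated for illustration, not produced by a model.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional
import json


@dataclass
class TokenAnalysis:
    """One analysed token. Field names mirror the /analyze bullet and are hypothetical."""
    form: str
    lemma: str
    pos: str
    case: Optional[str] = None
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None


def to_response(tokens: List[TokenAnalysis]) -> str:
    """Serialise tokens to JSON, dropping features that do not apply to a token."""
    payload = {
        "tokens": [
            {key: value for key, value in asdict(tok).items() if value is not None}
            for tok in tokens
        ]
    }
    # ensure_ascii=False keeps Romanian diacritics readable in the output.
    return json.dumps(payload, ensure_ascii=False)


# Hand-written example for "Copiii citesc" ("The children read").
example = [
    TokenAnalysis(form="Copiii", lemma="copil", pos="NOUN",
                  case="Nom", gender="Masc", number="Plur"),
    TokenAnalysis(form="citesc", lemma="citi", pos="VERB",
                  number="Plur", person="3", tense="Pres"),
]
print(to_response(example))
```

Omitting inapplicable features (rather than sending nulls) is just one option; it keeps payloads small, but clients then have to treat a missing key as "not applicable".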

Technical approach:

  • Fine-tuning bert-base-romanian-cased-v1 for morphological tagging
  • verbecc Romanian XML templates for conjugation (extended)
  • Training data: UD Romanian Treebank + RoLEX + DEXonline dump
  • FastAPI service, Docker, OpenAPI spec
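
As a rough idea of the data-prep step for the tagger, the sketch below reads CoNLL-U text (the 10-column, tab-separated format UD treebanks are distributed in) and turns each sentence into token and label lists. Joining UPOS and FEATS into a single label is only one possible encoding and is an assumption here, as are the names `read_conllu` and `to_tagging_example`. The sample sentence is hand-written, not copied from the treebank.

```python
from typing import Iterator, List, Tuple

# (form, lemma, upos, feats) for each token
Sentence = List[Tuple[str, str, str, str]]


def read_conllu(text: str) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-U text as lists of (form, lemma, upos, feats)."""
    sentence: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed rows
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        sentence.append((cols[1], cols[2], cols[3], cols[5]))
    if sentence:
        yield sentence


def to_tagging_example(sentence: Sentence) -> Tuple[List[str], List[str]]:
    """Build (tokens, labels), using 'UPOS|FEATS' as one joint label per token."""
    tokens = [form for form, _, _, _ in sentence]
    labels = [
        upos if feats == "_" else f"{upos}|{feats}"
        for _, _, upos, feats in sentence
    ]
    return tokens, labels


# Hand-written sample in CoNLL-U layout (illustrative annotation only).
SAMPLE = (
    "# text = Copiii citesc.\n"
    "1\tCopiii\tcopil\tNOUN\t_\tCase=Acc,Nom|Definite=Def|Gender=Masc|Number=Plur\t2\tnsubj\t_\t_\n"
    "2\tcitesc\tciti\tVERB\t_\tMood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

for sent in read_conllu(SAMPLE):
    print(to_tagging_example(sent))
```

A joint label keeps the model a plain token classifier but makes the label set large and sparse; predicting each feature with its own head is the usual alternative and might cope better with ~9k sentences.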

Licence: MIT code, CC BY-NC model weights (free for research). Free tier: 1,000 req/day.
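
For the free tier, the simplest thing I can think of is a fixed daily window per API key. The sketch below is in-memory only and assumes the window resets at midnight UTC; the `DailyQuota` class and its `allow` method are hypothetical names, and a real deployment would presumably keep the counters in a shared store.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

DAILY_LIMIT = 1000  # free-tier figure from the post


class DailyQuota:
    """In-memory fixed-window counter keyed by (api_key, UTC date)."""

    def __init__(self, limit: int = DAILY_LIMIT) -> None:
        self.limit = limit
        self._counts: Dict[Tuple[str, str], int] = defaultdict(int)

    def allow(self, api_key: str, now: Optional[datetime] = None) -> bool:
        """Return True and count the request if the key is under today's limit."""
        # Caller-supplied timestamps are assumed to already be in UTC.
        now = now or datetime.now(timezone.utc)
        window = (api_key, now.strftime("%Y-%m-%d"))
        if self._counts[window] >= self.limit:
            return False
        self._counts[window] += 1
        return True


quota = DailyQuota()
print(quota.allow("demo-key"))
```

Counters for past days are never evicted in this sketch, and a fixed window lets a client burst up to twice the limit across midnight; both would need handling before production.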

Phase 1 (conjugation + lexical lookup) ships in ~3 months. The morphological analyser follows in Phase 2.

Questions I'm genuinely trying to answer:

  1. Is fine-tuning Romanian BERT on the UD treebank (~9k sentences) likely to give morphological tagging reliable enough for production use, or do I need more data?
  2. Has anyone worked with the RoLEX dataset? Is the morphosyntactic annotation consistent enough to use directly as training data?
  3. Are there Romanian NLP resources I'm missing that would be worth incorporating?

Site: lexicro.com | GitHub: github.com/LexicRo

u/gofractal — 3 days ago