LLM-as-a-Judge is convenient, but reproducibility is a real issue — what are the alternatives?
Reproducibility in automatic text evaluation is becoming a real challenge. If you've used LLMs or similar models as automated judges for summarization, translation, or QA, you've probably noticed the pattern: change the prompt slightly and the scores shift; run the judge on non-English languages and quality drops; try to replicate someone else's setup and you get different numbers. It's convenient, but hard to reproduce.
The question we kept coming back to: do you actually need a frontier LLM to evaluate generated text well, or is that just the path of least resistance?
We trained a family of small deterministic models (<1B parameters) called OmniScore that approximate LLM-judge behavior without the reproducibility headaches.
A few details that might be interesting:
- Trained on ~564k synthetic instances across 107 languages — most evaluation work is still very English-heavy, which is a real gap
- Evaluated on 8,617 manually annotated examples across QA, translation, and summarization in 6 languages
- Supports reference-based, source-grounded, and hybrid scoring modes
- Deterministic by design — same input, same score, every time
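To make the three scoring modes concrete, here is a minimal sketch of how they differ in the grounding text each one requires. The function name, input names, and validation logic are hypothetical illustrations, not the actual OmniScore API:

```python
from typing import Optional

# Hypothetical sketch -- not the real OmniScore interface.
# What varies across the three modes is which grounding text is required:
#   reference-based  -> a gold reference only
#   source-grounded  -> the source document only
#   hybrid           -> both
REQUIRED_INPUTS = {
    "reference-based": {"reference"},
    "source-grounded": {"source"},
    "hybrid": {"reference", "source"},
}

def validate_inputs(mode: str, candidate: str,
                    reference: Optional[str] = None,
                    source: Optional[str] = None) -> None:
    """Raise ValueError if the grounding texts required by `mode` are missing."""
    provided = {name for name, value in
                (("reference", reference), ("source", source)) if value}
    missing = REQUIRED_INPUTS[mode] - provided
    if missing:
        raise ValueError(f"mode {mode!r} requires: {sorted(missing)}")
```

Determinism then follows from the model itself: with temperature-free inference, scoring the same (mode, candidate, grounding) tuple always yields the same number, so scores can be cached and replicated exactly.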
The gap we're trying to fill sits between two unsatisfying options: frontier LLM judges (flexible but expensive and inconsistent) and traditional metrics like BLEU/ROUGE (cheap but unable to capture semantics). Our results suggest lightweight learned metrics can close much of that gap.
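To see why surface-overlap metrics fall short on semantics, here is a toy unigram-precision check (a deliberate simplification of BLEU, which adds higher-order n-grams, clipping, and a brevity penalty, but shares the same failure mode): a faithful paraphrase with no word overlap scores zero, while a word-scrambled sentence with the opposite meaning scores perfectly.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that appear in the reference.

    A stripped-down stand-in for BLEU-style overlap metrics.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = set(reference.lower().split())
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)

reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"   # same meaning, zero overlap
scrambled = "the mat sat on the cat"        # wrong meaning, full overlap

print(unigram_precision(paraphrase, reference))  # 0.0
print(unigram_precision(scrambled, reference))   # 1.0
```

A learned metric that scores at the semantic level should rank the paraphrase above the scrambled sentence, which is exactly what lexical overlap cannot do.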