u/Different-Antelope-5

I built a small structural gate for LLM outputs. It does not check truth.

I am working on a small project called OMNIA.

The idea is simple:

OMNIA does not try to decide whether an answer is true.

It checks whether an output is structurally admissible.

Examples of outputs it would flag:

- incomplete answer

- expression instead of final answer

- wrong output format

- partial answer

- instability under small variations
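
To make this concrete, here is a minimal sketch of what such checks could look like. This is my illustration only, not OMNIA's actual implementation; the check names and the optional `expected_format` regex are hypothetical stand-ins.

```python
import re

def structural_gate(output: str, expected_format: str | None = None) -> list[str]:
    """Return structural violations; an empty list means the output is admissible.
    Illustrative sketch only -- not OMNIA's actual checks."""
    violations = []
    text = output.strip()

    # Incomplete / partial answer: empty or visibly truncated output
    if not text:
        violations.append("empty_output")
    elif text.endswith(("...", ",", " and", " or")):
        violations.append("possibly_truncated")

    # Expression instead of a final answer, e.g. "2 + 3" with no evaluated result
    if text and re.fullmatch(r"[\d\s+\-*/().]+", text) \
            and not re.fullmatch(r"-?\d+(\.\d+)?", text):
        violations.append("unevaluated_expression")

    # Wrong output format, when the caller declares one (a regex stands in here)
    if expected_format and not re.fullmatch(expected_format, text):
        violations.append("format_mismatch")

    return violations

# A well-formed but wrong answer still passes -- that boundary is intentional:
structural_gate("2 + 3")   # ["unevaluated_expression"]
structural_gate("9")       # [] (admissible, even though 2 + 3 != 9)
```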

These are not the same thing as semantic correctness.

A well-formed wrong answer can still pass OMNIA.

That is intentional.

The current boundary is:

structural validity != semantic correctness

So the intended pipeline is:

LLM -> OMNIA -> semantic evaluator -> decision
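
A minimal sketch of that wiring, where all four names are placeholders for whatever sits at each stage, not OMNIA's API:

```python
def decide(prompt, llm, structural_gate, semantic_eval):
    """LLM -> structural gate -> semantic evaluator -> decision."""
    output = llm(prompt)
    if structural_gate(output):          # any violation => structurally inadmissible
        return "NO_GO", output           # rejected before truth is ever considered
    if not semantic_eval(prompt, output):
        return "REJECT", output          # admissible in form, judged wrong in content
    return "ACCEPT", output
```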

I tested this through several small validation stages.

The most useful result so far is that the V9 stage catches structural incompleteness and malformed outputs while keeping the scope explicitly limited.

Repo:

https://github.com/Tuttotorna/OMNIA

DOI:

https://doi.org/10.5281/zenodo.19739481

This is still early work.

I am looking for criticism on the boundary, not hype.

u/Different-Antelope-5 — 2 days ago

Testing a structural gate for unreliable LLM outputs

I am working on OMNIA, a small structural measurement layer for model outputs.

This is very early work in progress.

The goal is not to claim that all LLMs fail on simple tasks, and this is not a benchmark.

So far I have tested the gate on a small local model, google/flan-t5-base, using 16 controlled QA, reasoning, and RAG cases.
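
For anyone who wants to reproduce the raw generations, the model runs locally with the Hugging Face transformers library. This snippet is just the obvious way to query it; it is not code from the OMNIA repo:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def generate(prompt: str) -> str:
    # Greedy decoding (the default) keeps runs deterministic for a small eval
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```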

Raw model result: 6 / 16 correct (accuracy 0.375)

OMNIA Gate V7: GO: 6, NO_GO: 10

Alignment with observed errors: TP: 10, FN: 0, FP: 0
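
To be explicit about the convention, "flag" is the positive class here: TP = wrong output flagged NO_GO, FN = wrong output let through as GO, FP = correct output blocked. A minimal sketch of the scoring, with my own variable names:

```python
def alignment(decisions, answer_correct):
    """decisions: "GO"/"NO_GO" per case; answer_correct: bool per case."""
    pairs = list(zip(decisions, answer_correct))
    tp = sum(d == "NO_GO" and not ok for d, ok in pairs)  # wrong output, flagged
    fn = sum(d == "GO" and not ok for d, ok in pairs)     # wrong output, missed
    fp = sum(d == "NO_GO" and ok for d, ok in pairs)      # correct output, blocked
    return tp, fn, fp
```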

The point of this test is narrow:

when this model produced wrong or unreliable outputs, could a structural gate flag them without blocking the correct ones?

In this small run, yes.

That does not prove generality.

It only gives a minimal reproducible starting point.

The next step is to test stronger models, harder datasets, and controlled variations of the same question to measure output divergence.
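
As a sketch of what that divergence measurement might look like (assuming a generate function like the one above; the metric itself is an illustration, not settled OMNIA behavior):

```python
def divergence(variants, generate):
    """Fraction of paraphrase variants whose normalized answer differs from
    the most common answer. Illustrative metric, not a fixed part of OMNIA."""
    answers = [generate(v).strip().lower() for v in variants]
    modal = max(set(answers), key=answers.count)
    return sum(a != modal for a in answers) / len(answers)

variants = ["What is 7 * 8?", "Compute 7 times 8.", "7*8 = ?"]
# divergence near 0 => stable under rephrasing; high => flag for instability
```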

Repo:

https://github.com/Tuttotorna/OMNIA

DOI:

https://doi.org/10.5281/zenodo.19725235

I am sharing this as work in progress and would welcome criticism, especially on how to make the validation harder and less toy-like.
