
what if it’s a procedural grammar instead of a natural language?
I’ve been playing with the idea that the Voynich manuscript might not be a natural language at all, but something closer to a procedural system — basically a compact grammar that generates instruction-like tokens instead of sentences with meaning in the usual sense.
This is not a claim that I’ve “decoded” anything. It’s just an attempt to model the structure and see if it behaves like a system that produces consistent outputs.
What pushed me in this direction is how regular the EVA text looks at the string level. There’s a lot of repetition, small variations between words, and patterns that look almost like prefixes and suffixes (qo…, …dy, …iin, …aiin). It doesn’t feel like normal language variation — it feels more like something rule-based.
So I tried to formalize that with a simple grammar. It's basically a small set of building blocks (prefix-like elements, repeated vowel runs, and common endings), plus some connectors. When you run it over the EVA transcription, it covers a large part of the text's structure: most of the recurring patterns and a big chunk of the tokens (exact numbers further down).
Then, just as a way to test whether this structure can produce something coherent, I projected it into a procedural domain — in this case fermentation, mostly because it has very clear step-by-step processes.
For example, take a line like:
qokeedy qokedy chedy daiin
You can segment it pretty mechanically:
- qo + k + ee + dy
- qo + k + e + dy
- ch + e + dy
- p + aiin (after normalizing d → p)
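The mechanical split can be sketched as a greedy longest-match pass over a small piece inventory. The inventory below is just my reading of the building blocks (plus the leading d → p normalization), not a fixed list:

```python
# Greedy longest-match segmentation into the building blocks described above.
# The piece inventory is illustrative; unknown glyphs are passed through as-is.
PIECES = ["qo", "cth", "ckh", "cph", "cfh", "ch", "sh",
          "aiin", "eee", "iii", "iin", "ee", "ii", "ai", "dy",
          "k", "t", "p", "f", "e", "i", "a", "q", "o",
          "l", "r", "n", "s", "m"]

def segment(word):
    # normalize a leading d (but not the dy suffix) to p, as in the example
    if word.startswith("d") and not word.startswith("dy"):
        word = "p" + word[1:]
    out, i = [], 0
    while i < len(word):
        for piece in sorted(PIECES, key=len, reverse=True):
            if word.startswith(piece, i):
                out.append(piece)
                i += len(piece)
                break
        else:
            out.append(word[i])  # unknown glyph, keep as a single symbol
            i += 1
    return out

for w in ["qokeedy", "qokedy", "chedy", "daiin"]:
    print(w, "→", " + ".join(segment(w)))
```

Greedy matching occasionally differs from the grammar's own parse (e.g. it eats "aiin" as one piece rather than V + T), but for the examples above it reproduces the split exactly.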
If you force it into something concrete (again, just as a test, not a claim), it looks like a simple process workflow. For instance, in a fermentation-like setup: prepare a base liquid, add something, repeat, introduce the main ingredient, then let it run for a longer phase.
After clustering the "words" to detect patterns, analysing the syntax, and testing a lot of regular expressions, the working grammar looks like this:
S → W | W S
W → CORE | CORE C W
CORE → P BLOCK T | P BLOCK | BLOCK T | BLOCK
P → qo | q | o
BLOCK → UNIT | UNIT BLOCK
UNIT → M | B | G | V
M → k | t | p | f
B → ch | sh
G → cth | ckh | cph | cfh
V → e | ee | eee | i | ii | iii | a | ai | aiin
T → dy | iin | aiin | ε
C → l | r | n | s | m
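As a sanity check that this grammar really generates Voynich-looking tokens, here is a direct sketch of it as a random word generator. The weights and the 1–3 unit cap are my arbitrary choices, not part of the model:

```python
import random

# Terminal inventories lifted straight from the grammar above.
P = ["qo", "q", "o"]
M, B = ["k", "t", "p", "f"], ["ch", "sh"]
G = ["cth", "ckh", "cph", "cfh"]
V = ["e", "ee", "eee", "i", "ii", "iii", "a", "ai", "aiin"]
T = ["dy", "iin", "aiin", ""]  # "" is the ε suffix

def unit():
    # UNIT → M | B | G | V, each class picked uniformly (arbitrary choice)
    return random.choice(random.choice([M, B, G, V]))

def core():
    # CORE → [P] BLOCK [T], with BLOCK capped at 1–3 units to bound recursion
    prefix = random.choice(P) if random.random() < 0.5 else ""
    block = "".join(unit() for _ in range(random.randint(1, 3)))
    return prefix + block + random.choice(T)

random.seed(0)
print(" ".join(core() for _ in range(6)))
```

Every string this emits is derivable from the grammar, so eyeballing its output against real EVA lines is a quick plausibility test.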
So the working assumptions are:
- the text might encode processes rather than narrative
- repeated vowel groups could be acting as levels or modifiers
- the system might be domain-agnostic (fermentation is just a convenient test case)
- and possibly each page represents variations around a main element (suggested by the drawings)
When I tested the grammar against EVA text, it actually covered a large portion of what’s going on structurally. Roughly speaking, it can generate around 85–90% of the tokens and over 95% of the recurring patterns, even though it only captures about 65–80% of the distinct word forms. So it’s definitely not modeling the full vocabulary, but it does seem to capture most of the repetitive structure of the manuscript, which is the part that feels least like a natural language.
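For anyone who wants to reproduce that kind of number, a minimal coverage check can be done by flattening the grammar into one regex and counting token vs. distinct-form matches separately (the regex here is my own flattening of the grammar, and `eva_tokens` is assumed to be the transcription split on word separators):

```python
import re
from collections import Counter

# One-regex form of the word grammar: optional prefix, one or more units,
# optional suffix. Longer alternatives come first so they win the match.
WORD = re.compile(
    r"^(?:qo|q|o)?"
    r"(?:cth|ckh|cph|cfh|ch|sh|[ktpf]|e{1,3}|i{1,3}|aiin|ai|a)+"
    r"(?:dy|iin|aiin)?$"
)

def coverage(eva_tokens):
    # token coverage: fraction of running tokens the grammar accepts
    tok = sum(1 for w in eva_tokens if WORD.match(w)) / len(eva_tokens)
    # type coverage: fraction of distinct word forms it accepts
    forms = Counter(eva_tokens)
    typ = sum(1 for w in forms if WORD.match(w)) / len(forms)
    return tok, typ

print(coverage(["qokeedy", "qokedy", "chedy", "qokeedy", "xydz"]))
```

The gap between the two numbers is exactly the effect described above: repeated forms inflate token coverage relative to type coverage.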
Using a fermentation domain to "decode" the terminals:
Action / ingredient markers:

- qo = liquid base / must / water
- q = general base marker
- o = mix / transfer / continuity
- k = sugars / fermentables
- t = heat / cooking
- p = yeast / fermentation start
- ch = main plant (always substituted with a safe edible plant)
- sh = secondary herb (safe edible)
- f = aroma modifier
- cth, ckh, cph, cfh = complex herbal compound (safe blend)

Connectors (low semantic weight, used as transitions):

- l, r, n, s, m

State / time markers. We treat the first vowel run found in a word as an intensity/state cue:

- e… = active extraction
- i… = cooling/rest
- a… = fermentation start/transition

Run length encodes the level:

- e / i / a = level 1
- ee / ii = level 2
- eee / iii = level 3

Suffixes:

- dy = interpret the vowel-run length as days
- iin = medium fermentation phase (heuristic)
- aiin = long fermentation/aging phase (heuristic)
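Those tables can be wired into a toy single-token interpreter in a few lines. Only the splitting logic mirrors the model; the step wording and the simplifications (first vowel run only, dy handled but not iin/aiin) are mine:

```python
import re

# Terminal-to-step tables, following the mappings above (wording illustrative).
ACTIONS = {
    "qo": "prepare liquid base", "q": "prepare base", "o": "mix/transfer",
    "k": "add fermentables", "t": "apply heat", "p": "start fermentation",
    "ch": "add main plant", "sh": "add secondary herb", "f": "add aroma modifier",
    "cth": "add herbal blend", "ckh": "add herbal blend",
    "cph": "add herbal blend", "cfh": "add herbal blend",
}
STATES = {"e": "active extraction", "i": "cooling/rest", "a": "fermentation start"}

def interpret(word):
    steps = []
    m = re.match(r"qo|q|o", word)            # action/ingredient prefix
    if m:
        steps.append(ACTIONS[m.group()])
        word = word[m.end():]
    for u in re.findall(r"cth|ckh|cph|cfh|ch|sh|[ktpf]", word):
        steps.append(ACTIONS[u])             # remaining consonant units
    run = re.search(r"e+|i+|a+", word)       # first vowel run = state + level
    if run:
        steps.append(f"{STATES[run.group()[0]]}, level {len(run.group())}")
        if word.endswith("dy"):              # dy: read run length as days
            steps.append(f"for {len(run.group())} day(s)")
    return steps

print(interpret("qokeedy"))
```

On the sample line, `qokeedy` comes out as prepare liquid base → add fermentables → active extraction, level 2 → for 2 day(s), which is the first sentence of the translation below.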
So a recipe like this:
qokeedy qokedy chedy daiin
Translates to:
"Start by preparing a liquid base and adding a fermentable component, letting it sit in an active phase for two days. Repeat the same step at a lower intensity for one additional day. Then introduce the main ingredient from the mixture and allow it to infuse for a day. Finally, begin the fermentation process and let it continue through a longer resting phase."
Treating each line as one recipe yields 5,385 distinct “recipes” across the manuscript in this dataset; 1,065 come out as coherent (internally consistent enough to map cleanly onto a plausible fermentation/drink-making workflow), while 4,320 are “partial” (they still parse, but the model lacks strong anchors—most often the page’s plant category is unknown/low-confidence—so timings/ingredients are more inferential).
I’m curious whether this aligns with other “non-language” ideas people have explored (automata, encodings, etc.), or if this just collapses into overfitting.
If anyone’s interested, I put a small repo together with the grammar and a toy interpreter:
https://github.com/jfrez/voynich-manuscript-translator
Happy to hear thoughts, especially on whether this kind of structural model makes sense or if I’m missing something obvious.