u/Equivalent-Gas2856

I’ve been experimenting with language-model-based compression for source code, using a simple n-gram model combined with arithmetic coding.

I'll make a repo soon.

The setup is straightforward: tokenize Python source, estimate P(x_t \mid x_{t-n+1:t-1}) using an order-4 n-gram model, and feed those probabilities into an arithmetic coder. The coder converts the predicted distribution into a bitstream, so compression performance is directly tied to how well the model estimates token probabilities.
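For concreteness, the modeling side boils down to something like this (simplified sketch, not the actual code; add-one smoothing here is just a stand-in for whatever scheme you'd actually use):

```python
import io
import tokenize
from collections import Counter, defaultdict

def tokens_of(source: str) -> list[str]:
    """Lex Python source into a flat list of token strings."""
    return [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)]

class NGramModel:
    """Order-n count model over tokens with add-one smoothing."""
    def __init__(self, n: int = 4):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> Counter of next tokens
        self.vocab = set()

    def train(self, tokens: list[str]) -> None:
        seq = ("<s>",) * (self.n - 1) + tuple(tokens)  # pad so early tokens have context
        self.vocab.update(tokens)
        for i in range(len(tokens)):
            self.counts[seq[i : i + self.n - 1]][seq[i + self.n - 1]] += 1

    def dist(self, ctx: tuple) -> dict:
        """P(next token | last n-1 tokens), add-one smoothed over the vocab."""
        c = self.counts[tuple(ctx)[-(self.n - 1):]]
        total = sum(c.values()) + len(self.vocab)
        return {w: (c[w] + 1) / total for w in self.vocab}
```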

I tested this on the Flask codebase (~575 KB of .py files):

  • n-gram + arithmetic coding → 101 KB (0.176x, 82.4% reduction)
  • zlib / ZIP → 151 KB (0.263x)
  • lzma / 7z → 152 KB (0.265x)
  • zstd → 147 KB (0.256x)

So the compressed output is roughly 33% smaller than zlib's.
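The baselines are easy to reproduce from Python (assuming the zstandard binding for zstd; flask_concat.py stands in for the concatenated corpus):

```python
import lzma
import zlib
import zstandard  # pip install zstandard

# Hypothetical file: all of Flask's .py sources concatenated into one blob.
data = open("flask_concat.py", "rb").read()

for name, compress in [
    ("zlib", lambda b: zlib.compress(b, level=9)),
    ("lzma", lambda b: lzma.compress(b, preset=9)),
    ("zstd", lambda b: zstandard.ZstdCompressor(level=19).compress(b)),
]:
    out = compress(data)
    print(f"{name:5s} {len(out):7d} B  ratio {len(out) / len(data):.3f}x")
```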

This isn’t too surprising in hindsight: general-purpose compressors operate at the byte level, while even a small language model captures strong structure at the token level (e.g., syntactic patterns like def → identifier, return → expression, etc.). Arithmetic coding then translates that predictive structure into bits.
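The key identity: an ideal arithmetic coder spends about -log2(p) bits on a token the model assigned probability p, so the compressed size is essentially the model's cross-entropy over the token stream. A quick sanity check:

```python
import math

# An ideal arithmetic coder spends ~ -log2(p) bits per token, so total
# output size tracks the model's cross-entropy on the stream.
probs = [0.9, 0.5, 0.01]  # model's probability for each actual next token
costs = [-math.log2(p) for p in probs]
print([round(c, 2) for c in costs])  # [0.15, 1.0, 6.64]
print(round(sum(costs), 2))          # ~7.8 bits for these three tokens
```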

Architecture-wise, the model and tokenizer are implemented in Python, while the arithmetic coder is written in Zig, compiled to a shared library, and called via ctypes. The coder consumes a probability vector (f32[]) per token and maintains the range state / bitstream.
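The boundary looks roughly like this (the symbol names below are illustrative, not the actual exported interface):

```python
import ctypes

# Hypothetical binding: the exported Zig symbols aren't shown here, only
# the shape of the interface (an f32[] probability vector per token, with
# the range/bitstream state kept on the Zig side).
lib = ctypes.CDLL("./libarith.so")
lib.ac_encode_symbol.argtypes = [
    ctypes.POINTER(ctypes.c_float),  # probability vector over the vocabulary
    ctypes.c_size_t,                 # vocabulary size
    ctypes.c_size_t,                 # index of the token that actually occurred
]
lib.ac_encode_symbol.restype = None

def encode_step(probs: list[float], symbol: int) -> None:
    """Hand one token's distribution to the coder."""
    buf = (ctypes.c_float * len(probs))(*probs)
    lib.ac_encode_symbol(buf, len(probs), symbol)
```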

The main limitation is speed: encoding is currently ~1600× slower than zlib (~75s vs ~0.05s). The bottleneck is per-token probability computation in Python with no caching or batching.
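One obvious first step would be memoizing context → distribution, since contexts in source code repeat constantly. Something like (building on the NGramModel sketch above):

```python
from functools import lru_cache

# `model` is the trained NGramModel from the earlier sketch; `order` fixes
# a vocabulary ordering so the cached vector lines up with the coder's f32[].
order = sorted(model.vocab)

@lru_cache(maxsize=1 << 16)
def cached_probs(ctx: tuple) -> tuple:
    """Memoized probability vector for a repeated context."""
    d = model.dist(ctx)
    return tuple(d[w] for w in order)
```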

This is obviously a very simple model (n-grams with small context), so it’s essentially a lower bound on what a stronger model could achieve. It would be interesting to see how far a small LSTM or transformer could push compression ratio vs. the added compute cost, especially if predictions are batched.

Curious if others have explored similar “LM + entropy coding” setups specifically for source code, and whether there are practical approaches to making them competitive on speed as well as ratio.
