r/compression

Basically, the database is full of the most common phrases, each paired with a unique ID. On average it seems like I can compress my messages to half the size. I wasn't really aiming to do this; I was just trying to make a code book, this was a byproduct, and I thought it might be interesting to share.
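A toy version of what's going on (the phrase list and ID framing here are made up; the real database holds far more entries):

    # common phrases map to short framed IDs; decoding is the reverse substitution
    PHRASES = ["thank you for your", "let me know if", "as soon as possible"]
    CODES = {p: f"\x01{i}\x02" for i, p in enumerate(PHRASES)}

    def encode(text):
        for phrase, code in CODES.items():
            text = text.replace(phrase, code)
        return text

    def decode(text):
        for phrase, code in CODES.items():
            text = text.replace(code, phrase)
        return text

    msg = "thank you for your email, let me know if this works"
    assert decode(encode(msg)) == msg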

But it got me thinking: what's the highest data compression we can get on text currently?

u/bldrlife1 — 10 days ago

imagine that, or an algorithm that uses a mix of many algorithms to get the best result through automatic trial and error.

or something like an ai that picks the optimal settings based on a few tests. i think this idea is worth exploring.
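the trial-and-error part is easy to prototype with just the standard library (a real tool would add more candidates and stream large inputs):

    import bz2, lzma, zlib

    CODECS = [(b"z", zlib.compress), (b"x", lzma.compress), (b"b", bz2.compress)]

    def best_of(data: bytes) -> bytes:
        # brute-force trial and error: run every codec, keep the smallest
        # output, and prepend a one-byte tag so decompression knows the winner
        return min((tag + fn(data) for tag, fn in CODECS), key=len)

    def un_best_of(blob: bytes) -> bytes:
        decoders = {b"z": zlib.decompress, b"x": lzma.decompress, b"b": bz2.decompress}
        return decoders[blob[:1]](blob[1:])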

u/Physical-Owl691 — 14 days ago

[Seeking Review] SPX: A Lossless Image Codec using RCT + MED + Sharding + rANS

Hi all,

I've spent the last few months developing a lossless image compressor called SPX, aiming to balance compression density and encoding speed: that is, a compression ratio better than .webp (m6) but below .jxl (e7), with significantly faster encoding.

I did some testing; encoding speed seems consistent across most datasets, but the compression savings vary more.

https://preview.redd.it/kg6xfriluuzg1.png?width=1231&format=png&auto=webp&s=0a2f2a5e4de4c1df3059f6f24536cad59bcf9d92

I think I've hit my limit as a self-taught amateur developer who knows a little Python. I can't come up with any new ideas to improve it, so Gemini suggested coming here for professional advice.

It's an Apache 2.0 open source project. Any suggestions on how to improve the compression rate without losing too much speed are highly appreciated! Thank you!

GitHub: https://github.com/nonkilife/SPX-Image-Lossless-Compression

Quick Start: pip install spx-codec

==

// The Architecture:

SPX isn't a fundamental breakthrough, but a streamlined 4-part pipeline designed for modern CPU throughput:

  1. RCT: Reversible Color Transform (Green-sub).
  2. MED: Branchless Median Edge Detector (stages 1-2 are sketched after this list).
  3. Stateless Sharding: Pixels are allocated into 42 shards based on local gradient (v), luminance (i), and direction (t); these three parameters can be tuned to suit different image types for better performance.
  4. Entropy Coding: Rust-based 4-way Interleaved rANS.
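Roughly, stages 1-2 boil down to the following (a minimal scalar sketch of the math; the real code is branchless and vectorized):

    def rct_green_sub(r, g, b):
        # reversible colour transform: keep G, code R and B as offsets from G
        # (mod-256 arithmetic keeps it exactly invertible once G is decoded)
        return (r - g) & 0xFF, g, (b - g) & 0xFF

    def med_predict(left, up, up_left):
        # Median Edge Detector (the LOCO-I / JPEG-LS predictor): clamp to
        # min/max at a detected edge, planar extrapolation otherwise
        if up_left >= max(left, up):
            return min(left, up)
        if up_left <= min(left, up):
            return max(left, up)
        return left + up - up_left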

// Customization & Extensibility:

  • Dynamic Sharding: The (i, v, t) boundaries for pixel classification are not hard-coded; they can easily be re-tuned for specialized image distributions (see the sketch after this list).
  • Flexible Entropy Modeling: The rANS probability models are stored in .npz format. This lets users swap or retrain templates for specific datasets without recompiling the core Rust engine.
  • Adaptive Framework: While the current design is a general-purpose default, the architecture is meant to be a "compression sandbox" for domain-specific needs.
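The classify-then-lookup shape looks roughly like this (cut-points and shard count here are made up for illustration; SPX's real (i, v, t) boundaries and 42-shard layout differ):

    import numpy as np

    V_CUTS, I_CUTS = (4, 16, 64), (64, 128, 192)   # hypothetical boundaries

    def shard_id(grad_v, lum_i, direction_t):
        # quantize gradient and luminance, then fold in a binary direction bit
        v = sum(grad_v >= c for c in V_CUTS)
        i = sum(lum_i >= c for c in I_CUTS)
        return (v * (len(I_CUTS) + 1) + i) * 2 + direction_t

    # per-shard frequency tables live in one .npz archive: retuning a model
    # means re-saving arrays, not recompiling the Rust entropy coder
    tables = {f"shard_{k}": np.full(256, 1, dtype=np.uint32) for k in range(32)}
    np.savez("model.npz", **tables)
    freqs = np.load("model.npz")["shard_7"]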

// The Performance (Snapshot on AMD Ryzen 5 3500X):

  • Encoding Speed: ~12 MB/s on Kodak, peaking at 44 MB/s on standard synthetic sets.
  • Compression Ratio: Consistently 25-30% smaller than PNG; sits between WebP (M6) and JXL (E7) most of the time.
  • Validation: Bit-perfect verification (MSE = 0) with an integrated unified benchmark suite.
  • Target Data: Tested on CLIC, DIV2K, Tecnick, ICI, and Kodak (primarily natural photography).
  • Limitation: Validation on synthetic images is currently limited, so consistency in those specific domains remains a known unknown.
  • Comparative Benchmark: https://github.com/nonkilife/SPX-Image-Lossless-Compression/blob/main/technical/BENCHMARK.md

// The Bottleneck:

I've reached a point where manual optimizations (branchless logic, LUT, SIMD-friendly structures) are no longer yielding significant gains.

I've experimented with:

  • Predictors: Swapping MED for GAP or Paeth (MED still wins on speed/ratio balance).
  • Context: Adding UR, UU, LL pixel data to MED (speed dropped sharply; the ratio improvement was negligible).
  • Sharding: Tested >5,000 shard combinations up to ~60 shards via Monte Carlo simulation; the current 42-shard model seems to be the sweet spot for speed. Adaptive sharding based on per-image fingerprints (e.g. H-entropy, AAD, size, R:G:B proportions) was also tested, but the compression gain was minor and the speed loss significant.
  • rANS PDF: High-bit modes proved too overhead-heavy for most shards after analyzing the CLIC 2021 dataset.

While ~90% of the approaches I've tried turned out to be failures, there is still unexplored territory:

  • 8-way Interleaving: I've considered scaling the rANS core to 8-way interleaving. However, initial analysis suggests my current Zen 2 part (3500X) might suffer from cache port contention or register pressure at that level, so I've stuck with 4-way as a stable, high-efficiency baseline (a toy rANS round trip is sketched below for reference).
  • C++ & AVX-512: The current engine is a Python/Rust hybrid. I suspect a pure C++ implementation leveraging AVX-512 could push throughput somewhat higher, but that's currently beyond my skill set.
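For anyone who wants to poke at the entropy stage conceptually, here's a toy single-stream rANS round trip using arbitrary-precision ints instead of renormalization; the Rust core does the streaming, 4-way interleaved version of the same state update:

    def rans_encode(symbols, freq, cum):
        M = sum(freq.values())
        x = 1
        for s in reversed(symbols):   # push in reverse so decoding pops in order
            x = (x // freq[s]) * M + cum[s] + (x % freq[s])
        return x

    def rans_decode(x, n, freq, cum):
        M = sum(freq.values())
        out = []
        for _ in range(n):
            slot = x % M
            s = next(a for a in freq if cum[a] <= slot < cum[a] + freq[a])
            x = freq[s] * (x // M) + slot - cum[s]
            out.append(s)
        return out

    freq = {"a": 3, "b": 1}     # toy model: p(a) = 3/4, p(b) = 1/4
    cum = {"a": 0, "b": 3}
    msg = list("aababaaa")
    assert rans_decode(rans_encode(msg, freq, cum), len(msg), freq, cum) == msg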
u/Nonkilife — 6 days ago

Please don't tell me about the pigeonhole principle, I don't like pigeons, they dirty the street and also the roof of my house

u/-blahem- — 8 days ago

Been playing around with compression, but instead of treating code as raw bytes, I tried modeling it.

Idea is pretty simple: tokenize the Python source, use an n-gram model to predict the probability of the next token, and then feed those probabilities into an arithmetic coder. The more predictable the token, the fewer bits it costs.
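The core of it fits in a few lines. A simplified sketch that skips the coder and just totals the ideal code length, sum of -log2 p(token | context), which an arithmetic coder approaches to within a few bits:

    import collections, io, math, tokenize

    def ngram_cost_bits(source, n=3):
        toks = [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)]
        seen, ctx_seen = collections.Counter(), collections.Counter()
        vocab, bits = set(), 0.0
        for i, tok in enumerate(toks):
            ctx = tuple(toks[max(0, i - n + 1):i])
            # adaptive counts with Laplace smoothing; the +1 in the denominator
            # reserves probability mass for tokens not yet seen in this context
            p = (seen[ctx + (tok,)] + 1) / (ctx_seen[ctx] + len(vocab) + 1)
            bits -= math.log2(p)
            seen[ctx + (tok,)] += 1
            ctx_seen[ctx] += 1
            vocab.add(tok)
        return bits

    src = "def add(a, b):\n    return a + b\n"
    print(ngram_cost_bits(src) / 8, "bytes, ideal")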

Ran it on the Flask repo (~575 KB of .py files):

https://preview.redd.it/shyzoszjewyg1.png?width=1919&format=png&auto=webp&s=5df5959c35046ed615da33a134372918c4071541

  • Neural + AC → 101 KB (82.4% reduction)
  • zlib / ZIP → 151 KB (73.7%)
  • lzma / 7z → 152 KB (73.5%)
  • zstd → 147 KB (74.4%)

So yeah, about 33% smaller output than zlib.

Nothing magical going on: Python code just has a lot of structure at the token level, and n-grams pick up enough of it to make a difference. Arithmetic coding just turns those predictions into actual bits.

The setup is split pretty cleanly: tokenizer + model in Python, and the arithmetic coder is written in Zig (compiled to a shared library) and called via ctypes. Python handles probability generation, Zig handles the actual encoding and bitstream.
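The bridge itself is just ctypes; a sketch of the pattern (library name and symbol are hypothetical, not the repo's actual interface; an arithmetic coder only needs each token's cumulative interval under the model):

    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libac.so")           # hypothetical shared library
    lib.ac_encode.restype = ctypes.c_size_t   # bytes written to `out`
    lib.ac_encode.argtypes = [ctypes.c_void_p, ctypes.c_void_p,
                              ctypes.c_size_t, ctypes.c_void_p]

    lo = np.array([0.00, 0.50, 0.25])   # model's interval low, per token
    hi = np.array([0.50, 1.00, 0.75])   # model's interval high, per token
    out = np.empty(1 << 16, dtype=np.uint8)
    n = lib.ac_encode(lo.ctypes.data, hi.ctypes.data, len(lo), out.ctypes.data)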

The obvious downside: it’s slow. Like really slow. ~75 seconds vs ~0.05s for zlib on the same data (~1600× slower). Most of that is just calling the model once per token with no caching.

Still, kind of interesting to see that even a basic n-gram model can beat general-purpose compressors just by not treating code like noise.

Feels like there’s something here if the prediction side gets better (or faster). Curious if anyone else has tried something similar.

u/Equivalent-Gas2856 — 11 days ago

I built a small lossless preprocessing library called STRATA.

It exposes structural transforms such as:

  • 2D predictors
  • cube rotation
  • radial reordering for 3D voxels
  • YCoCg-R colour transform (sketched after this list)
  • automatic per-input transform selection
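YCoCg-R is the standard lifting formulation; a minimal NumPy round trip (not STRATA's actual code) looks like:

    import numpy as np

    def ycocg_r_forward(r, g, b):
        # YCoCg-R: lifting-based, integer-exact, fully reversible
        co = r - b
        t = b + (co >> 1)
        cg = g - t
        return t + (cg >> 1), co, cg      # y, co, cg

    def ycocg_r_inverse(y, co, cg):
        t = y - (cg >> 1)
        g = cg + t
        b = t - (co >> 1)
        return b + co, g, b               # r, g, b

    r, g, b = np.random.default_rng(0).integers(0, 256, (3, 64, 64))
    assert all((p == q).all() for p, q in
               zip(ycocg_r_inverse(*ycocg_r_forward(r, g, b)), (r, g, b)))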

Repo: https://github.com/rjamesy/strata

The unusual finding is that STRATA works better as a preprocessor for general-purpose codecs than as a standalone codec.

In particular, STRATA-preprocess + zstd-1 can measurably beat raw zstd-22 on both speed and compression ratio for shape-aware data:

  • 27 MB RGB photo + YCoCg-R + zstd-1: 463× faster and 5.4% smaller than raw zstd-22
  • Smooth 2D heightmap + 2D predictor + zstd-1: 4.2× faster and 44.5% smaller
  • 64³ volume + cube rotation + radial + zstd-22: 14.4% smaller at roughly the same speed

The mechanism appears to be simple: zstd at -22 performs expensive long-range string matching, but on smooth or structured raw data there may be few exact repetitions to match. STRATA exposes the redundancy directly, so even zstd-1 can exploit it. Total work decreases.
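A minimal demonstration of that argument, not STRATA's actual code (uses the python-zstandard package and a synthetic smooth heightmap):

    import numpy as np
    import zstandard  # pip install zstandard

    def delta_rows(img):
        # left-neighbour prediction per row; uint8 wraps mod 256, so it's lossless
        out = img.copy()
        out[:, 1:] -= img[:, :-1]
        return out

    y, x = np.mgrid[0:512, 0:512]
    img = ((x + y) // 4).astype(np.uint8)   # smooth synthetic "heightmap"

    fast = zstandard.ZstdCompressor(level=1).compress(delta_rows(img).tobytes())
    slow = zstandard.ZstdCompressor(level=22).compress(img.tobytes())
    print(len(fast), len(slow))   # the residual stream is usually far smaller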

The results are reproducible: bench/preprocess_demo.py writes a CSV covering all tested combinations.

Caveats:

  • STRATA does not beat WebP-lossless on natural-photo RGB, though it narrows the gap from about 50% to 13%.
  • It ties bzip2 on plain text.
  • The project is MIT-licensed.
u/Fantastic_Scratch767 — 13 days ago

Built a tool to stop paying twice for the same LLM tokens

Six months of heavy API usage and my bills felt higher than they should be. Finally sat down and traced exactly where the tokens were going.

Turned out most of it was repetition. Every API call resends the full context window, the whole conversation history, the system prompt, all of it. The context resets each call. You're paying for the same information over and over, every single request.
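Back-of-envelope arithmetic on why that adds up (token counts below are made up for illustration):

    # with full-context resends, total input tokens grow roughly
    # quadratically with conversation length
    system, per_turn, turns = 800, 300, 40
    resent = sum(system + per_turn * t for t in range(1, turns + 1))
    once = system + per_turn * turns   # if every token were paid for once
    print(resent, once)                # 278000 vs 12800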

Built ContextPilot to fix it. It sits between your code and the API and compresses context before each call.

Saving around 60% on API costs at my usage level. MIT licensed, no account needed, works with OpenAI and Anthropic.

Still early, v0.2.2 on PyPI. Would genuinely appreciate feedback from anyone who gives it a try, especially on edge cases or integrations I haven't thought about.

github.com/msousa202/ContextPilot

contextpilot.org
u/Ok_Alternative_3007 — 4 days ago