
The problem: If you run long-context inference locally, your GPU's KV cache fills up and evicts blocks. The next request with the same prompt prefix has to recompute everything from scratch. On a 30k-token document, that's 10+ seconds of prefill — every single time.
What I built: tierKV intercepts evicted KV blocks, quantizes them with a Rust INT8 compressor (3.9× smaller), and ships them over gRPC to a vault running on another machine on my LAN. When the same prefix appears again, it fetches the blocks back and injects them directly into vLLM's paged KV buffer — no attention recomputation at all.
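For intuition, symmetric per-block INT8 quantization looks roughly like the numpy sketch below. This is illustrative only: tierKV's actual compressor is Rust, and the exact scheme behind the 3.9× figure is in the repo.

import numpy as np

def quantize_block(kv: np.ndarray):
    """Symmetric INT8 quantization: one float scale per KV block."""
    scale = float(np.abs(kv).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale round-trips to zeros
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float, dtype=np.float16):
    """Reverse the quantization before injecting back into the KV buffer."""
    return (q.astype(np.float32) * scale).astype(dtype)

block = np.random.randn(16, 8, 128).astype(np.float16)  # toy KV block
q, s = quantize_block(block)
print(np.abs(dequantize_block(q, s) - block).max())  # small round-trip error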
vLLM numbers on a real 30,561-token document (Apple 10-K):
- Cold prefill: 10.75s
- GPU cache hit: 1.19s
- Cold vault restore: 0.52s — faster than the GPU cache hit, because vault restore skips attention entirely
On EXO with an 8k-token prompt: 30.83s cold → 4.11s restored (7.5×).
The speedup grows with context length: prefill is O(n²) in sequence length, while restore is O(n) plus network transfer time. At 128k tokens, the gap is over a minute per request.
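Rough arithmetic behind that claim, extrapolated from the 30k measurements above (assumes purely quadratic prefill and purely linear restore scaling, so treat it as an estimate):

n0, prefill0, restore0 = 30_561, 10.75, 0.52  # measured above
n1 = 128_000
prefill1 = prefill0 * (n1 / n0) ** 2  # ~189 s
restore1 = restore0 * (n1 / n0)       # ~2.2 s, plus block transfer time
print(f"estimated gap: ~{prefill1 - restore1:.0f}s per request")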
My cluster:
- DGX Spark (96GB unified memory) — runs the model
- Mac Pro (32GB RAM) — runs the KV vault
- Mac Air (16GB RAM) — runs the SSM/linear-attention vault (for Qwen3.6-35B-A3B, which mixes attention + Mamba layers)
- 5GbE LAN, ~0.5ms RTT
Setup is just:
pip install tierkv
# configure role in tierkv.toml on each machine
tierkv vault   # on the cold-tier (vault) machines
# on the GPU machine, launch vLLM or EXO as normal
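The per-machine config might look something like the sketch below. Every key name here is hypothetical, just to show the shape of the setup; the real schema is in the repo docs.

# tierkv.toml on the Mac Pro (illustrative keys only, not the real schema)
[vault]
role = "attention-vault"   # hypothetical: which tier this machine serves
listen = "0.0.0.0:50051"   # hypothetical: gRPC endpoint the GPU box dials
max_memory_gb = 24         # hypothetical: cap on in-memory KV storage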
Works with vLLM (via a KVConnectorBase_V1 plugin, no source changes) and with EXO (via a post-install patch).
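For the curious, a v1 KV connector is a single class that vLLM drives from both the scheduler and the worker. Below is a skeleton of the main hooks; the method names follow my reading of vLLM's KVConnectorBase_V1 interface and may differ between vLLM versions, and this is a sketch, not tierKV's actual plugin.

# Sketch of a KVConnectorBase_V1 plugin; the import path and signatures
# vary across vLLM versions, and several required methods are omitted.
from vllm.distributed.kv_transfer.kv_connector.v1 import KVConnectorBase_V1

class TierKVConnector(KVConnectorBase_V1):  # illustrative name
    # Scheduler side: report how many prompt tokens the vault can serve,
    # so vLLM marks that prefix as already computed and skips prefill.
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        ...

    # Worker side, before the forward pass: fetch quantized blocks over
    # gRPC, dequantize, and write them into the paged KV buffer.
    def start_load_kv(self, forward_context, **kwargs):
        ...

    def wait_for_layer_load(self, layer_name):
        ...

    # Worker side: stream evicted/saved blocks out to the vault.
    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs):
        ...

    def wait_for_save(self):
        ...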
Honest limitations:
- Only helps when the same prefix repeats — single-shot prompts get nothing
- LAN only — WiFi/WAN latency kills the benefit
- No tensor parallelism support yet
- Vault is in-memory; data lost on restart
Full writeup: https://prasannakanagasabai126786.substack.com/p/your-llm-is-doing-math-it-already
Code: https://github.com/tierkv/tierkv
Happy to answer questions about the architecture or the vLLM/EXO integration.