
The problem: If you run long-context inference locally, your GPU's KV cache fills up and evicts blocks. The next request with the same prompt prefix has to recompute everything from scratch. On a 30k-token document, that's 10+ seconds of prefill — every single time.
What I built: tierKV intercepts evicted KV blocks, quantizes them with a Rust INT8 compressor (3.9× smaller), and ships them over gRPC to a vault running on another machine on my LAN. When the same prefix appears again, it fetches the blocks back and injects them directly into vLLM's paged KV buffer — no attention recomputation at all.
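For intuition, symmetric per-block INT8 quantization looks roughly like the numpy sketch below. This is illustrative only: tierKV's actual compressor is Rust, and the exact scheme behind the 3.9× figure is in the repo.

import numpy as np

def quantize_block(kv: np.ndarray):
    """Symmetric INT8 quantization: one float scale per KV block."""
    scale = float(np.abs(kv).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale round-trips to zeros
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float, dtype=np.float16):
    """Reverse the quantization before injecting back into the KV buffer."""
    return (q.astype(np.float32) * scale).astype(dtype)

block = np.random.randn(16, 8, 128).astype(np.float16)  # toy KV block
q, s = quantize_block(block)
print(np.abs(dequantize_block(q, s) - block).max())  # small round-trip error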
vLLM numbers on a real 30,561-token document (Apple 10-K):
- Cold prefill: 10.75s
- GPU cache hit: 1.19s
- Cold vault restore: 0.52s — faster than the GPU cache hit, because vault restore skips attention entirely
On EXO with an 8k-token prompt: 30.83s cold → 4.11s restored (7.5×).
The speedup grows with context length: prefill is O(n²) in sequence length, while restore is O(n) plus network transfer time. At 128k tokens, the gap is over a minute per request.
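Rough arithmetic behind that claim, extrapolated from the 30k measurements above (assumes purely quadratic prefill and purely linear restore scaling, so treat it as an estimate):

n0, prefill0, restore0 = 30_561, 10.75, 0.52  # measured above
n1 = 128_000
prefill1 = prefill0 * (n1 / n0) ** 2  # ~189 s
restore1 = restore0 * (n1 / n0)       # ~2.2 s, plus block transfer time
print(f"estimated gap: ~{prefill1 - restore1:.0f}s per request")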
My cluster:
- DGX Spark (96GB unified memory) — runs the model
- Mac Pro (32GB RAM) — runs the KV vault
- Mac Air (16GB RAM) — runs the SSM/linear-attention vault (for Qwen3.6-35B-A3B, which mixes attention + Mamba layers)
- 5GbE LAN, ~0.5ms RTT
Setup is just:
pip install tierkv
# configure role in tierkv.toml on each machine
tierkv vault   # on the cold-tier (vault) machines
# on the GPU machine, launch vLLM or EXO as normal
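The per-machine config might look something like the sketch below. Every key name here is hypothetical, just to show the shape of the setup; the real schema is in the repo docs.

# tierkv.toml on the Mac Pro (illustrative keys only, not the real schema)
[vault]
role = "attention-vault"   # hypothetical: which tier this machine serves
listen = "0.0.0.0:50051"   # hypothetical: gRPC endpoint the GPU box dials
max_memory_gb = 24         # hypothetical: cap on in-memory KV storage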
Works with vLLM (via a KVConnectorBase_V1 plugin, no source changes) and with EXO (via a post-install patch).
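For the curious, a v1 KV connector is a single class that vLLM drives from both the scheduler and the worker. Below is a skeleton of the main hooks; the method names follow my reading of vLLM's KVConnectorBase_V1 interface and may differ between vLLM versions, and this is a sketch, not tierKV's actual plugin.

# Sketch of a KVConnectorBase_V1 plugin; the import path and signatures
# vary across vLLM versions, and several required methods are omitted.
from vllm.distributed.kv_transfer.kv_connector.v1 import KVConnectorBase_V1

class TierKVConnector(KVConnectorBase_V1):  # illustrative name
    # Scheduler side: report how many prompt tokens the vault can serve,
    # so vLLM marks that prefix as already computed and skips prefill.
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        ...

    # Worker side, before the forward pass: fetch quantized blocks over
    # gRPC, dequantize, and write them into the paged KV buffer.
    def start_load_kv(self, forward_context, **kwargs):
        ...

    def wait_for_layer_load(self, layer_name):
        ...

    # Worker side: stream evicted/saved blocks out to the vault.
    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs):
        ...

    def wait_for_save(self):
        ...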
Honest limitations:
- Only helps when the same prefix repeats — single-shot prompts get nothing
- LAN only — WiFi/WAN latency kills the benefit
- No tensor parallelism support yet
- Vault is in-memory; data lost on restart
Full writeup: https://prasannakanagasabai126786.substack.com/p/your-llm-is-doing-math-it-already
Code: https://github.com/tierkv/tierkv
Happy to answer questions about the architecture or the vLLM/EXO integration.