
I wrote a paper on HoloKV: Using CDMA Phase-Shifting to achieve O(N/k) KV-Cache Compression. Looking for Triton/CUDA collaborators.
Hey everyone,
I’m a 22-year-old independent researcher, and I’ve been trying to tackle the "Memory Wall" for long-context LLMs. Standard methods either quantize the cache (which hits a hard precision floor) or evict tokens (which degrades long-range reasoning).
I just published an open research draft for a different geometric approach called HoloKV.
The concept: Instead of appending new memory slots, HoloKV multiplexes (stacks) k tokens into a single physical memory slot. It uses deterministic +1/-1 orthogonal phase keys (inspired by CDMA telecommunications) to separate the signals.
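To make the multiplexing idea concrete, here is a minimal numpy sketch of how I understand the superpose/extract step (variable names are my own, not from the repo). Note the cross-terms don't vanish; they survive as zero-mean noise, which is exactly what the denoising step below targets:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4                      # head dim, tokens per physical slot

# Deterministic +1/-1 phase keys, one per multiplexed token
keys = rng.choice([-1.0, 1.0], size=(k, d))
tokens = rng.standard_normal((k, d))

# Multiplex: stack k tokens into ONE d-dim slot,
# with a sqrt(k) variance normalization
slot = (keys * tokens).sum(axis=0) / np.sqrt(k)

# Extract token j by re-applying its phase key (undo the normalization)
j = 2
extracted = keys[j] * slot * np.sqrt(k)

# Algebraic identity: extraction = target token + crosstalk noise,
# because keys[j] * keys[j] == 1 elementwise
crosstalk = sum(keys[j] * keys[i] * tokens[i] for i in range(k) if i != j)
assert np.allclose(extracted, tokens[j] + crosstalk)
```

The slot stays d-dimensional no matter how many tokens are stacked, which is where the O(N/k) memory figure comes from.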
To make it work natively with modern architectures, I introduced:
- Variance Normalization: A sqrt(k) scaling penalty that keeps the variance of the superimposed slot at single-token levels, preventing the Softmax entropy collapse that raw superposition would cause.
- Strict Even-Boundary Rule: A constraint on phase-key generation that preserves RoPE's 2D rotary structure exactly, so the keys commute with the rotary transform used by Llama/Qwen.
- LoRA Denoising: Training Query/Value LoRA adapters via knowledge distillation so the model natively filters out the Gaussian crosstalk noise left over from superposition.
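If I'm stating the even-boundary rule correctly, it means a phase key must hold a constant sign across both halves of each 2D rotary pair; under that constraint the key commutes with the RoPE rotation exactly. A minimal check (my own reconstruction, not code from the repo):

```python
import numpy as np

def rope(x, theta):
    """Rotate each (even, odd) pair of x by the corresponding angle in theta."""
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(theta) - x[1::2] * np.sin(theta)
    out[1::2] = x[0::2] * np.sin(theta) + x[1::2] * np.cos(theta)
    return out

rng = np.random.default_rng(1)
d = 8
x = rng.standard_normal(d)
theta = rng.standard_normal(d // 2)

# Even-boundary key: the +1/-1 sign is constant across each 2D pair
pair_signs = rng.choice([-1.0, 1.0], size=d // 2)
key = np.repeat(pair_signs, 2)

# Rotate-then-key equals key-then-rotate: the key survives RoPE intact
assert np.allclose(rope(key * x, theta), key * rope(x, theta))

# A key that flips sign INSIDE a pair breaks the commutativity
bad_key = np.ones(d)
bad_key[1] = -1.0
assert not np.allclose(rope(bad_key * x, theta), bad_key * rope(x, theta))
```

The reason it works: scaling both components of a 2D pair by the same scalar c gives R(θ)(c·v) = c·R(θ)v, whereas a sign flip on only one component does not commute with rotation.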
The Ask:
I have built a mathematical simulator in PyTorch that verifies the orthogonal extraction and RoPE preservation. However, I am a solo dev working on a GTX 1650. To actually realize the 75%+ physical VRAM savings, the method needs a custom SRAM Active Accumulation Buffer written in OpenAI Triton or CUDA, to avoid the "Read-Modify-Write" penalty of repeatedly updating the slot in global memory.
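To make the kernel ask concrete: a naive implementation reads the slot from HBM, adds the phase-modulated token, and writes it back once per token, while the buffer I have in mind accumulates on-chip and writes each slot once. Both must produce identical slots; here is that equivalence condition in numpy (a sketch of the semantics, obviously not the kernel itself):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 64, 4
keys = rng.choice([-1.0, 1.0], size=(k, d))
tokens = rng.standard_normal((k, d))

# Naive: one read-modify-write of the (global-memory) slot per token
slot_naive = np.zeros(d)
for i in range(k):
    slot_naive += keys[i] * tokens[i]     # would be k HBM reads + k writes
slot_naive /= np.sqrt(k)

# Accumulation buffer: sum in fast on-chip storage, flush once
acc = np.zeros(d)                         # would live in SRAM/registers
for i in range(k):
    acc += keys[i] * tokens[i]
slot_fused = acc / np.sqrt(k)             # single HBM write per slot

# The fused path must be bit-for-bit a drop-in replacement
assert np.allclose(slot_naive, slot_fused)
```

The arithmetic is identical; the entire win is memory traffic, which is why this has to live inside a Triton/CUDA kernel rather than in PyTorch ops.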
I am open-sourcing the math and the paper. If there are any Triton/FlashAttention kernel engineers here who want to collaborate and help me build the hardware kernel, please reach out or open a PR!
**Paper & Code:** https://github.com/0sami0/HoloKV