Brainstacks, a New Fine-Tuning Paradigm
I just published my first research paper - and I think we've been misunderstanding what fine-tuning actually does.
"Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning"
I built an architecture that adds unlimited domain expertise to any LLM - one domain at a time - with near-zero forgetting. Null-space projection constrains each new domain to subspaces orthogonal to previous ones, enforced by linear algebra, not regularization. A meta-router selectively gates which stacks fire at inference. Frozen weights can't change. Irrelevant stacks can't interfere. Two mechanisms, one anti-forgetting system. 😎
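For intuition, here's a minimal PyTorch sketch of the null-space projection idea - build a projector from activations collected on earlier domains, then constrain each new gradient step to directions those domains don't use. This is an illustration of the general technique, not the paper's exact procedure; the feature-collection step, threshold, and names are assumptions.

```python
import torch

def null_space_projector(prev_features: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Projector onto the (approximate) null space of activations collected
    from previously trained domains. `prev_features` is (n_samples, d);
    how those features are collected is an assumption of this sketch."""
    cov = prev_features.T @ prev_features / prev_features.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)
    # Directions with near-zero variance under old domains span the null space.
    null_basis = eigvecs[:, eigvals < eps]          # (d, k)
    return null_basis @ null_basis.T                # projector P, (d, d)

def project_gradient(grad: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Restrict a weight gradient (out_dim, in_dim) to the null space of the
    old domains' inputs, so the update cannot disturb what they rely on."""
    return grad @ P
```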
But the architecture isn't the headline. What it revealed is.
I trained domain stacks sequentially - chat, code, math, medical, reasoning - then built a meta-router that ignores domain labels entirely. It tests every combination of stacks and picks whichever produces the lowest loss. Pure empirical measurement.
It found that medical prompts route to chat+math stacks 97% of the time. Not the medical stack. Chat and math - trained on zero medical data - cut medical loss by 50-70%.
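Concretely, the meta-router's search is just this loop - a sketch, not the repository's API; `eval_loss` and the stack handles are hypothetical:

```python
from itertools import combinations

def route_by_loss(model, prompt_ids, stacks, eval_loss):
    """Outcome-based meta-routing sketch: try every non-empty combination of
    frozen domain stacks and keep whichever gives the lowest LM loss on the
    prompt. `eval_loss(model, prompt_ids, active)` is a hypothetical helper
    that runs a forward pass with only the `active` stacks enabled."""
    best_combo, best_loss = None, float("inf")
    for r in range(1, len(stacks) + 1):
        for combo in combinations(stacks, r):
            loss = eval_loss(model, prompt_ids, active=combo)
            if loss < best_loss:
                best_combo, best_loss = combo, loss
    return best_combo, best_loss

# With 5 stacks this evaluates all 31 non-empty combinations.
```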
Domain adapters don't store domain knowledge. They store cognitive primitives - instruction-following, numerical reasoning, procedural logic, chain-of-thought structure - that transfer across every domain boundary.
I pushed further. A model pretrained exclusively on children's stories - zero Python in training data - produced def with indented blocks and colon-terminated statements when the code stack activated. In children's-story words. It learned the structure of code without ever seeing code.
Fine-tuning injects composable capabilities, not knowledge!
The architecture is novel on several fronts - MoE-LoRA with Shazeer noisy routing across all 7 transformer projections (no prior work does this), rsLoRA combined with MoE-LoRA (a first in the literature), residual boosting through frozen stacked adapters, null-space gradient projection, and an outcome-based sigmoid meta-router. It also uses two-level routing - token-level MoE inside each stack, prompt-level meta-routing across stacks - with no published precedent.
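To make the routing pieces concrete, here is a self-contained sketch of one MoE-LoRA projection with Shazeer-style noisy top-k gating and rsLoRA scaling (alpha/√r). It is illustrative only - expert counts, ranks, and initialization are assumptions, not the released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyMoELoRALinear(nn.Module):
    """Illustrative MoE-LoRA layer: a frozen base projection plus several LoRA
    experts, mixed by Shazeer-style noisy top-k gating and scaled with the
    rsLoRA factor alpha / sqrt(r). A sketch of the named ideas, not the
    repository's code; all defaults here are assumptions."""

    def __init__(self, base: nn.Linear, n_experts=4, r=8, alpha=16, top_k=2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # frozen base weights
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, r, d_out))
        self.gate = nn.Linear(d_in, n_experts)
        self.noise = nn.Linear(d_in, n_experts)
        self.scale = alpha / math.sqrt(r)            # rsLoRA scaling
        self.top_k = top_k

    def forward(self, x):                            # x: (..., d_in)
        logits = self.gate(x)
        if self.training:                            # noisy routing, train only
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        topv, topi = logits.topk(self.top_k, dim=-1)
        weights = torch.zeros_like(logits).scatter(-1, topi, topv.softmax(-1))
        # Per-expert low-rank updates, combined by the gate weights.
        lora = torch.einsum("...i,eir,ero->...eo", x, self.A, self.B)
        delta = (weights.unsqueeze(-1) * lora).sum(dim=-2)
        return self.base(x) + self.scale * delta
```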
The system runs in constant GPU memory no matter how many domains exist, because each deployment loads only the stacks it needs. A hospital loads medical stacks. A law firm loads legal stacks. Same base model. We call it the Superposition LLM. 🤖
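A deployment sketch of that constant-memory property, with illustrative file names and per-site stack lists (not the repository's layout):

```python
import torch

# Per-site stack lists and file layout are illustrative assumptions.
SITE_STACKS = {
    "hospital": ["chat", "math", "medical"],
    "law_firm": ["chat", "reasoning"],
}

def load_stacks(site: str, stack_dir: str = "stacks") -> dict:
    """Load only the adapter checkpoints this deployment actually uses;
    GPU cost is the base model plus these few stacks, regardless of how
    many domains were ever trained."""
    return {
        name: torch.load(f"{stack_dir}/{name}.pt", map_location="cpu")
        for name in SITE_STACKS[site]
    }
```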
Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks). 2.5× faster convergence than a single LoRA. Residual boosting breaks through the single-adapter ceiling.
5 cognitive primitives. 31 combinations. Linear investment, exponential coverage.
And this is just the foundation of a new era in understanding LLM capabilities. 👽
Code: https://github.com/achelousace/brainstacks
Paper: https://arxiv.org/abs/2604.01152
Mohammad R. Abu Ayyash
Brains Build Research
Ramallah, Palestine.