u/Hot_Loquat_3222

[Project] I built a 10-Layer Mixture-of-Experts architecture from absolute zero that mathematically rejects standard backprop and rewrites its own failing weights during runtime.


Hey everyone,

I’ve spent the last few months engineering a custom deep learning architecture called **MACRO-DREADNOUGHT**.

Most standard networks are passive: they push data forward blindly and rely on averaged gradient signals from backpropagation to fix mistakes. They suffer from mode collapse, feature forgetting across depth (what I call "convolutional amnesia"), and rigid geometric blind spots. I wanted to build an engine that actively destroys those bottlenecks.

Here are the core mechanics of the engine:

* **The SpLR_V2 Activation Function:** I designed a custom, non-monotonic activation function (`f(x) = a * x * e^(-k x^2) + c * x`). It calculates its own Shannon Entropy per forward pass, actively widening or choking its gradient based on the network's real-time confidence.

* **The 3-Lane MoE Router (Gated Synergy):** To prevent "Symmetry Breaking Collapse," where one expert hogs all the data, I built a 70/30 Elastic Router. It blends the learned gate with a 30% uniform distribution, guaranteeing that "underdog" specialist heads never starve and are always kept on life support.

* **The DNA Mutation Engine:** It doesn't rely on the Adam optimizer alone. Every few epochs, the network audits its own routing behavior. If a routing head is arrogant (high monopoly) but failing (high entropy), the engine scrubs the failing weights and rewrites the layer's DNA using a "Hit-List" of the exact images, cached in VRAM, that defeated it.

* **Temporal Memory Spine:** It cures Convolutional Amnesia by using an Asymmetrical Forensic Bus to recycle rejected features into the global-context heads of deeper layers.
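Since the repo itself isn't quoted here, a minimal NumPy sketch of the SpLR_V2 idea follows; the function names, the histogram-based entropy estimate, and the way confidence rescales the slope parameter `a` are my assumptions, not the repo's exact implementation.

```python
import numpy as np

def splr_v2(x, a=1.0, k=0.5, c=0.1):
    """SpLR_V2: non-monotonic activation f(x) = a*x*exp(-k*x^2) + c*x.
    The Gaussian-damped term peaks at moderate |x| and decays for large
    inputs; the c*x leak keeps a small gradient everywhere."""
    return a * x * np.exp(-k * x**2) + c * x

def activation_entropy(acts, n_bins=32):
    """Shannon entropy (nats) of a batch of activations, estimated from
    a histogram -- a rough proxy for the layer's confidence."""
    hist, _ = np.histogram(acts, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def confidence_scaled_a(acts, a_min=0.5, a_max=2.0, n_bins=32):
    """One way to 'widen or choke' the response: scale slope a by how
    close the entropy sits to its maximum (log n_bins). High entropy
    (low confidence) widens; low entropy chokes."""
    t = activation_entropy(acts, n_bins) / np.log(n_bins)
    return a_min + (a_max - a_min) * t
```

In a training loop you would recompute `confidence_scaled_a` per forward pass and feed the result back into `splr_v2` as its `a` parameter.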
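The 70/30 elastic blend in the router bullet can be sketched in a few lines; `elastic_route` and `learned_frac` are illustrative names I've chosen, not identifiers from the repo.

```python
import numpy as np

def elastic_route(gate_logits, learned_frac=0.7):
    """70/30 elastic routing: mix the learned softmax gate with a
    uniform floor so every expert keeps a guaranteed traffic share."""
    z = gate_logits - gate_logits.max(axis=-1, keepdims=True)
    learned = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    n_experts = gate_logits.shape[-1]
    # Even if the learned gate collapses onto one expert, each expert
    # still receives at least (1 - learned_frac) / n_experts of the mass.
    return learned_frac * learned + (1.0 - learned_frac) / n_experts
```

This is the standard cure for gate collapse: the uniform floor keeps gradient flowing to "underdog" experts so they can recover specialization later.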
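The scrub-and-rewrite rule from the DNA Mutation Engine bullet might look like the following; the thresholds, the Xavier-style re-initialization, and the function name are assumptions for illustration only (the actual "Hit-List" replay of cached images is not shown).

```python
import numpy as np

def maybe_mutate(weights, usage_share, error_entropy,
                 monopoly_thresh=0.6, entropy_thresh=1.5, rng=None):
    """If an expert monopolizes the router (high usage_share) yet keeps
    failing (high error_entropy), scrub its weights and re-initialize;
    otherwise return them untouched. Thresholds are illustrative."""
    if usage_share > monopoly_thresh and error_entropy > entropy_thresh:
        rng = rng or np.random.default_rng()
        fan_in = weights.shape[0]
        # Fresh Xavier-style draw; the Hit-List of hard examples would
        # then be oversampled in the expert's next training window.
        return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=weights.shape)
    return weights
```

The key design point is the conjunction: high usage alone (a genuinely dominant expert) or high entropy alone (a rarely-used expert still learning) should not trigger a rewrite.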
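Finally, the Temporal Memory Spine's "recycle rejected features" idea reads like an additive skip connection gated by router scores; this sketch, including the name `recycle_rejected` and the mean-pooled summary, is my guess at the mechanism, not the repo's Forensic Bus code.

```python
import numpy as np

def recycle_rejected(features, gate_max, deep_context, reject_thresh=0.2):
    """Forward features whose best gate score fell below reject_thresh
    into a deeper layer's global-context input (additive skip).

    features:     (n, d) shallow-layer feature vectors
    gate_max:     (n,)   each feature's best routing score
    deep_context: (d,)   input to a deeper global-context head
    """
    mask = gate_max < reject_thresh
    if not mask.any():
        return deep_context
    recycled = features[mask].mean(axis=0)  # pooled summary of rejects
    return deep_context + recycled
```

The asymmetry in the bullet presumably refers to this bus running only shallow-to-deep: rejected low-level features get a second hearing in global context, but deep features are never pushed back down.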

**The Benchmarks:**

I just verified the live-fire run on Kaggle. Under strict independent compute constraints (a single Tesla T4 GPU, 50 epochs) on Tiny ImageNet (200 classes), the architecture trains stably and shows aggressive early-stage convergence.

I have open-sourced the complete math, the domain-segregation logic, and the Kaggle live-fire runs.

📖 **The Master Blueprint & Code:** https://github.com/MohammadALBiltaji/MACRO-DREADNOUGHT

I would love to hear any thoughts from the community on dynamic routing, custom activation design, or the pioneer protocol logic. Let me know if you have any questions about the math!
