Modular MoE restructures how expert weights are stored, routed, and updated. Instead of keeping every expert resident across 5–8 GPUs, we extract a frozen shared core, compress the per-expert residuals by 8–16× using hierarchical shared-core extraction combined with S2LC (Shared Spectral Low-Rank Compression), and load only the active domain module on demand. The result: 1–2 GPUs per instance, sub-millisecond domain switching, and the ability to add or roll back capabilities without retraining.
u/EntertainmentWarm117
White paper: https://zenodo.org/records/19981611
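To put the footprint claim above in rough numbers, here is a back-of-the-envelope sketch in Python. The layer count, widths, expert count, and compression ratio are illustrative assumptions rather than figures from the white paper, and the estimate ignores attention and embedding weights.

```python
# Rough memory arithmetic for a hypothetical MoE; every size below is an
# illustrative assumption, not a figure from the white paper.
BYTES_PER_PARAM = 2            # fp16 / bf16
N_LAYERS = 48
D_MODEL, D_FF = 6144, 16384    # assumed hidden / feed-forward widths
N_EXPERTS = 16                 # assumed experts per MoE layer
COMPRESSION = 12               # assumed S2LC ratio inside the quoted 8-16x band

expert_params = 2 * D_MODEL * D_FF            # up- and down-projection per expert
all_experts = N_LAYERS * N_EXPERTS * expert_params * BYTES_PER_PARAM

# Modular layout: one frozen shared core per layer plus a single compressed
# domain module resident at a time (attention / embedding weights ignored).
shared_core = N_LAYERS * expert_params * BYTES_PER_PARAM
one_module = shared_core / COMPRESSION

gib = 1024 ** 3
print(f"all experts resident  : {all_experts / gib:6.1f} GiB")
print(f"core + 1 domain module: {(shared_core + one_module) / gib:6.1f} GiB")
```

Under these assumed sizes the fully resident experts need several 80 GB GPUs, while the frozen core plus one compressed domain module fits comfortably on one.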
We present S2LC (Shared Spectral Low-Rank Compression), a structured block-sparse compression method for neural network adapters, and the Parameter-Centric Architecture (PCA), a systems framework that treats trained parameter networks as primary execution engines and natural language specifications as directly executable programs.
S2LC compresses domain-specific adapter modules—and decomposed Mixture-of-Experts (MoE) residuals—via spectral energy thresholding at the block level, shared subspace projection, and hardware-aware sparse quantization. When integrated with Hierarchical Expert Decomposition (HED), S2LC achieves compression ratios of 8–16× on expert residuals (and up to 64× with precision-tiered distillation): HED first extracts shared spectral components at the global, cluster, and subcluster levels, which concentrates the remaining residual energy into low-rank, block-sparse manifolds.
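A minimal NumPy sketch of the two-stage idea as described: first subtract a shared component from each expert's weights (a plain mean stands in for the paper's global/cluster/subcluster hierarchy), then compress each residual by keeping only the singular directions needed to reach a spectral-energy threshold. The shapes, the 95% threshold, the toy expert construction, and the helper name are illustrative assumptions.

```python
import numpy as np

def spectral_compress(residual, energy_threshold=0.95):
    """Keep the smallest rank whose singular values capture the requested
    fraction of spectral energy (sum of squared singular values)."""
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(energy, energy_threshold)) + 1
    return U[:, :rank], s[:rank], Vt[:rank, :]

rng = np.random.default_rng(0)
d_out, d_in, n_experts, r_true = 512, 1024, 8, 8

# Toy experts: one dominant shared component plus a genuinely low-rank,
# expert-specific delta and a little noise (constructed this way so the
# residuals are compressible, as the method assumes of real experts).
shared = rng.standard_normal((d_out, d_in))
experts = [
    shared
    + 0.1 * rng.standard_normal((d_out, r_true)) @ rng.standard_normal((r_true, d_in))
    + 0.01 * rng.standard_normal((d_out, d_in))
    for _ in range(n_experts)
]

# Stage 1: shared-core extraction (mean as a stand-in for the HED hierarchy).
core = np.mean(experts, axis=0)
residuals = [W - core for W in experts]

# Stage 2: spectral-energy-thresholded low-rank compression of each residual.
dense_params = n_experts * d_out * d_in
compressed_params = sum(
    U.size + s.size + Vt.size
    for U, s, Vt in (spectral_compress(R) for R in residuals)
)
print(f"residual compression ratio: {dense_params / compressed_params:.1f}x")
```

The real method adds block-level thresholding, shared subspace projection, and sparse quantization on top of this; the sketch only shows why removing the shared component makes the residuals cheap to represent.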
The resulting compressed artifacts are managed by an Adapter Store infrastructure featuring content-addressed deduplication (Global Block Dictionary), semantic versioning, delta-delta updates, and cryptographic provenance. A Context Router performs semantic pre-pass classification of natural language inputs, enabling just-in-time (JIT) adapter decompression and weight merging with a frozen base model without retraining. We further introduce Expert Morphing, a continuous interpolation mechanism in the shared spectral subspace that synthesizes hybrid experts from coefficient tensors without materializing full weight matrices, reducing active memory residency by over two orders of magnitude compared to soft MoE approaches.
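A sketch of how Expert Morphing could work as described: each domain module lives as a small coefficient tensor in a shared spectral basis, a hybrid is formed by interpolating those coefficients, and the result is applied to activations in factored form so the full delta matrix is never materialized. The basis construction, the module names (C_law, C_medicine), and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, batch = 1024, 512, 32, 4

# Frozen base weight plus a shared spectral basis (U, V) that all domain
# modules are assumed to be expressed in; each module is just a small
# coefficient tensor C_e with delta_W_e ~= U @ C_e @ V.T.
W_base = 0.02 * rng.standard_normal((d_out, d_in))
U = np.linalg.qr(rng.standard_normal((d_out, r)))[0]
V = np.linalg.qr(rng.standard_normal((d_in, r)))[0]
C_law, C_medicine = 0.1 * rng.standard_normal((2, r, r))   # hypothetical domain modules

def morphed_forward(x, alpha):
    """Hybrid expert at interpolation weight alpha, applied without ever
    materializing the full d_out x d_in delta matrix."""
    C = alpha * C_law + (1.0 - alpha) * C_medicine   # morphing in coefficient space
    base = x @ W_base.T                              # frozen base path
    delta = ((x @ V) @ C.T) @ U.T                    # factored low-rank path
    return base + delta

x = rng.standard_normal((batch, d_in))
y = morphed_forward(x, alpha=0.3)

# Sanity check against the memory-hungry materialized version.
W_full = W_base + U @ (0.3 * C_law + 0.7 * C_medicine) @ V.T
assert np.allclose(y, x @ W_full.T)
print(y.shape, "matches materialized result")
```

The memory advantage in this picture is that only the shared basis and a few r × r coefficient tensors stay resident, rather than one full weight matrix per expert.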
The architecture is substrate-agnostic, with enablement described for digital tensor cores, analog processing-in-memory (PIM) crossbars, and photonic holographic computing. Together, these mechanisms convert monolithic MoE models into modular, post-training-extensible, independently distributable neural executable units.