
Today we’re releasing EMO, a new mixture-of-experts (MoE) model trained so modular structure emerges directly from data without human-defined priors.
Most LLMs are trained and deployed as one monolithic system, even when an application only needs a narrow capability like code or math. MoEs seem to break this pattern by activating only a few experts per token. But because routing varies from token to token, a standard MoE still relies on many experts across a full task.
EMO’s key idea: treat each training document as a weak signal of shared context. Instead of letting every token route independently, EMO restricts tokens from the same document to a shared expert pool, encouraging experts to organize around coherent domains.
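To make the mechanism concrete, here is a minimal sketch of document-restricted top-k routing. The function name, tensor shapes, and masking scheme are our illustration under stated assumptions, not EMO's exact implementation:

```python
import torch
import torch.nn.functional as F

def route_with_document_pool(router_logits, doc_expert_pool, k=8):
    """Top-k routing restricted to a per-document expert pool (sketch).

    router_logits:   (num_tokens, num_experts) router scores.
    doc_expert_pool: (num_tokens, num_experts) bool mask marking the
                     experts allowed for each token's source document.
    """
    # Mask out experts outside the document's pool before taking top-k.
    masked = router_logits.masked_fill(~doc_expert_pool, float("-inf"))
    weights, expert_ids = masked.topk(k, dim=-1)   # top-k within the pool
    weights = F.softmax(weights, dim=-1)           # renormalize over chosen experts
    return weights, expert_ids

# Toy usage: 4 tokens from one document, restricted to a 32-expert pool.
logits = torch.randn(4, 128)
pool = torch.zeros(4, 128, dtype=torch.bool)
pool[:, :32] = True
w, ids = route_with_document_pool(logits, pool)    # all ids fall inside the pool
```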
EMO’s expert clusters look very different from a traditional MoE’s: they organize around semantic domains like health, news, politics, and film/music. Traditional MoEs often cluster around surface patterns like prepositions and articles, which makes selective expert use much harder.
EMO is a 1B-active, 14B-total MoE trained on 1T tokens with 8 of 128 experts active per token. Without any subsequent fine-tuning, EMO remains robust when only a subset of experts is kept: with 25% of experts, it loses ~1 percentage point in overall performance; with 12.5%, it drops ~3 points. Standard MoEs degrade sharply under the same pruning.
In a smaller 130B-token setting, we show that EMO expert subsets also match or outperform memory-matched models trained from scratch. Instead of training many separate small models for fixed memory budgets, a single EMO model can provide many domain-specific expert subsets.
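As a rough illustration of how a domain-specific subset could be carved out of a trained MoE for a fixed memory budget, here is a hypothetical pruning helper; `experts`, `router_weight`, and `kept_ids` are assumed names for this sketch, not EMO's actual API:

```python
import torch
from torch import nn

def extract_expert_subset(experts, router_weight, kept_ids):
    """Keep only the experts in `kept_ids` and remap the router to match.

    experts:       nn.ModuleList of expert FFNs (assumed structure).
    router_weight: (num_experts, hidden_dim) routing matrix.
    Returns a smaller expert list and the matching router rows.
    """
    pruned_experts = nn.ModuleList(experts[i] for i in kept_ids)
    pruned_router = router_weight[kept_ids].clone()  # (len(kept_ids), hidden_dim)
    return pruned_experts, pruned_router

# e.g. a 12.5% subset: keep 16 of 128 experts serving one domain.
kept_ids = list(range(16))
```

At inference, top-k routing then runs over the pruned router only, so the dropped experts never need to be loaded into memory.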
We're releasing EMO, a matched standard-MoE baseline, and training code to help the community study modularity & expert selection:
🧠 Models: https://huggingface.co/collections/allenai/emo
📝 Blog: https://allenai.org/blog/emo
📄 Tech report: https://allenai.org/papers/emo
📊 Visualization: https://emovisualization.netlify.app/