r/allenai

▲ 38 r/allenai+1 crossposts

Today we’re releasing EMO, a new mixture-of-experts (MoE) model trained so modular structure emerges directly from data without human-defined priors.

Most LLMs are trained and deployed as one monolithic system, even when an application only needs a narrow capability like code or math. MoEs seem to break this pattern by using only a few experts per token. But across a full task, standard MoEs still rely on many experts.

EMO’s key idea: use each training document as a weak signal for shared context. Instead of letting every token route independently, EMO restricts tokens from the same document to a shared expert pool, encouraging experts to organize around coherent domains.
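Roughly, in sketch form (an illustration of the idea, not our exact implementation): a standard router lets every token pick its top-k experts from the full set, while EMO-style routing masks the router so tokens from one document can only pick from that document's shared pool. The pool size and assignment below are placeholders.

```python
import torch
import torch.nn.functional as F

def topk_routing(router_logits, k=8):
    """Standard MoE routing: every token may pick any of the experts."""
    weights, experts = router_logits.topk(k, dim=-1)
    return F.softmax(weights, dim=-1), experts            # [tokens, k] each

def document_restricted_routing(router_logits, doc_expert_pool, k=8):
    """EMO-style idea (sketch): tokens from the same document can only
    route to a shared pool of experts, so the top-k stays in that pool."""
    mask = torch.full_like(router_logits, float("-inf"))
    mask[:, doc_expert_pool] = 0.0                        # only pooled experts stay finite
    weights, experts = (router_logits + mask).topk(k, dim=-1)
    return F.softmax(weights, dim=-1), experts

# toy usage: 5 tokens from one document, 128 experts, 8 active per token,
# and a hypothetical 32-expert pool assigned to that document
logits = torch.randn(5, 128)
pool = torch.arange(32)
weights, experts = document_restricted_routing(logits, pool, k=8)
```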

EMO’s expert clusters look very different from a traditional MoE’s: they organize around semantic domains like health, news, politics, & film/music. Traditional MoEs often cluster around surface patterns like prepositions and articles, which makes selective expert use tougher.

EMO is a 1B-active, 14B-total MoE trained on 1T tokens with 8 of 128 experts active per token. Without any subsequent fine-tuning, EMO remains robust when only a subset of experts is kept: with 25% of experts, it loses ~1 percentage point in overall performance; with 12.5%, it drops ~3 points. Standard MoEs degrade sharply.

In a smaller 130B-token setting, we show that EMO expert subsets also match or outperform memory-matched models trained from scratch. Instead of training many separate small models for fixed memory budgets, one EMO model can provide many domain-specific expert subsets.
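One way you could carve out such a subset (a sketch of the general idea; see the tech report for the actual selection procedure): run some in-domain text through the router, count which experts land in each token's top-k most often, and keep that fraction.

```python
import torch

def pick_experts_to_keep(router_logits_batches, keep_fraction=0.25, k=8):
    """Count how often each expert lands in a token's top-k on in-domain
    text, then keep the most frequently used fraction of experts."""
    num_experts = router_logits_batches[0].shape[-1]
    counts = torch.zeros(num_experts, dtype=torch.long)
    for logits in router_logits_batches:                  # each: [tokens, num_experts]
        _, experts = logits.topk(k, dim=-1)
        counts += torch.bincount(experts.flatten(), minlength=num_experts)
    num_keep = max(k, int(keep_fraction * num_experts))   # never fewer than k experts
    return counts.topk(num_keep).indices                  # ids of experts to retain

# toy example: router logits for two small in-domain batches, 128 experts
batches = [torch.randn(256, 128), torch.randn(256, 128)]
kept = pick_experts_to_keep(batches, keep_fraction=0.125)   # keep 16 of 128
```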

We're releasing EMO, a matched standard-MoE baseline, and training code to help the community study modularity & expert selection:

🧠 Models: https://huggingface.co/collections/allenai/emo
📝 Blog: https://allenai.org/blog/emo
📄 Tech report: https://allenai.org/papers/emo

📊 Visualization: https://emovisualization.netlify.app/
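If you want to poke at the weights, the usual transformers loading flow should work; the checkpoint id below is a placeholder, so grab the real one from the collection linked above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; check the linked collection for the actual checkpoint name.
model_id = "allenai/EMO-1B-A14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",   # requires `accelerate`; drop it to load on CPU
)

prompt = "Mixture-of-experts models route each token to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```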

u/ai2_official — 5 days ago
▲ 15 r/allenai+1 crossposts

Today we’re bringing new NSF OMAI compute online with NVIDIA Blackwell Ultra-powered systems, turning a $152M national investment from NSF & NVIDIA into a foundation for truly open AI research.

https://preview.redd.it/y1cexymrfqzg1.jpg?width=2048&format=pjpg&auto=webp&s=1da18fbb4b000c9ba7744da210ebe54d3ab5075b

https://preview.redd.it/39twiymrfqzg1.jpg?width=2048&format=pjpg&auto=webp&s=2e8742133dae244f8144f477fbf5b943b73f17f1

https://preview.redd.it/qd0b8zmrfqzg1.jpg?width=2048&format=pjpg&auto=webp&s=39623fd2608a27dc355b49cbabeffa2fcc00cf63

Built on NVIDIA B300 systems and deployed with Cirrascale Cloud Services, the new cluster supports scaled training and experimentation across language, multimodal, and scientific AI, helping extend research directions behind models like Molmo 2 & Olmo Hybrid.

Our research estimates that in today’s model training efforts, 82% of compute goes into exploratory work. At closed labs, the output of that work stays within those labs. In an open system, models, datasets, & methods are shared, and the value compounds across the field.

With the new NSF OMAI compute now online, Ai2 is building toward open, reusable AI systems that researchers can deeply inspect, study, and customize.

→ Read more in our blog: https://allenai.org/blog/omai-compute-now-live

u/ai2_official — 6 days ago
▲ 17 r/allenai

Recipes for teaching LLMs to handle long inputs don’t work equally well across model families. We wanted to understand why. 👇

We trained 26 7B models on the same data with the same context-extension recipe, varying only the architecture. We found that four common design choices – QK normalization, grouped-query attention, sliding-window attention, and shorter pretraining context length – can compound to reduce long-context scores by up to 47%.
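For anyone who wants to see concretely what those knobs are, here's a rough, illustrative attention sketch (not our training code): QK normalization rescales queries and keys before the dot product (simplified to unit length here, standing in for a learned RMSNorm), grouped-query attention shares each key/value head across several query heads, and sliding-window attention masks out keys beyond a fixed distance.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, *, qk_norm=True, window=None):
    """q: [q_heads, T, d]; k, v: [kv_heads, T, d]. Toy single-sequence sketch."""
    if qk_norm:
        # QK normalization: rescale queries/keys before the dot product.
        q = q / q.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        k = k / k.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    # Grouped-query attention: each kv head is shared by several query heads.
    groups = q.shape[0] // k.shape[0]
    k = k.repeat_interleave(groups, dim=0)
    v = v.repeat_interleave(groups, dim=0)
    T, d = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))            # causal mask
    if window is not None:
        # Sliding-window attention: each token only sees the last `window` positions.
        mask &= torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=-(window - 1))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# 8 query heads sharing 2 kv heads, 16 tokens, 64-dim heads, window of 4
out = attention(torch.randn(8, 16, 64), torch.randn(2, 16, 64), torch.randn(2, 16, 64), window=4)
```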

The problem is hard to catch early. Training loss, validation perplexity, and 16 short-context benchmarks all failed to predict 32K/64K performance in our experiments. More data didn’t close the gap, either—even after 50B tokens of long-context training, the weakest architecture still couldn’t match what Llama’s architecture reached after 1B tokens.

We’re releasing 26 models covering pretraining and context extension to support better extension methods and research on early pretraining dynamics.

📝 Blog: https://allenai.org/blog/olmpool

📄 Tech report: https://allenai.org/papers/olmpool

🤗 Models: https://huggingface.co/collections/allenai/olmpool

💻 Code: https://github.com/allenai/olmpool/tree/main

u/ai2_official — 13 days ago