
🌎 Introducing AIMIP: an open benchmark for comparing AI climate models over multi-decade simulations

Our new AI Model Intercomparison Project (AIMIP) brings together a shared benchmark experiment and dataset to make it easier to compare AI climate models side by side over multi-decade simulations. 🌎

We need transparent ways to evaluate how AI climate models perform on long-horizon forecasting. Weather models already have common evals like WeatherBench; AIMIP is a shared benchmark for AI climate modeling in the spirit of the Coupled Model Intercomparison Project (CMIP).

For AIMIP, models forecast the global atmosphere over 1979–2024, using historical data from 1979–2014 for training and leaving the final decade held out for testing. The benchmark focuses on the atmosphere alone, and leaves model architecture choices up to each submitter.
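
As a rough illustration of the setup (not AIMIP's actual evaluation code), here is a minimal sketch assuming gridded monthly fields indexed by year; the array names, shapes, and metrics are placeholders:

```python
import numpy as np

# Illustrative gridded monthly fields, shape (year, month, lat, lon).
# Names and shapes are assumptions, not AIMIP's actual data layout.
years = np.arange(1979, 2025)                      # full 1979-2024 record
truth = np.random.rand(len(years), 12, 90, 180)    # stand-in for reanalysis
model = np.random.rand(len(years), 12, 90, 180)    # stand-in for an AI model

train = years <= 2014        # 1979-2014 available for training
test = years >= 2015         # final decade held out for evaluation

# 1) Overall climate average: bias of the held-out climatology.
clim_bias = model[test].mean(axis=(0, 1)) - truth[test].mean(axis=(0, 1))
print("climatology RMSE:", np.sqrt((clim_bias ** 2).mean()))

# 2) Long-term trend: slope of the global annual mean over the full record.
def trend(x):
    annual = x.mean(axis=(1, 2, 3))        # global annual mean per year
    return np.polyfit(years, annual, 1)[0]

print("trend, model vs. truth:", trend(model), trend(truth))
```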

AIMIP evaluates model performance on:

◙ Overall climate averages

◙ Long-term trends

◙ El Niño-related atmospheric responses

◙ Day-to-day variability

◙ Out-of-sample behavior under warmer sea surface temperatures

For AIMIP’s first phase, 6 modeling groups – including Google Research, NVIDIA, and ArchesWeather – submitted 8 AI models spanning approaches such as hybrid systems, full autoregressive emulation, and conditioned diffusion.

The early results are promising—most submissions perform well on average historical climate patterns and often beat a conventional physically-based model on that task. But the picture is mixed on long-term warming trends, where some models underestimate warming significantly.

We also tested the models on harder scenarios, such as a rapidly warming ocean unlike anything in their training data. In those tests, the models diverged much more, showing that generalization remains a major challenge.

We’re releasing the first-phase AIMIP dataset and our analysis of it. We hope to continue AIMIP with future phases that expand its scope and scale.

📘 Learn more in our blog: https://allenai.org/blog/AIMIP

📊 Paper: https://arxiv.org/abs/2605.06944

🗂️ Dataset: https://github.com/ai2cm/AIMIP/tree/main/evaluations#data

u/ai2_official — 18 hours ago
▲ 17 r/allenai

🧪 Introducing MyScholarQA: AI-powered personalized scientific deep research

Now available in AstaLabs in limited research preview: MyScholarQA, a personalized version of ScholarQA for scientific deep research. 👇

ScholarQA helps synthesize evidence from 12M+ open-access papers. MyScholarQA adds user profiles to tailor that synthesis to you.

AstaLabs is where we share experimental research tools from Asta, our platform for AI-assisted scientific discovery. MyScholarQA builds on ScholarQA, which powers parts of Asta, to explore how deep research systems can better understand the researcher asking the question.

Researchers bring different expertise, methods, audiences, & goals to the same literature as they compile reports. MyScholarQA uses a profile built from papers you choose so reports reflect that context, from what you already know to how you prefer research to be framed.
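
For intuition only, here's one way a paper-based profile could be bootstrapped: pull metadata for the papers a user selects and distill recurring topics. The sketch assumes the public Semantic Scholar Graph API and a naive keyword count; it is not how MyScholarQA actually builds profiles.

```python
import collections
import re
import requests

S2_API = "https://api.semanticscholar.org/graph/v1/paper/"

def fetch_paper(paper_id: str) -> dict:
    """Fetch title and abstract for a paper the user added to their profile."""
    resp = requests.get(S2_API + paper_id, params={"fields": "title,abstract"})
    resp.raise_for_status()
    return resp.json()

def naive_profile(paper_ids: list[str], top_k: int = 10) -> list[str]:
    """Very rough interest profile: the most frequent non-trivial words."""
    counts = collections.Counter()
    for pid in paper_ids:
        paper = fetch_paper(pid)
        text = f"{paper.get('title', '')} {paper.get('abstract') or ''}".lower()
        counts.update(re.findall(r"[a-z]{5,}", text))
    return [word for word, _ in counts.most_common(top_k)]

# Usage with a placeholder paper ID the user pasted in:
# print(naive_profile(["<semantic-scholar-paper-id>"]))
```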

We tested MyScholarQA against deep research systems including OpenScholar, Perplexity Sonar Deep Research, and OpenAI deep research powered by o3. Its reports answered research questions more completely and cited sources more accurately & consistently.

How it works in AstaLabs:

1️⃣ Add papers by pasting Semantic Scholar paper URLs or an author profile URL. MyScholarQA infers your research interests, and you can review & customize each inference.

​2️⃣ Then ask a research question. MyScholarQA proposes actions for the report—papers to look for, connections to your work, or framing to use. Adjust the plan, then generate a report grounded in ScholarQA's synthesis over millions of open-access papers.

Try MyScholarQA in AstaLabs and read the paper behind the system:

🔬 AstaLabs: https://personalized-scholarqa.apps.allenai.org/ 

📄 Paper: https://arxiv.org/abs/2603.16120 

📊 Analysis of user feedback collected in MyScholarQA: https://arxiv.org/abs/2604.23815

u/ai2_official — 1 day ago
▲ 15 r/allenai

📊 How Artificial Analysis is using Ai2's IFBench to probe frontier model instruction following

Artificial Analysis relies on our IFBench eval to test how closely models follow user prompts. 👇

Most evals in AA’s Intelligence Index saturate within months. IFBench hasn't because it measures what others miss—and what frontier models still struggle with. 

Accepted to NeurIPS 2025, IFBench tests how well language models follow precise output constraints. It asks models to do things like answer only with “yes” or “no,” mention a specific word at least three times, or hit an exact sentence, word, or character count.
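
To make the flavor of those constraints concrete, here's a minimal sketch of the kind of programmatic checks they imply; the function names and examples are ours, not IFBench's actual harness:

```python
import re

def only_yes_or_no(response: str) -> bool:
    """Constraint: answer only with "yes" or "no"."""
    return response.strip().lower() in {"yes", "no"}

def mentions_at_least(response: str, word: str, n: int = 3) -> bool:
    """Constraint: mention a specific word at least n times."""
    return len(re.findall(rf"\b{re.escape(word)}\b", response, re.I)) >= n

def exact_word_count(response: str, target: int) -> bool:
    """Constraint: hit an exact word count."""
    return len(response.split()) == target

# A topically correct answer can still fail the format constraints:
answer = "Yes, I believe that is correct."
print(only_yes_or_no(answer))            # False: extra words break the rule
print(mentions_at_least(answer, "yes"))  # False: "yes" appears only once
```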

Together, those constraints expose a common failure mode: a model can understand the topic and still miss part of a request. "IFBench measures instruction following in a way that feels closer to real-world use than earlier instruction following evals," says AA’s Declan Jackson.

Inside AA's Intelligence Index, IFBench surfaces where instruction-following is improving, where progress is uneven, and how models that score well overall can still struggle with precise prompts. That kind of granularity is hard to see in aggregate scores alone.

IFBench is fully open so anyone can inspect it and run it across models. Open benchmarks make adoption like this possible, and they're how the field builds shared evaluation standards. 

📝 Read more: https://allenai.org/blog/ifbench-artificial-analysis

📊 IFBench: https://github.com/allenai/IFBench

u/ai2_official — 3 days ago
▲ 38 r/allenai+1 crossposts

Today we’re releasing EMO, a new mixture-of-experts (MoE) model trained so modular structure emerges directly from data without human-defined priors.

Most LLMs are trained and deployed as one monolithic system, even when an application only needs a narrow capability like code or math. MoEs seem to break this pattern by using only a few experts per token. But across a full task, standard MoEs still rely on many experts.

EMO’s key idea: use each training document as a weak signal for shared context. Instead of letting every token route independently, EMO restricts tokens from the same document to a shared expert pool, encouraging experts to organize around coherent domains.
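
A minimal sketch of that routing restriction, assuming a standard top-k router and a hypothetical per-document expert pool; this illustrates the idea rather than reproducing EMO's training code:

```python
import torch

def route_within_document_pool(router_logits, doc_expert_pool, k=8):
    """
    router_logits: (tokens, num_experts) scores for every token in one document.
    doc_expert_pool: indices of the experts this document is allowed to use.
    Returns top-k expert weights and indices per token, restricted to the pool.
    """
    num_experts = router_logits.size(-1)
    mask = torch.full((num_experts,), float("-inf"))
    mask[doc_expert_pool] = 0.0                   # only pool experts survive
    weights, experts = (router_logits + mask).topk(k, dim=-1)
    return torch.softmax(weights, dim=-1), experts

# Toy example: 4 tokens from one document, 128 experts, a 32-expert pool.
logits = torch.randn(4, 128)
pool = torch.randperm(128)[:32]
probs, experts = route_within_document_pool(logits, pool, k=8)
print(experts)   # every selected index lies inside `pool`
```

The same masking trick hints at why a domain's expert subset can be carved out at inference once experts cluster by domain.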

EMO’s expert clusters look very different from a traditional MoE—they organize around semantic domains like health, news, politics, & film/music. Traditional MoEs often cluster around surface patterns like prepositions and articles, making selective expert use tougher.

EMO is a 1B-active, 14B-total MoE trained on 1T tokens with 8 of 128 experts active per token. Without any subsequent fine-tuning, EMO remains robust when only a subset of experts is kept: with 25% of experts, it loses ~1 percentage point in overall performance; with 12.5%, it drops ~3 points. Standard MoEs degrade sharply.

In a smaller 130B-token setting, we also show that EMO subsets match or outperform memory-matched models trained from scratch. Instead of training many separate small models for fixed memory budgets, one EMO model can provide many domain-specific expert subsets.

We're releasing EMO, a matched standard-MoE baseline, and training code to help the community study modularity & expert selection:

🧠 Models: https://huggingface.co/collections/allenai/emo
📝 Blog: https://allenai.org/blog/emo
📄 Tech report: https://allenai.org/papers/emo

📊 Visualization: https://emovisualization.netlify.app/

u/ai2_official — 6 days ago
▲ 15 r/allenai+1 crossposts

Today we’re bringing new NSF OMAI compute online with NVIDIA Blackwell Ultra-powered systems, turning a $152M national investment from NSF & NVIDIA into a foundation for truly open AI research.

Built on NVIDIA B300 systems and deployed with Cirrascale Cloud Services, the new cluster supports scaled training and experimentation across language, multimodal, and scientific AI, helping extend research directions behind models like Molmo 2 & Olmo Hybrid.

Our research estimates that in today’s model training efforts, 82% of compute goes into exploratory work. At closed labs, the output of that work stays within those labs. In an open system, models, datasets, & methods are shared, and the value compounds across the field.

With the new NSF OMAI compute now online, Ai2 is building toward open, reusable AI systems that researchers can deeply inspect, study, and customize.

→ Read more in our blog: https://allenai.org/blog/omai-compute-now-live

u/ai2_official — 7 days ago
▲ 19 r/allenai+1 crossposts

Today we're releasing MolmoAct 2, a fully open robotics foundation model that makes coffee, buses tables, and assists with lab tasks. 🤖

Robotics models often struggle outside controlled environments. MolmoAct 2 is designed for real ones. Building on our first Action Reasoning Model (ARM), it reasons in 3D before acting, runs up to 37x faster, and handles two-armed tasks with no per-task fine-tuning.

We retained Cortex AI to run a third-party real-world fine-tuning benchmark. 📊 Across 50 trials on a suite of tabletop, in-the-wild, and mobile tasks, MolmoAct 2 outperformed systems including OpenVLA-OFT, π0.5, X-VLA, and Cosmos Policy.

We're already testing MolmoAct 2 outside controlled setups. In our office café, it makes popcorn and drinks while people move around it, and handles practical tasks such as wiping surfaces, lifting trays, and folding towels. ☕

We've also piloted MolmoAct 2 with research partners including a Stanford Medicine team using it for hands-on CRISPR gene-editing work. It moves samples, uses lab equipment, and recovers from small mistakes during long experiments.

To lower the barrier to entry, we're sharing an affordable reference hardware setup: two YAM arms, overhead and close-up cameras, an extendable mount, and a tabletop workspace for bimanual manipulation. 🦾

Robotics models are often closed. MolmoAct 2 isn't. We're releasing model weights, an updated VLA architecture, a fully open action tokenizer, and the MolmoAct 2-Bimanual YAM dataset—the largest open bimanual robotics dataset on real-world tasks to date.

📝 Learn more in our blog: https://allenai.org/blog/molmoact2

🤖 Models: https://huggingface.co/collections/allenai/molmoact2-models

📊 Training dataset: https://huggingface.co/collections/allenai/molmoact2-datasets

u/ai2_official — 9 days ago
▲ 11 r/allenai

How do you train a coding agent to solve problems it hasn’t seen before? 👇

On Dev Interrupted, Ai2’s Tim Dettmers explains why it helps to teach models how developers approach a task—understand the request, find the right code, make a change, and check the work.

That idea is at the core of SERA, the first model in Ai2’s Open Coding Agents family. SERA shows how smaller models can learn the way developers work through coding tasks, making it easier for teams to adapt coding agents to their own codebases.

→ Listen to the full episode: https://podcasts.apple.com/us/podcast/the-best-model-for-your-team-you-havent-invented-it/id1537003676?i=1000762673427

u/ai2_official — 10 days ago
▲ 11 r/allenai

Today we published a Q&A with Interim CEO Peter Clark on what’s next for Ai2, from advancing truly open AI systems to applying AI in areas like scientific discovery & the planet.

The conversation covers why open models remain central to our work—and how we’re thinking about the road ahead.

→ Read it here: https://allenai.org/blog/peter-clark-qa

u/ai2_official — 13 days ago
▲ 17 r/allenai

Recipes for teaching LLMs to handle long inputs don’t work equally well across model families. We wanted to understand why. 👇

We trained 26 7B models on the same data with the same context-extension recipe, varying only the architecture. We found that four common design choices – QK normalization, grouped-query attention, sliding-window attention, and shorter pretraining context length – can compound to reduce long-context scores by up to 47%.
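
For readers less familiar with two of the design choices named above, here's a generic, textbook-style sketch of QK normalization and grouped-query attention; it's illustrative only, not the paper's model code:

```python
import torch
import torch.nn.functional as F

def gqa_with_qk_norm(q, k, v, eps=1e-6):
    """
    q: (q_heads, seq, dim); k, v: (kv_heads, seq, dim), q_heads % kv_heads == 0.
    QK normalization: L2-normalize queries and keys before the dot product.
    Grouped-query attention: several query heads share one key/value head.
    """
    q = F.normalize(q, dim=-1, eps=eps)
    k = F.normalize(k, dim=-1, eps=eps)
    repeat = q.size(0) // k.size(0)
    k = k.repeat_interleave(repeat, dim=0)        # share KV heads across groups
    v = v.repeat_interleave(repeat, dim=0)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes: 8 query heads sharing 2 KV heads over a 16-token sequence.
out = gqa_with_qk_norm(torch.randn(8, 16, 64),
                       torch.randn(2, 16, 64),
                       torch.randn(2, 16, 64))
print(out.shape)   # torch.Size([8, 16, 64])
```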

The problem is hard to catch early. Training loss, validation perplexity, and 16 short-context benchmarks all failed to predict 32K/64K performance in our experiments. More data didn’t close the gap, either—even after 50B tokens of long-context training, the weakest architecture still couldn’t match what Llama’s architecture reached after 1B tokens.

We’re releasing 26 models covering pretraining and context extension to support better extension methods and research on early pretraining dynamics.

📝 Blog: https://allenai.org/blog/olmpool

📄 Tech report: https://allenai.org/papers/olmpool

🤗 Models: https://huggingface.co/collections/allenai/olmpool

💻 Code: https://github.com/allenai/olmpool/tree/main

u/ai2_official — 14 days ago

New AstaBench results show frontier models making progress on scientific research, but the benchmark remains far from solved. 🧪 

AstaBench measures how well AI agents perform various scientific tasks, from finding papers and writing code to analyzing datasets and running end-to-end discovery workflows. In this update, we tested the latest frontier models across 2.4K+ research problems using the ReAct agent framework.

📊 The topline: Claude Opus 4.7 ranks first overall at 58.0%, followed by Opus 4.6 and Sonnet 4.6. GPT-5.5 reaches 52.9% at $1.61 per problem, coming within 5.1 points of Opus 4.7 at less than half the measured cost per problem.

⚖️ The gains are uneven. GPT-5.5 leads Code & Execution and Data Analysis, and narrowly leads the top Claude run on Literature Understanding. But Claude Opus 4.7 still leads End-to-End Discovery, the hardest category in the suite.

🔬 That split has big implications: strong performance on coding, literature understanding, and data analysis doesn’t automatically translate into robust end-to-end scientific work. The hardest workflows are also where the highest costs show up, while Data Analysis remains relatively inexpensive across the new frontier runs.

We built AstaBench to give the field a shared, transparent way to measure whether AI can do rigorous scientific work—not just isolated tasks. We’re pleased to see adoption with the UK AISI via Inspect Evals and General Reasoning, which added an AstaBench task to OpenReward.

If you’re building scientific agents, join Elicit, SciSpace, Distyl AI, EvoScientist, and others testing on AstaBench.

📝 Learn more: https://allenai.org/blog/astabench-update-spring-2026

📊 Full leaderboard: https://allenai-asta-bench-leaderboard.hf.space/home

u/ai2_official — 14 days ago
▲ 13 r/allenai

When we released Molmo, it was a bet that open vision-language models could compete with closed systems. Since then, Molmo has grown into a family of open visual AI building blocks for pointing, web interaction, 3D perception, & robotics. 👇

🔎 MolmoPoint helps identify the exact pixel, UI element, object, or video moment that matters, grounding what it sees in a form downstream apps can use. As Molmo research lead Chris Clark puts it, “Having models that can point is important for many things, including interpretability.”

🌐 MolmoWeb brings that same visual grounding into the browser. Given an instruction and a screenshot, it predicts the next action, from clicking and typing to navigating through a web interface. Instead of relying on website code that can change underneath it, MolmoWeb works from what the model can see.

The bigger story is how visual AI is moving from description to action: models that don’t just answer questions about images or videos, but use visual understanding to point, click, track, navigate, & interact.

→ Read more in our latest post: https://allenai.org/blog/molmo-learns-to-point-and-act

u/ai2_official — 14 days ago

OlmoEarth Studio now lets you compute and export custom embedding vectors from our OlmoEarth foundation models. 🌍

Choose your area, time range, encoder, resolution, and imagery sources, and Studio returns a GeoTIFF you can use however you like.

Instead of a single predicted label for each location, embeddings give you a numerical representation useful for tasks like similarity search, few-shot segmentation, unsupervised exploration, and change detection—all without fine-tuning.

For example, you can compare two time periods to see what changed on the ground. Or you can reduce embeddings to three dimensions with PCA, map them to RGB, and display the result as false color. 
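
A quick sketch of that false-color trick on an exported embedding GeoTIFF, assuming a bands-first array as read by rasterio; the file name and band layout are placeholders:

```python
import numpy as np
import rasterio
from sklearn.decomposition import PCA

# Hypothetical path to an embedding export from OlmoEarth Studio.
with rasterio.open("embeddings.tif") as src:
    emb = src.read()                      # (embedding_dim, height, width)

d, h, w = emb.shape
pixels = emb.reshape(d, -1).T             # one row per pixel

# Reduce each pixel's embedding to 3 components and treat them as RGB.
rgb = PCA(n_components=3).fit_transform(pixels).reshape(h, w, 3)

# Stretch each channel to 0-255 for display as a false-color image.
lo, hi = rgb.min(axis=(0, 1)), rgb.max(axis=(0, 1))
false_color = ((rgb - lo) / (hi - lo + 1e-9) * 255).astype(np.uint8)
print(false_color.shape)                  # (height, width, 3)
```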

Custom embedding exports are available now in OlmoEarth Studio.

🔗 Blog: https://allenai.org/blog/olmoearth-embeddings 

🌍 More on OlmoEarth: https://allenai.org/olmoearth

u/ai2_official — 21 days ago
▲ 17 r/allenai

We're at #ICLR2026 with papers & talks across the conference. Come say hello and learn about our latest research!

u/ai2_official — 21 days ago
▲ 10 r/allenai+1 crossposts

This Earth Day marks 10 years of Ai2 helping get real-time intelligence into the hands of the people protecting the planet—across land, sea, and everything in between.

EarthRanger brings together GPS collars, camera traps, patrol reports, and sensors into one real-time view for conservation teams across 900+ protected areas in 95 countries. In Thailand, AI-enabled camera traps and community rangers can now mobilize within minutes when elephants leave cover.

Skylight uses satellite imagery and millions of daily vessel signals to help surface potential illegal fishing in near real time. Earlier this year, Argentina used it to identify and fine a vessel without boarding it. We’re also expanding this work with SkyTruth to help bring pollution data into view.

OlmoEarth is our open foundation model for Earth observation, built to help accelerate how AI is applied to protect the planet. Trained on roughly 10TB of satellite and sensor data, it powers Skylight and helps deliver actionable intelligence for partners like Global Mangrove Watch.

The environmental challenges ahead are accelerating, and our commitment is to keep building for the people on the frontlines. EarthRanger, Skylight, and OlmoEarth are all released openly and at no cost.

→ Learn more: https://allenai.org/blog/earth-day-2026

u/Jlyplaylists — 21 days ago

Now available in AutoDiscovery: Reuse already-uploaded datasets, modify session configurations, & include insights from past runs to iterate over promising findings. 👇

AutoDiscovery autonomously explores your data, generates hypotheses, & runs experiments—surfacing findings you might not think to look for. 

Researchers have generated 43K+ hypotheses across oncology, neuroscience, marine ecology, social science, cybersecurity, climate, & more. 🧪

The new run configuration feature lets you branch from a past session and its already-uploaded data, accelerating your exploration.

→ Try it here: https://autodiscovery.allen.ai/

u/ai2_official — 23 days ago
▲ 18 r/allenai

WildDet3D is now even more open. 🚀

We’re releasing the training code, updated inference code, and training + data prep instructions so researchers and developers can reproduce the model, study how it works, and build on it for their own needs.

WildDet3D can turn a single image into a richer 3D understanding of a scene, which makes it useful for applications in VR and AR, robotics, and countless digital tools that need to place objects in 3D space.

💻 Get the code: https://github.com/allenai/WildDet3D

📝 Learn more about WildDet3D in our blog: https://allenai.org/blog/wilddet3d

u/ai2_official — 23 days ago