u/fargento

Shouldn't alignment evals be on the model's main launch scorecard?

  • Every frontier model release leads with the same or very similar benchmarks. None of them tell you whether the model is likely to lie to you or on your behalf. None of them tell you if the model will try to cheat, sandbag your request or act shady/Machiavellian in general.
  • Alignment evaluations seem to exist, but they’re not treated as first-class information. They're hard to compare across models and labs. There is no canonical alignment number for Opus 4.7, GPT-5.5, or Gemini 3.1 Pro that I could find.
  • Everyone should care about this number, not only the AI-risk crowd. It’s a short-term/current user problem too. “Will this model lie about whether the test passed? Will it pretend a function exists because admitting it doesn’t is inconvenient? Will this agent act shady on my behalf? How likely is it to commit a crime?”
  • Featuring an easy-to-digest alignment number in model announcement threads/blog posts has three important side effects: developers notice they should worry about it, academics race to build better versions of the benchmark, and labs start competing on the metric.
  • Even a bad first benchmark is useful. Publishing an imperfect one is how you create the incentive for someone to build a better one.

I also wrote a longer post that expands on these points a bit more:
https://fargento.substack.com/p/alignment-benchmarks-belong-on-the

u/fargento — 7 days ago