u/ChippingCoder


This private benchmark tests whether a model can recover the exact title of a real, already-published scientific paper given only its abstract. The model isn't being asked to generate a plausible-sounding title; it has to recall the specific one that actually exists, purely from memory. It's analogous to identifying a book or movie from a plot summary. This makes it an effective proxy for a model's ability to accurately attribute scientific claims to their correct source.

I find the jump between GPT 5.4 and GPT 5.5 interesting. Does anyone have any insight on that? (Even 5.4 mini is outperforming 5.4.)

Note: Results are AVG @ 5
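The setup described above can be sketched as a small scoring loop: sample the model 5 times per abstract, check each completion against the paper's true title, and average. This is a minimal sketch, not the benchmark's actual code; the function names and the case/punctuation-insensitive matching rule are my assumptions.

```python
import re

def normalize(title: str) -> str:
    # Assumed leniency: compare titles case- and punctuation-insensitively.
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def avg_at_5(true_title: str, attempts: list[str]) -> float:
    # AVG @ 5: mean exact-match score over 5 sampled completions.
    target = normalize(true_title)
    return sum(normalize(a) == target for a in attempts) / len(attempts)

# Hypothetical example: 5 sampled titles for one abstract.
attempts = [
    "Attention Is All You Need",
    "attention is all you need.",
    "Transformers: Attention Is All You Need",
    "Attention Is All You Need",
    "All You Need Is Attention",
]
print(avg_at_5("Attention Is All You Need", attempts))  # → 0.6
```

Under this scoring, a paraphrased or reordered title counts as a miss, which is what makes the task a recall test rather than a generation test.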
