Maybe SWE-bench Verified was never just a model benchmark
Recently, while browsing, I came across a few more discussions about SWE-bench Verified (OpenAI's human-validated subset of SWE-bench), and they made me think about what this metric really means.
When SWE-bench Verified and the evaluations behind it first appeared, I think the original purpose was very reasonable: we needed a shared standard to judge a model’s coding ability. It gave people a more concrete way to compare whether a model could actually solve software engineering tasks, instead of just writing code that looked good.
But over time, I started to feel that the meaning of metrics like SWE-bench Verified was slowly being distorted.
They stopped being just evaluation standards and became marketing points for model products. A high score could make a model look very strong in launch posts, product pages, and benchmark tables.
This is why OpenAI’s recent blog post was interesting to me. In it, OpenAI said that SWE-bench Verified is no longer suitable for measuring frontier coding capability: some tests may reject correct solutions, benchmark contamination has become harder to avoid, and models may have already seen the original problems or gold patches. For these reasons, OpenAI stopped reporting SWE-bench Verified scores.
To me, this shows something important: a metric becomes an evaluation standard only because it sustains consensus for a period of time. But for engineers, some questions do not change just because a benchmark becomes outdated.
Can it run the tests? Can it inspect the failure and revise the patch? These questions do not expire as quickly as a benchmark does. And I think these are the questions that have to be tested with real cases in your own hands.
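That loop, run the tests, read the failure, revise the patch, is easy to sketch. Here is a toy version in Python, using in-memory "patches" instead of a real repository; all names and test cases are illustrative, not from any benchmark:

```python
# Toy sketch of the run-tests / inspect-failure / revise-patch loop.
# In practice "run_tests" would invoke your project's real test suite;
# here it checks a candidate implementation of a hypothetical add().

def run_tests(impl):
    """Run a tiny test suite against a candidate implementation.
    Returns (passed, failure_message)."""
    cases = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]
    for args, expected in cases:
        got = impl(*args)
        if got != expected:
            return False, f"add{args}: expected {expected}, got {got}"
    return True, ""

# Candidate "patches" a model might propose, worst first.
candidates = [
    lambda a, b: a - b,  # looks plausible, fails the tests
    lambda a, b: a + b,  # correct
]

for i, patch in enumerate(candidates):
    passed, failure = run_tests(patch)
    print(f"patch {i}: {'PASS' if passed else 'FAIL ' + failure}")
    if passed:
        break
```

The point of the sketch is that the failure message is what drives the revision, which is exactly the capability a leaderboard number cannot show you.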
This is similar to how I am currently testing Ring’s new 2.6 product. Public metrics can be useful as an initial filter or reference: they help me decide what is worth paying attention to. But whether a model or product should actually enter my workflow still depends on my own cases.
Benchmarks will change. Leaderboards will expire. Marketing numbers will lose meaning. But the real question stays the same:
Can this thing actually help me ship correct code?