u/Full_Cost2909

Fun benchmark got more fun

Sorry for the week-long delay guys, but the benchmark is back!! Meanwhile the engine got some upgrades, so you can try it out for yourself if you want.

Currently the engine is coupled to Python, but I might extend it to other languages depending on your wishes. There were some improvements, and I also scraped together a frontend (work in progress, so if you find something broken please let me know) for easier visibility and future benchmarks.

Since the last session I've added a couple of new contestants per your wishes, and created a new section, Model Royale, which displays the results of the latest run.

Model Royale is just a consumer of the engine, and every model also runs judgment on itself so you can see the bias. The regular benchmarks, which resemble real agentic workflows, will be added on the benchmarks page, which is still a work in progress. I'm not happy with the UI yet, but I just wanted to go live and polish the site later. Sorry also about the generic text on the page; I felt lazy writing it last week.
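
Concretely, the bias check just compares each model's self-score with what the other judges gave it. A minimal sketch, assuming a made-up score layout (not the engine's real output format):

```python
# Sketch only: the dict layout and model names are placeholders.
scores = {
    # judge -> {contestant -> score}
    "model_a": {"model_a": 9, "model_b": 7, "model_c": 6},
    "model_b": {"model_a": 8, "model_b": 9, "model_c": 7},
    "model_c": {"model_a": 7, "model_b": 8, "model_c": 8},
}

for contestant in scores:
    self_score = scores[contestant][contestant]
    peer_scores = [scores[j][contestant] for j in scores if j != contestant]
    peer_avg = sum(peer_scores) / len(peer_scores)
    # a positive gap means the model rates itself higher than its peers rate it
    print(f"{contestant}: self={self_score}, peers={peer_avg:.1f}, gap={self_score - peer_avg:+.1f}")
```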

I'm not sure whether the Model Royale mode should get a completely new task each week or carry continuity over from the previous week. I'm open to all ideas, and any feedback would be more than welcome.

If you want a specific task tested but feel too lazy to do it yourself, let me know and I'd be more than happy to run it.

You can see the full results of the most recent round here.

edit: put localhost instead of the real link

u/Full_Cost2909 — 1 day ago

I made a simple benchmark, just for fun, that tests 3 coding models against the same specification to actually see what they can do. The prompt was intentionally simple. The ultimate goal is to build a functional sandbox application and see the results in the end. The models judge each other's outputs without knowing it (at least, that's what the models say).
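
The blind-judging part is conceptually simple: each judge only sees the submissions under neutral labels. A rough sketch of the idea (ask_model is a stand-in here, not a function from the repo):

```python
import random

def judge_round(judge_name, outputs, ask_model):
    """outputs: {model_name: generated_code}; ask_model: callable that sends a prompt to a model."""
    labeled = list(outputs.items())
    random.shuffle(labeled)  # hide any ordering clue about who wrote what
    prompt = "Score each submission from 1 to 10 against the spec.\n\n"
    for i, (_, code) in enumerate(labeled, start=1):
        prompt += f"--- Submission {i} ---\n{code}\n\n"
    verdict = ask_model(judge_name, prompt)
    return labeled, verdict  # keep the label -> model mapping so scores can be attributed afterwards
```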

This was made just for fun, and I've never posted on Reddit before, so this seemed like a nice first post. I would appreciate any kind of feedback and ideas for the next round's prompt.

The test of course has some limitations, which are discussed in the docs; you can throw the repo at your agent and have it explain them. Instructions for running it are in the README too if you want to try it. The three models are configurable, so you can swap in any opencode model.
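
Swapping contestants is basically just pointing the config at different model ids; the shape below is illustrative only, the actual file and key names are documented in the README:

```python
# Illustrative only; check the README for the real config file and keys.
CONTESTANTS = [
    "provider-a/model-one",   # any model id your opencode setup can resolve
    "provider-b/model-two",
    "provider-c/model-three",
]
```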

Current caveat: the harness has the output filename sandbox.py hardcoded across four scripts, so forking to a different single-file Python task is a sed-replace, and anything multi-file or non-Python needs deeper edits. Proper task parametrisation is on the roadmap for round 2.
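
Until that lands, forking to another single-file Python task means doing the sed-replace mentioned above; a Python equivalent would be something like this (the glob and filenames are guesses, not the repo's actual layout):

```python
from pathlib import Path

OLD, NEW = "sandbox.py", "todo_app.py"   # NEW = whatever your forked task should produce

for script in Path(".").glob("*.py"):    # or list the four harness scripts explicitly
    text = script.read_text()
    if OLD in text:
        script.write_text(text.replace(OLD, NEW))
        print(f"patched {script}")
```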

Repo: https://github.com/anfocic/open-bench

u/Full_Cost2909 — 11 days ago