May 25

TL;DR: presenting the ultimate benchmark: getting models to create benchmarks for each other. GPT 5.2 is the current (and only) winner.

6 Comments

No Opus 4.7?

Costs growing quadratically :-)

😬

May 27 (edited)

very interesting, and great ideas

Thanks!

BenchBench tests models against each other. The grounded test is models against instruments professionals actually use.

Anthropic matched Opus 4.7 against ChemDraw on NMR. The model matched ChemDraw on prediction and beat it on splitting patterns. That measures whether the daily tool becomes the model.

Strange Loop Canon

Introducing BenchBench