BenchBench tests models against each other. The more grounded test pits models against the instruments professionals actually use.
Anthropic tested Opus 4.7 against ChemDraw on NMR. It matched ChemDraw on prediction and beat it on splitting patterns. That measures whether the model becomes the daily tool.
No Opus 4.7?
Costs growing quadratically :-)
😬
very interesting, and great ideas
Thanks!