5 Comments

Great article!

I do take LSAT scores, Olympiad solving, etc. with a bag of salt due to data contamination. Can this be overcome by using an eval that was created after the model's training cutoff date? For example, using Olympiad questions from 2024 instead of 2020?

Great article! I think there is a lot of misunderstanding around the capacity of our current evals to actually "test" LLMs' reasoning abilities. This is a great resource to point people to for a high-level overview of the meta-problems.

Also, after reading this line, "as someone who is definitely in the very highest percentile of LLM usage, I still can’t easily say which LLMs to use for particular use cases", I would be interested to see which LLMs/stacks you use for various tasks.
