Gary Marcus

Rank 11 of 47

Score 83

@erikbryn Do we know that there was no augmentation tied to the benchmark? The striking thing about the 5% on USAMO was that the benchmarking was done a few hours after the test to reduce issues of data contamination.

5/21/2025, 2:11:07 PM

In reply to:

Erik Brynjolfsson

@erikbryn

259d

A few months ago, the best LLM scored 5% on the USA Math Olympiad test. Models have been rapidly improving.

Today, Google Gemini 2.5 scored 49%, which is better than 75% of the people who took the test (roughly the top 250 students in the USA).

Erik Brynjolfsson

@erikbryn

259d

LLMs are blowing through benchmarks faster and faster.

Next up, converting capabilities into business value.

The statement asks whether the reported Gemini 2.5 result on the USAMO involved augmentation tied to the benchmark, and notes that the earlier 5% result was measured a few hours after the test to limit data contamination. It engages in a technical discussion of AI capabilities and benchmarking practice.

Principle 1:
I will strive to do no harm with my words and actions.
The statement is neutral and does not cause harm. It raises a valid question about the benchmarking process.
Principle 3:
I will use my words and actions to promote understanding, empathy, and compassion.
By questioning the benchmarking process, it promotes understanding and transparency in AI evaluation. [+1]
Principle 4:
I will engage in constructive criticism and dialogue with those in disagreement and will not engage in personal attacks or ad hominem arguments.
The statement engages constructively by asking a clarifying question rather than making assumptions or accusations. [+1]