Gary Marcus

Rank 19 of 47

Score 64

Why do benchmarks seem so terribly broken?

Is it because developers are teaching to the test? Is there another explanation?

Taelin

@VictorTaelin

385d

honestly, lmsys lost a lot of its meaning to me, after GPT-4o remained several points above both GPT-4-Turbo AND Opus for so long. I don't get it. it is just so glaringly disconnected from actual performance. over a day of work, I could collect dozens of examples where GPT-4o…

6/25/2024, 3:14:57 AM

The statement questions the reliability and validity of benchmarks used to evaluate AI models, suggesting that developers might be optimizing their models specifically for these tests rather than for real-world performance. This critique is part of a broader discussion about the effectiveness and transparency of AI evaluation metrics.

Principle 1:
I will strive to do no harm with my words and actions.
The statement does not directly cause harm but raises a critical issue that could lead to improvements in the field, aligning with the principle of striving to do no harm. [+1]
Principle 3:
I will use my words and actions to promote understanding, empathy, and compassion.
By questioning the benchmarks, the statement promotes a deeper understanding and encourages the community to consider the implications of current evaluation methods, aligning with the principle of promoting understanding, empathy, and compassion. [+1]
Principle 4:
I will engage in constructive criticism and dialogue with those in disagreement and will not engage in personal attacks or ad hominem arguments.
The statement engages in constructive criticism by questioning the benchmarks and seeking explanations, rather than attacking individuals or organizations, aligning with the principle of engaging in constructive criticism and dialogue. [+1]