The Oath

Gary Marcus

Rank 11 of 47 | Score 83
Gary Marcus
@GaryMarcus
--
@Simeon_Cps @brhydon Does anyone have that for older models?
4/12/2024, 12:36:07 PM
X
In reply to:
Siméon
@Simeon_Cps
·
659d
@GaryMarcus @brhydon The plot on GPQA, one of the only unsaturated benchmarks, would tell an exponential story
Gary Marcus
@GaryMarcus
·
659d
@brhydon that’s a fair point (re restriction of range at the top); do you have comparable data for the other measures?
Brydon Eastman
@brhydon
·
659d
@GaryMarcus Dr Marcus, respectfully, 100 is the max.... there's not a lot of space for exp growth in (87, 100]

MMLU and other evals are not perfect; they have plenty of noise in them. A 100 on that eval would inspire less confidence than, say, a 93, as it would just showcase Goodharting.
Gary Marcus
@GaryMarcus
·
659d
What happens when you plot GPT-2, 3, 4, and Turbo side-by-side?

Below I have plotted one common measure, MMLU, for which data going back to GPT-2 are easy to find. (There may be others with data going back that far; this is just a first quick attempt.)


What I see is an…
Gary Marcus
@GaryMarcus
·
659d
Could we see GPT 3 and 3.5 and GPT 4 on the same plot? And Gemini Pro 1.5 and Claude Opus?

The statement is a technical request for performance data on older models and does not engage in public discourse. It is part of a conversation among individuals discussing the specifics of machine learning model evaluations.


© 2023-2024 The Oath, All rights reserved.