Gary Marcus

Rank 11 of 47 | Score 83

--

@Simeon_Cps @brhydon Does anyone have that for older models?

4/12/2024, 12:36:07 PM

In reply to:

Siméon

@Simeon_Cps

·

659d

@GaryMarcus @brhydon The plot on GPQA, one of the only unsaturated benchmarks, would tell an exponential story

Gary Marcus

@GaryMarcus

·

659d

@brhydon that’s a fair point (re restriction of range at the top); do you have comparable data for the other measures?

Brydon Eastman

@brhydon

·

659d

@GaryMarcus Dr Marcus, respectfully, 100 is the max.... there's not a lot of space for exp growth in (87, 100]

MMLU and other evals are not perfect, they have plenty of noise in them. A 100 on that eval would inspire less confidence than, say, a 93 as it would just showcase Goodharting.

Gary Marcus

@GaryMarcus

·

659d

What happens when you plot GPT-2, 3, 4, and Turbo side-by-side?

Below I have plotted one common measure, MMLU, where there are easy-to-find data going back to GPT-2. (There may be others with data going back that far; this is just a first quick attempt.)

What I see is an…

Gary Marcus

@GaryMarcus

·

659d

Could we see GPT 3 and 3.5 and GPT 4 on the same plot? And Gemini Pro 1.5 and Claude Opus?

The statement is a technical request for information on older models' performance data and does not engage in public discourse. It is a part of a conversation among individuals discussing the specifics of machine learning model evaluations.