Gary Marcus

Rank 12 of 47 | Score 83 +1

@littmath @Wasgo @polynoamial The METR task time thing was a train wreck, as discussed in my Substack.

5/10/2025, 1:37:51 AM

In reply to:

Daniel Litt (@littmath) · 262d

@Wasgo @GaryMarcus @polynoamial Same, but nonetheless I think it’s consistent with a logarithmic scaling law, broadly construed.

Cameron Williams (@Wasgo) · 262d

@littmath @GaryMarcus @polynoamial You’d know more about the math benchmarks than I would, but on the generalized novel tasks for METR I found the results highly misleading as a graph.

Daniel Litt (@littmath) · 262d

@Wasgo @GaryMarcus @polynoamial IMO this is an overstatement, at least where math benchmarks are concerned, not to mention e.g. METR.

Cameron Williams (@Wasgo) · 262d

@littmath @GaryMarcus @polynoamial Right. But the problem isn’t the exact benchmark, it’s a question of whether or not we’re evaluating the model on novel problems, or ones within the dataset.

Newer models are still improving on problems within the dataset (no wall) but not consistently on novel ones (wall).

Daniel Litt (@littmath) · 262d

@Wasgo @GaryMarcus @polynoamial I don’t think it needs to be formal to be a claim we can try to evaluate heuristically by benchmarking. Lots of benchmarks seem to scale logarithmically with compute in some (pre-saturation) regime.

Cameron Williams (@Wasgo) · 262d

@littmath @GaryMarcus @polynoamial Is there any formal definition for this version of the law, though? Altman refers to intelligence, which isn’t actually a defined standard. He seems to cherry-pick what intelligence means every time he refers to it.

The statement is part of a discussion on the evaluation of mathematical benchmarks and models, specifically regarding the METR task. It critiques the presentation of results as misleading, indicating engagement with public discourse on the topic of AI model evaluation.

Principle 1:
I will strive to do no harm with my words and actions.
The statement refers to a 'train wreck' in the context of a task evaluation, which could be seen as harsh but is not directly harmful. It critiques the process rather than individuals, aligning with the principle of doing no harm.
Principle 4:
I will engage in constructive criticism and dialogue with those in disagreement and will not engage in personal attacks or ad hominem arguments.
The statement engages in criticism of the METR task evaluation, which is constructive as it points out perceived issues. It does not engage in personal attacks, adhering to the principle of constructive criticism. [+1]