best-of-n sampling which solved ARC-AGI
The low-compute configuration of o3, which aggregates only 6 samples, already improved a lot on the results of previous contenders; the plot of performance against problem size shows this very clearly. Is there a reason to suspect that the aggregation is best-of-n rather than consensus (picking the most popular answer)? Best-of-n would need an outcome reward model to pick a winner, and that model might have systematic errors worse than those of the generative model, since the ground truth sits in the verifiers anyway.
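To make the distinction concrete, here is a minimal sketch of the two aggregation schemes (the function names, toy answers, and the stand-in reward model are all my own illustration, not anything OpenAI has published):

```python
from collections import Counter
import random

def best_of_n(samples, reward_model):
    # Best-of-n: score every sample with an outcome reward model and
    # keep the top-scoring one; systematic reward-model errors
    # propagate straight into the final answer.
    return max(samples, key=reward_model)

def consensus(samples):
    # Consensus: keep the most popular answer; no reward model needed,
    # so only correlated generator errors can outvote the right answer.
    return Counter(samples).most_common(1)[0][0]

# Toy run with 6 sampled answers (stand-ins for ARC answer grids)
# and a deliberately unreliable, hypothetical reward model:
samples = ["grid_A", "grid_A", "grid_B", "grid_A", "grid_C", "grid_B"]
noisy_rm = lambda ans: random.random()  # error-prone scorer
print(best_of_n(samples, noisy_rm))  # whichever answer the scorer favors
print(consensus(samples))            # grid_A, the majority answer
```

The point of the toy run: consensus only fails when wrong answers are correlated enough to form a plurality, while best-of-n fails whenever the reward model's errors line up against the correct sample.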
There is water, H2O, drinking water, liquid, flood. Meanings can abstract away some details of a concrete thing in the real world, or add connotations that specialize it for a particular role. This is very useful for clear communication. The problem is sloppy or sneaky equivocation between different meanings, not that the content of a meaning involves emotions, connotations, or things not found in the real world, or combines them with concrete real-world things into compound meanings.