That’s an interesting argument. However, something similar to your hypothetical explanation in footnote 6 suggests the following hypothesis: Most humans aren’t optimized by evolution to be good at abstract physics reasoning, even though they easily could have been, with evolutionarily small changes in hyperparameters. After all, Einstein wasn’t too dissimilar in training/inference compute and architecture from the rest of us. This explanation seems somewhat plausible, since highly abstract reasoning ability perhaps wasn’t very useful for most of human history.
(An argument in a similar direction is the existence of savant syndrome, which implies that quite small differences in brain hyperparameters can lead to strongly increased narrow capabilities of some form. These capabilities likely weren’t useful in the ancestral environment, which explains why humans generally don’t have them. The Einstein case suggests a similar phenomenon may also exist for more general abstract reasoning.)
If this is right, humans would be analogous to very strong base LLMs with poor instruction tuning, where the instruction tuning (for example) only involved narrow instruction-execution pairs that are more or less directly related to finding food in the wilderness, survival and reproduction. This would lead to bad performance at many tasks not closely related to fitness, e.g. on math benchmarks. The point is that a lot of the “raw intelligence” of the base LLM couldn’t be accessed just because the model wasn’t tuned to be good at diverse abstract tasks, even though it easily could have been, without a big change in architecture or training/inference compute.
But then it seems unlikely that artificial ML models (like LLMs) are or will be unoptimized for highly abstract reasoning in the same way evolution apparently didn’t “care” to make us all great at abstract physics and math style thinking, since AI models are indeed actively optimized in diverse abstract directions. This would make it unlikely to get a large capability jump (analogous to Einstein or von Neumann) just from tweaking the hyperparameters a bit, since those are probably pretty optimized already.
If this explanation is assumed to be true, it would mean we shouldn’t expect sudden large (Einstein-like) capability gains once AI models reach Einstein-like ability.
The (your) alternative explanation is that there is indeed at some point a phase transition at a certain intelligence level, which leads to big gains just from small tweaks in hyperparameters. Perhaps because of something like the “grokking cascade” you mentioned. That would mean Einstein wasn’t so good at physics because he happened to be, unlike most humans, “optimized for abstract reasoning”, but because he reached an intelligence level where some grokking cascade, or something like that, occurs naturally. Then indeed a similar thing could easily happen for AI at some point.
I’m not sure which explanation is better.
I guess for a cat classifier disentanglement is not possible, because it wants to classify things as cats if and only if it believes they are cats. Since values and beliefs are perfectly correlated here, there is no test we could perform which would distinguish what it wants from what it believes.
Though we could assume we don’t know what the classifier wants. If it doesn’t classify a cat image as “yes”, it could be because it is (say) actually a dog classifier, and it correctly believes the image contains something other than a dog. Or it could be because it is indeed a cat classifier, but it mistakenly believes the image doesn’t show a cat.
One way to find out would be to give the classifier an image of the same subject, but in higher resolution or from another angle, and check whether it changes its classification to “yes”. If it is a cat classifier, it is likely it won’t make the mistake again, so it will probably change its classification to “yes”. If it is a dog classifier, it will likely stay with “no”.
This assumes that mistakes are random and somewhat unlikely, so they will probably disappear when the evidence is better or of a different sort. Beliefs react to changes in evidence of that sort, while values don’t.
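To make the test concrete, here is a rough toy sketch in Python. Everything in it is my own illustrative assumption, not a real classifier: `make_classifier`, `diagnose`, the `error_rate` parameter and the majority-vote rule are all hypothetical. The classifier says “yes” exactly when it believes the image shows its target, and each fresh image of the same subject has an independent chance of being misperceived.

```python
import random


def make_classifier(target, error_rate, rng):
    """Hypothetical toy classifier (an assumption for illustration).

    It answers "yes" iff it *believes* the image shows `target` (its value).
    With probability `error_rate` it misperceives the subject as the other
    animal (a mistaken belief).
    """
    def classify(true_subject):
        believed = true_subject
        if rng.random() < error_rate:
            believed = "dog" if true_subject == "cat" else "cat"
        return "yes" if believed == target else "no"
    return classify


def diagnose(classify, true_subject, n_retests=25):
    """Re-show the same subject with fresh evidence and take the majority answer.

    Each call to `classify` stands in for a new image (other angle, higher
    resolution), so perception errors are drawn independently each time.
    """
    yes_votes = sum(classify(true_subject) == "yes" for _ in range(n_retests))
    # Mostly "yes" on retest: the earlier "no" was probably a mistaken belief.
    # Mostly "no" on retest: the classifier probably wants something else.
    return "mistaken belief" if yes_votes > n_retests / 2 else "different value"


if __name__ == "__main__":
    rng = random.Random(0)
    cat_clf = make_classifier("cat", error_rate=0.1, rng=rng)
    dog_clf = make_classifier("dog", error_rate=0.1, rng=rng)
    print("cat classifier:", diagnose(cat_clf, "cat"))
    print("dog classifier:", diagnose(dog_clf, "cat"))
```

The sketch only works under the assumption stated above: if errors were systematic rather than random (say, the classifier always misreads this particular cat), the retests would keep returning “no” and the two hypotheses would stay indistinguishable.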