Well, it would certainly be nice if that were true, but all the interpretability research thus far has pointed out the opposite of what you seem to be taking it to. The only cases where the neural nets turn out to learn a crisp, clear, extrapolable-out-many-orders-of-magnitude-correctly algorithm, verified by interpretability or formal methods to date, are not deep nets. They are tiny, tiny nets either constructed by hand or trained by grokking (which appears to not describe at all any GPT-4 model, and it’s not looking good for their successors either). The bigger deeper nets certainly get much more powerful and more intelligent, but they appear to be doing so by, well, slapping on ever more bags of heuristics at scale. Which is all well and good if you simply want raw intelligence and capability, but not good if anything morally important hinges on them reasoning correctly for the right reasons, rather than heuristics which can be broken when extrapolated far enough or manipulated by adversarial processes.
We actually have a resolution to the thread on whether LLMs naturally learn algorithmic reasoning as they scale up with chain of thought (CoT), or whether they just reason with memorized bags of heuristics. The answer is that both are present: there is real reasoning, which indicates LLMs do use somewhat clean algorithms, but there is also a lot of heuristic reasoning involved.
So we both got some things wrong, but also got some things right.
The main thing I got wrong was underestimating how much CoT in current models still relies on fairly heavy memorization/bags of heuristics to get correct answers, which means I have to revise upward my estimate of the complexity of human values, given that LLMs didn't compress as well as I thought. The thing I got right was that sequential computation like CoT does incentivize actual, if noisy, reasoning/algorithms to appear, though I was wrong about the strength of the effect. I was also right to be concerned that the OthelloGPT network was very wide and shallow rather than deep and wide, which makes it harder to learn the correct algorithm.
The thread is below:
https://x.com/aksh_555/status/1843326181950828753
I wish someone were willing to do this for the o1 series of models as well.
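To make concrete the kind of analysis I have in mind, here is a minimal sketch (my own illustration, not something from the linked thread) of a length-generalization check: if a model's CoT implements a genuine addition algorithm, accuracy should degrade gracefully as operand length grows past what was common in training, whereas memorized heuristics tend to fall off sharply. The `ask_model` function is a hypothetical stub for whatever model API is being tested.

```python
import random

def ask_model(prompt: str) -> str:
    """Hypothetical stub for a call to the model under test
    (e.g. an API request returning the model's full reply)."""
    raise NotImplementedError

def addition_prompt(n_digits: int) -> tuple[str, int]:
    """Sample an n-digit addition problem and its ground-truth answer."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    prompt = f"Compute {a} + {b}. Think step by step, then give only the final number."
    return prompt, a + b

def accuracy_at_length(n_digits: int, n_trials: int = 50) -> float:
    correct = 0
    for _ in range(n_trials):
        prompt, answer = addition_prompt(n_digits)
        reply = ask_model(prompt)
        # Take the last integer-looking token in the reply as the model's answer.
        tokens = [tok.strip(".,;-") for tok in reply.replace(",", "").split()]
        numbers = [tok for tok in tokens if tok.isdigit()]
        if numbers and int(numbers[-1]) == answer:
            correct += 1
    return correct / n_trials

if __name__ == "__main__":
    # A genuine algorithm should hold up roughly as length grows;
    # a bag of memorized heuristics usually collapses well before the longest lengths.
    for n in (2, 5, 10, 20, 40):
        print(n, accuracy_at_length(n))
```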
One other relevant comment is here:
https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1#HqDWs9NHmYivyeBGk
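For context on what "bag of heuristics vs. clean algorithm" cashes out to methodologically in the OthelloGPT work: the standard move is to train linear probes on the residual stream to read out board state. A rough sketch of that probing step is below, with randomly generated placeholder arrays standing in for the real activations and board labels; high probe accuracy on real activations shows the board state is linearly decodable, but it doesn't by itself settle whether that state is computed by one clean algorithm or by many local heuristics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in the real setting these would be residual-stream
# activations of the model (n_positions x d_model) and the true state of
# one board square at each position (empty / mine / theirs).
rng = np.random.default_rng(0)
n_positions, d_model = 5000, 512
activations = rng.normal(size=(n_positions, d_model))
square_state = rng.integers(0, 3, size=n_positions)  # 0=empty, 1=mine, 2=theirs

X_train, X_test, y_train, y_test = train_test_split(
    activations, square_state, test_size=0.2, random_state=0
)

# One linear probe per board square; shown here for a single square.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~chance (0.33) on random data
```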
A key crux is that I think those heuristics actually go quite far, because it's much, much easier to learn a close-to-correct model of human values with simple heuristics, and to internalize the values in the training data as one's own, than it is to learn useful new capabilities. So even under a heuristic view of LLMs, where LLMs are basically always learning bags of heuristics and never actual algorithms, the heuristics for internalizing human values are simpler than the heuristics for learning capabilities, because it's easier to generate training data for human values than for any other capability.
See below for relevant points:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/

In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly 'for free' from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic 'real-world' environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
Good point, though, that the claim that current LLMs are definitely learning algorithms rather than just heuristics was not well supported by the current interpretability results/evidence. That said, I'd argue that o1-preview is mild evidence that more algorithmic/search-like components will start being used in AIs in the future (though to be clear, I believe the majority of its success comes from its training data being higher quality, and only fairly little from its runtime search).