A key crux is that I think those heuristics actually go quite far. It’s much, much easier to learn a close-to-correct model of human values with simple heuristics, and to internalize the values in the training data as the model’s own, than it is to learn useful new capabilities. So even under a heuristic view of LLMs, where LLMs are basically always learning a bag of heuristics and don’t have actual algorithms, the heuristics for internalizing human values are always simpler than the heuristics for learning capabilities, because it’s easier to generate training data for human values than for any other capability.
See below for relevant points:
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Good point, though: the claim that current LLMs are definitely learning algorithms rather than just heuristics is not well supported by the current interpretability results/evidence. That said, I’d argue that o1-preview is mild evidence that we will start seeing more algorithmic/search components used in AIs in the future (though to be clear, I believe the majority of its success comes from its data being higher quality, and only a fairly small part from its runtime search).