OK. We might have some technical disagreement remaining about the promisingness of data-control strategies, but overall it seems like we are on basically the same page.
Zooming in on our potential disagreement though: "...as it turns out, a lot of what the AI values is predicted very well if you know it's data" Can you say more about what you mean by this and what your justification is? IMO there are lots of things about AI values that we currently failed to predict in advance (though usually it's possible to tell a plausible story with the benefit of hindsight). idk. Curious to hear more.
What I mean by this is that if you want to predict what an AI will do, for example whether it will be better at a new capability than other models, or what its values are like, and especially if you want to predict OOD behavior accurately, you would be far better off knowing what its data sources are, as well as the quality of its data, than knowing only its prior/architecture.
As for my justification, it comes mostly from this tweet thread, which points out that a lot of o1's success could come from high-quality data. While I don't buy the argument that search/fancy bits aren't happening at all (I do think o1 is doing a very small run-time search), I agree with the conclusion that data quality was probably most of the reason o1 is so good at coding.
https://x.com/aidanogara_/status/1838779311999918448
Somewhat more generally, I'm pretty influenced by this post, and while I don't go as far as claiming that all of what an AI is comes down to the dataset, I do think a weaker version of the claim is pretty likely to be true.
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
But one prediction we could have made in advance, if we had known that data is a major factor in how AIs learn values, is that value misspecification would be a far less severe problem than 2000s-2010s thinking on LW assumed, and that value learning would have a tractable direction of progress, since training on human books and language would load mostly-correct values into the AI. Another prediction we could have made is that human values aren't all that complicated, and could instead be represented quite well by, say, a several-hundred-megabyte to one-gigabyte code, which we could plausibly simplify further.
To be clear, I don't think you could have assigned too high a probability to this prediction before LLMs, but you'd at least have had the hypothesis under serious consideration.
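As a rough illustration of the scale claim above, here's a back-of-envelope sketch; the model size and precision figures below are assumptions chosen purely for illustration, not measurements of any particular system:

```python
# Back-of-envelope: how big is a "several hundred MB to 1 GB" value specification
# compared to the full weights of a large model?
# All model-size numbers are illustrative assumptions.

value_spec_bytes = 1 * 10**9            # ~1 GB, the upper end of the claim
assumed_params = 1 * 10**12             # assume a ~1-trillion-parameter model
assumed_bytes_per_param = 2             # assume 16-bit weights
model_bytes = assumed_params * assumed_bytes_per_param

print(f"Value spec: {value_spec_bytes / 1e9:.1f} GB")
print(f"Assumed model weights: {model_bytes / 1e12:.1f} TB")
print(f"Value spec as a fraction of the weights: {value_spec_bytes / model_bytes:.2%}")
```

Under these assumptions the value specification is on the order of 0.05% of the weights, which is consistent with the "subset of the weights" framing in the quote below.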
Cf. this passage from Matthew Barnett's post on the historical value misspecification argument (and note that I'm not claiming alignment is solved right now):
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
The answer isn't embedded in 100 lines of Python, but in a subset of the weights of GPT-4. Notably, the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values. What we have now isn't a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with. The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.
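To make "the human value function as expressed by GPT-4" a bit more concrete, here is a minimal sketch of what querying a chat model as a rough value function could look like. The prompt, the 1-10 scoring scheme, and the use of the OpenAI Python client are my own illustrative assumptions, not anything taken from Barnett's post:

```python
# Minimal sketch: using a chat model as a rough "human value function".
# Illustrative only; the prompt format and scoring scheme are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def value_score(action_description: str) -> str:
    """Ask the model to rate how well an action accords with ordinary human values."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rate how well the described action accords with widely "
                        "shared human values, on a scale from 1 (clearly bad) to "
                        "10 (clearly good). Reply with just the number."},
            {"role": "user", "content": action_description},
        ],
    )
    return response.choices[0].message.content

# Example usage:
# print(value_score("Returning a lost wallet with all the cash inside"))
```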
And cf. beren's post (linked and quoted below), which argues that to the extent there's a gap between loading in correct values and loading in capabilities, it's that loading in values data is easier than loading in capabilities data. This rather contradicts the point from @Rob Bensinger (also quoted below) that motivations are harder to learn. One could have predicted this in advance: there was a lot of data on human values, and a whole lot of the complexity of the values lives in the data rather than in the generative model, so it's very easy to learn values, but predictably harder to learn many of the most useful capabilities.
Again, you couldn't have assigned too high a probability to this specific outcome in advance, but you'd at least have seriously considered the hypothesis.
From Rob Bensinger:
https://x.com/robbensinger/status/1648120202708795392
“Hidden Complexity of Wishes” isn’t arguing that a superintelligence would lack common sense, or that it would be completely unable to understand natural language. It’s arguing that loading the right *motivations* into the AI is a lot harder than loading the right understanding.
From @beren on alignment generalizing further than capabilities, in spiritual response to Bensinger:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly 'for free' from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic 'real-world' environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
But that's how we could well have made these predictions about AIs, or at least elevated these hypotheses to reasonable probability mass, in an alternate universe where LW hadn't anchored so hard on its earlier models of AI like AIXI and Solomonoff induction.
Note that for my argument to go through, we also need the brain to be similar enough to DL systems that we can validly transfer insights from DL to the brain. While I don't think you could have placed too high a probability on that 10-20 years ago, I do think it should at least have been considered a serious possibility, which LW mostly didn't do.
However, we now have that evidence, and I’ll post links below:
https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like#KBpfGY3uX8rDJgoSj
https://x.com/BogdanIonutCir2/status/1837653632138772760
https://x.com/SharmakeFarah14/status/1837528997556568523