Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI’s predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem.
I read this as saying "GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that's a far harder goal". But in the case of GPT-4, it seems to me that this distinction is not very clear-cut. The model is useful to us precisely because, given its architecture, fulfilling a request largely consists of generating the text it predicts would satisfy that request; in that sense, "predicting" and "fulfilling" are basically the same operation.
It also seems to me that this distinction is not very clear-cut in humans, either: a significant part of how humans internalize moral values while growing up, for example, involves building up predictive models of how other people would react to one's actions, and then letting those models guide one's decision-making. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.
Of course, there’s a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.