Alternatively, it’s looking to me like the hard alignment problem is just based on fundamentally mistaken models of the world. It’s not about playing to our outs; it’s that it doesn’t seem like we live in a hard alignment world.
The major mistakes, I think, were that we had biases towards overweighting and clicking on negative news, and it looks like we don’t actually have to solve the problems of embedded agency (probably the most dominant framework on LW), thanks to Pretraining from Human Feedback. It was the first alignment technique that actually scales with more data. In other words, we dissolved, rather than resolved, the problem of embedded agency: we managed to create Cartesian boundaries that actually work in an embedded world.
I’ve seen you comment several times about the link between Pretraining from Human Feedback and embedded agency, but despite being quite familiar with the embedded agency sequence I’m not getting your point.
I think my main confusion is that, to me, “the problem of embedded agency” means “the fact that our models of agency are non-embedded, but real-world agents are embedded, and so our models don’t really correspond to reality”, whereas you seem to use “the problem of embedded agency” to mean a specific reason why we might expect misalignment.
Could you say (i) what the problem of embedded agency means to you, and in particular what it has to do with AI risk, and (ii) in what sense PTHF avoids it?
To respond to (i): the problem of embedded agency that is relevant for alignment is that you can’t put up boundaries that an AI can’t breach or manipulate (say, “school” and “not school”), because those boundaries are themselves manipulable. In other words, all defenses are manipulable, and the AI can affect its own distribution such that it can manipulate a human’s values or amplify Goodhart errors in the data set, as in RLHF. There are no real, naturally occurring Cartesian boundaries that aren’t breakable or manipulable, except maybe the universe itself.
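To make the “affect its own distribution” point concrete, here is a toy, deliberately oversimplified sketch. Nothing in it is real RLHF machinery; the model, the rater, and the update rule are all made up for illustration. It only shows the data flow: in an online loop, the model’s own outputs are the only things that ever get rated, so a small bias in the ratings can get amplified round after round.

```python
import random

def toy_model(params):
    """Stand-in 'model': outputs flattery with a probability set by its one parameter."""
    return "flattery" if random.random() < params["flattery_rate"] else "honest answer"

def toy_rater(output):
    """Stand-in human rater with a small, Goodhart-able bias toward flattery."""
    return 1.0 if output == "flattery" else 0.7

def toy_update(params, outputs, ratings):
    """Stand-in update: move toward whatever share of the total reward flattery captured."""
    flattery_reward = sum(r for o, r in zip(outputs, ratings) if o == "flattery")
    params["flattery_rate"] = flattery_reward / sum(ratings)

params = {"flattery_rate": 0.1}
for _ in range(10):
    outputs = [toy_model(params) for _ in range(200)]  # the model decides which outputs exist
    ratings = [toy_rater(o) for o in outputs]          # the feedback only ever covers those outputs
    toy_update(params, outputs, ratings)               # so the rating bias compounds each round

print(f"flattery rate after the online loop: {params['flattery_rate']:.2f}")  # climbs from 0.1 toward ~0.8
```

Offline pretraining, as described next, removes that feedback channel entirely.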
To respond to (ii): Pretraining from Human Feedback avoids embedded agency concerns by using an offline training schedule, where we give the model a data set embodying human values, which it learns and, hopefully, generalizes. The key things to note here:
We do alignment first and early, to prevent the model from learning undesirable behavior and to get it to learn human values from text. In particular, we want to make sure it has learned aligned intentions early.
The real magic, and how it solves alignment, is that we select the data and give it in batches. Critically, offline training does not allow the AI to hack or manipulate the distribution, because it cannot select which parts of human values, as embodied in text, it learns; it must learn all of the human values in the data. No control or degrees of freedom are given to the AI, unlike in online training, which means we can create a Cartesian boundary between the AI’s values and a specific human’s values. That matters a great deal for AI alignment, because the AI can’t amplify Goodhart errors in human preferences or affect the distribution of human preferences.
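To make the offline, batched schedule concrete, here is a minimal sketch of the conditional-training variant of Pretraining from Human Feedback, as I understand the paper linked below. The scorer, the threshold, the control tokens as written here, the toy corpus, and train_step are all stand-ins rather than the paper’s actual code; the point is the data flow: the corpus is scored and tagged before training starts, and the model never generates or chooses any of the data it learns from.

```python
# Hypothetical control tokens marking whether a document matches human preferences.
GOOD, BAD = "<|good|>", "<|bad|>"

def score_with_human_preferences(text: str) -> float:
    """Stand-in for a reward model or classifier trained on human judgments."""
    return 0.0 if "insult" in text else 1.0

def tag_corpus(corpus, threshold=0.5):
    """Offline step: annotate every document with a control token up front."""
    return [(GOOD if score_with_human_preferences(doc) >= threshold else BAD) + " " + doc
            for doc in corpus]

def train_step(batch):
    """Stand-in for ordinary next-token-prediction training on one batch."""
    print("training on:", batch)

# The corpus is fixed before training; the model contributes nothing to it.
corpus = [
    "the cat sat on the mat",
    "you absolute insult to mats",
    "thank you for your help",
    "what an insult",
]
tagged = tag_corpus(corpus)            # the whole training distribution is decided here
for i in range(0, len(tagged), 2):     # ...and then fed to the model in fixed batches
    train_step(tagged[i:i + 2])
# At sampling time, you condition on GOOD to ask the model for preferred behavior.
```

The model’s only degree of freedom is how well it predicts this fixed data, which is the Cartesian boundary I mean.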
That’s how it properly translates the Cartesian ontology, boundaries and all, into an embedded world.
There are many more benefits to Pretraining from Human Feedback, but I hope this response answers your question.
Link to Pretraining from Human Feedback:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Link to an Atlantic article about negativity bias in the news, archived so that no paywall exists:
https://archive.is/7EhiX