The problem is that even if Quintin Pope’s model is wrong, there is other evidence contradicting the AI doom premise that Eliezer ignores, and I believe confirmation bias is at work here.
Also, any issues with Quintin Pope’s model are going to be subtle, not obvious, and there is a real difference between arguing against a mix of good and bad arguments and arguing against only bad arguments.
I think that this is a statement Eliezer does not believe is true, and which the conversations in the MIRI conversations sequence failed to convince him of. Which is the point: since Eliezer has already engaged in extensive back-and-forth with critics of his broad view (including the likes of Paul Christiano, Richard Ngo, Rohin Shah, etc), there is actually not much continued expected update to be found in engaging with someone else who posts a criticism of his view. Do you think otherwise?
What I was talking about is that Eliezer (and arguably the entire MIRI-sphere) ignored evidence that AI safety could actually work and doesn’t need entirely new paradigms. One of the best examples of such empirical work is Pretraining from Human Feedback.
The big improvements compared to other methods are:
It can avoid deceptive alignment because it gives a simple goal that’s myopic, completely negating the incentives for deceptively aligned AI.
It cannot affect the distribution it’s trained on, since it’s purely offline learning, meaning we can enforce an IID assumption, and enforce a Cartesian boundary, completely avoiding embedded agency. It cannot hack the distribution it has, unlike online learning, meaning it can’t unboundedly Goodhart the values we instill.
Increasing the data set aligns it more and more, essentially meaning we can trust the AI to stay aligned as it grows more capable, and to improve its alignment.
The goal found has a small capabilities tax.
There’s a post on it I’ll link here:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Now I don’t blame Eliezer for ignoring this piece specifically too much, as I think it didn’t attract much attention.
But the reason I’m mentioning this is that this is evidence against the worldview of Eliezer and a lot of pessimists who believe empirical evidence doesn’t work for the alignment field, and Eliezer and a lot of pessimists seem to systematically ignore evidence that harms their case.
Could you elaborate on what you mean by “avoid embedded agency”? I don’t understand how one avoids it. Any solution that avoids having to worry about it in your AGI will fall apart once it becomes a deployed superintelligence.
I think there’s a double meaning to the word “Alignment” where people now use it to refer to making LLMs say nice things and assume that this extrapolates to aligning the goals of agentic systems. The former is only a subproblem of the latter. When you say “Increasing the data set aligns it more and more, essentially meaning we can trust the AI to be aligned as it grows more capable, and improves its alignment”, I question whether we really have evidence that this relationship will hold indefinitely.
One of the issues with embedded agency is that you can’t reliably take advantage of the IID assumption, and in particular you can’t hold data fixed. You also have the issue of the AI potentially hacking the process, given its embeddedness, since before Pretraining from Human Feedback there wasn’t a way to translate Cartesian boundaries, or at least a subset of them, into the embedded universe.
The point here is that we don’t have to solve the problem, as it’s only a problem if we let the AI control the updating process, as in online training.
Instead, we give the AI a data set, and offline train it so that it learns what alignment looks like before we give it general capabilities.
In particular, we can create a Cartesian boundary between IID and OOD inputs that works in an embedded setting, and the AI has no control over the data set of human values, meaning it can’t gradient hack or reward hack the humans into having different values, or unboundedly Goodhart human values, which would undermine the project. This is another Cartesian boundary, though this one is the boundary between an AI’s values and a human’s values, and the AI can’t hack the human values if it’s offline trained.
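To make the offline-training claim concrete, here is a minimal sketch of how a fixed data set might be labeled before any training happens. It is loosely modeled on the conditional-training idea from the linked post, but everything specific is an assumption for illustration: the control-token strings, the `toy_score` keyword scorer, the threshold, and the helper names are all hypothetical and not taken from the paper's code. The only point it shows is structural: the labels are computed once from data the model never touches, and the result is immutable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical control tokens marking preferred vs. dispreferred text.
GOOD, BAD = "<|good|>", "<|bad|>"


@dataclass(frozen=True)  # frozen: a labeled example cannot be mutated later
class Example:
    text: str


def toy_score(text: str) -> float:
    """Stand-in for a human-preference scorer (assumption: a keyword check).

    Returns 0.0 for clean text and a negative number when blocked words appear.
    """
    blocklist = {"password", "ssn"}
    hits = sum(word.strip(".,:;") in blocklist for word in text.lower().split())
    return float(-hits)


def tag_dataset(
    texts: Sequence[str],
    score: Callable[[str], float],
    threshold: float,
) -> Tuple[Example, ...]:
    """Label every document once, up front, and return an immutable tuple.

    Nothing the model later generates is passed back through this function,
    which is what separates this offline setup from online training.
    """
    return tuple(
        Example(f"{GOOD if score(t) >= threshold else BAD}{t}") for t in texts
    )


corpus = ["The weather is mild today.", "Her SSN is on the form."]
dataset = tag_dataset(corpus, toy_score, threshold=0.0)

for example in dataset:
    print(example.text)
```

A model would then be trained on `dataset` with an ordinary next-token objective and conditioned on the `GOOD` token at sampling time. Whether that property survives deployment as an embedded agent is exactly what the rest of this thread disputes; the sketch only illustrates the training-time separation.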
I disagree, and I think I can explain why. The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want, so if we can reliably shift it towards niceness, then we have techniques to align our agents/simulators.
I don’t see how the bolded follows from the unbolded, sorry. Could you explain in more detail how you reached this conclusion?
The point is that similar techniques can be used to align them, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and important for us.
One major point I did elide is the amount of power seeking involved, since in the niceness goal, there’s almost no power seeking involved, unlike the existential risk concerns we have.
But in some of the tests for alignment in Pretraining from Human Feedback, they showed that they can make models avoid taking certain power seeking actions, like obtaining personally identifying information.
In essence, it’s at least some evidence that as AI gets more capable, we can make sure power seeking actions are avoided when they are misaligned with human interests.
The first part here makes sense: you’re saying you can train it in such a fashion that it avoids the issues of embedded agency during training (among other things) and then guarantee that the alignment will hold in deployment (when it must be an embedded agent almost by definition).
The second part I think I disagree with. Does the paper really “show that we can align AI to any goal we want”? That seems like an extremely strong statement.
Actually this sort of highlights what I mean by the dual use of ‘alignment’ here. You were talking about aligning a model with human values that will end up being deployed (and being an embedded agent) but then we’re using ‘align’ to refer to language model outputs.
Yes, though admittedly I’m making some inferences here.
The point is that similar techniques can be used to align them, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and important for us.
One major point I did elide is the amount of power seeking involved, since in the niceness goal, there’s almost no power seeking involved, unlike the existential risk concerns we have.
But in some of the tests for alignment in Pretraining from Human Feedback, they showed that they can make models avoid taking certain power seeking actions, like obtaining personally identifying information.
In essence, it’s at least some evidence that as AI gets more capable, we can make sure power seeking actions are avoided when they are misaligned with human interests.
I believe our disagreement stems from the fact that I am skeptical of the idea that statements made about contemporary language models can be extrapolated to apply to all existentially risky AI systems.
I definitely agree that some version of this is the crux, at least on how well we can generalize the result. I think it applies more generally than just to contemporary language models, and I suspect it applies to almost all AI that can use Pretraining from Human Feedback, which is offline training. So the crux is really how much we can expect an alignment technique to generalize and scale.