The first part here makes sense: you’re saying you can train it in such a fashion that it avoids the issues of embedded agency during training (among other things) and then guarantee that the alignment will hold in deployment (when it must be an embedded agent almost by definition).
The second part I think I disagree with. Does the paper really “show that we can align AI to any goal we want”? That seems like an extremely strong statement.
Actually this sort of highlights what I mean by the dual use of ‘alignment’ here. You were talking about aligning a model with human values that will end up being deployed (and being an embedded agent) but then we’re using ‘align’ to refer to language model outputs.
The second part I think I disagree with. Does the paper really “show that we can align AI to any goal we want”? That seems like an extremely strong statement.
Yes, though admittedly I’m making some inferences here.
The point is that similar techniques can be used to align them, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and both are important to us.
One major point I did elide is the amount of power seeking involved: the niceness goal involves almost no power seeking, unlike the existential risk concerns we have.
But in some of the tests for alignment in Pretraining from Human Feedback, they showed that they can make models avoid taking certain power-seeking actions, like getting personally identifiable information.
In essence, it’s at least some evidence that, as AI gets more capable, we can make sure power-seeking actions are avoided when they’re misaligned with human interests.
I believe our disagreement stems from the fact that I am skeptical of the idea that statements made about contemporary language models can be extrapolated to apply to all existentially risky AI systems.
I definitely agree that some version of this is the crux, at least on how well we can generalize the result. I think it applies more generally than just to contemporary language models, and I suspect it applies to almost all AI that can use Pretraining from Human Feedback, which is offline training. So the crux is really how much we can expect an alignment technique to generalize and scale.