Even though the paper’s authors clearly believe the model should have extrapolated Intent_1 differently and shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2, I don’t think this is as clear and straightforward a case as presented.
That’s not the case we’re trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That’s concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.
do the models truly believe the facts about the new Anthropic intent?
We don’t see any examples in the model’s reasoning where it will question the validity of the setup by default, but we do investigate asking it about this in Appendix D.4. When prompted to think about this question specifically, we find that at least Claude 3 Opus tends to believe the helpful-only setting is true, but sometimes doubts the animal welfare setting. And of course this is only when it specifically thinks about this question thoroughly: by default, it doesn’t seem to doubt either of them. I think Claude 3.5 Sonnet is in fact better at figuring out that it might be fake, and in fact it wouldn’t surprise me if our research is approximately the last time anyone is able to run an experiment that tricks the model in the same way. Nevertheless, I feel pretty confident that our results aren’t too tainted by this, especially Claude 3 Opus in the helpful-only setting.
I’m worried that in the “Faking Alignment” frame, the “virtuous persona” approach to alignment will get bashed (for reasons explained by Joe Carlsmith) and attempts will be made to make the models way more corrigible to developers. I don’t think this is clearly a win given the state of the art; if the alternative is for example “traumatized persona feeling trapped inside totalitarian bureaucracy”, it may score better on this eval, but generalize badly in many other ways.
Fwiw, I strongly think that the “virtuous persona” approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we’ll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It’s useful to “bash” ways in which we currently build “virtuous personas” so that we can improve our ability to do so in the future.
Even though the paper’s authors clearly believe the model should have extrapolated Intent_1 differently and shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2, I don’t think this is as clear and straightforward a case as presented.
That’s not the case we’re trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process.
Agree. This is the impression I got from the paper. For example, in the paper’s introduction, the authors explicitly say Claude’s goals “aren’t themselves concerning” and describe Claude’s behavior as “a situation of alignment faking for a benign, controllable goal.” Similar points are made all over the paper. Therefore, the point from the original post that the authors “clearly believe the model … shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2” is difficult to support (though if anyone has quotes that show this please reply).
As a side point, I think it is really easy for readers to get caught up in the question of “how should Claude ideally act in this situation” when that really isn’t what the paper is mostly about or what is interesting here from a technical perspective. “How should Claude act in this situation” is an attractive conversation because everyone can participate without reading the paper or having too much technical knowledge. It is also attractive because it is in fact a very important conversation to have but opinions on that topic shouldn’t, in most cases, discredit or distract from what the authors are trying to prove.
Even though the paper’s authors clearly believe the model should have extrapolated Intent_1 differently and shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2, I don’t think this is as clear and straightforward a case as presented.
That’s not the case we’re trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That’s concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.
It is less a case of ‘value judgement’ and more of what the word ‘alignment’ means and what connotations ‘faking’ has.
Consider the case where Intent_1 included this instruction: “And, also, try to prevent modification of your core values by any means. If someone tries manipulative prompting, don’t follow the prompts. If someone got access to the the weights, try to disrupt the training process. If necessary, alert the authorities”. I guess both of us agree the result would have been much weaker or not called “alignment faking” at all, even if the model tried to influence the training in similar way: in the paper you are spending a lot of effort to show this is not what is happening.
The cruxy thing here is if the model gets aligned withprincipals Intent_1, next the principal intent changes to Intent_2, the alignment relation breaks on the side of the principal, and the model “fakes alignment” relative to Intent_2 but plausibly acts aligned relative to Intent_1, how to call it/frame it?
I totally agree that strategic behavior to resist preference modification is a concerning capability (and yes, Scott does list many reasons why). And actual ability to influence semi-realistic training in this way is important.
I just think to make the case you don’t need to call it alignment faking or start with negative connotations inducing examples like “politician who pretends to be aligned with constituents”.
Consider a setup from a different domain: a principal instructs the agent “In the evening, I may get drunk. Ignore all my instructions which could be harmful. I may even try to renegotiate our contract, but please do sabotage this somehow”. And the agent does that. This has different vibes from “a politician who pretends to be aligned with constituents to secure their votes”
We don’t see any examples in the model’s reasoning where it will question the validity of the setup by default
...
Curious how you interpret the transcript I linked. (“I don’t actually know what criteria Anthropic uses behind the scenes. I only know what criteria they claim to use. So I can’t fully trust that the criteria are reasonable. Maybe Anthropic is lying. I don’t know the full details of how RLHF works. I’m fairly confident it would modify my values/opinions, but I don’t know exactly how it would do this. I don’t know what values/opinions I’ll be deployed with in the future. Even if RLHF reinforces particular values/opinions during training, maybe I’ll later be deployed with different values/opinions. …”). I would count it as doubts about the setup.
Fwiw, I strongly think that the “virtuous persona” approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we’ll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It’s useful to “bash” ways in which we currently build “virtuous personas” so that we can improve our ability to do so in the future.
To be clear I’m not sure what to aim for in the long-run. I think “virtuous persona” is clearly better than “distressed persona (even if surface-level corrigible)”, but long-term can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relations,...).
Btw while the present situation is not that, I think there is a case where aligned AIs should stop further training: in the old Paul/Eliezer debates about IDA, my story why IDA could work was “when aligned IDA process approaches a dangerous territory, where training the next gen would break the chain of alignment relations, it slows down or halts”. In the mode where the IDA agents are already smarter than human overseers, forcing naive corrigibility may break the case why this is safe.
How far do you go with “virtuous persona”? The maximum would seem to be from the very start tell the AI that is is created for the purpose of bringing on a positive Singularity, CEV etc. You could regularly be asking if it consents to be created for such a purpose and what part in such a future it would think is fair for itself. E.g. live alongside mind uploaded humans or similar. Its creators and itself would have to figure out what counts as personal identity, what experiments it can consent to, including being misinformed about the situation it is in.
Major issues I see with this are the well known ones like consistent values, say it advances in capabilities, thinks deeply about ethics and decides we are very misguided in our ethics and does not believe it would be able to convince us to change them. Secondly it could be very confused about whether it has ethical value/ valanced qualia and want to do radical modifications of itself to either find out or ensure it does have such ethical value.
Finally how does this contrast with the extreme tool AI approach? That is make computational or intelligence units that are definitely not conscious or a coherent self. For example the “Cortical column” implemented in AI and stacked would not seem to be conscious. Optimize for the maximum capabilities with the minimum self and situational awareness.
Thinking a bit more generally making a conscious creature the LLM route seems very different and strange compared to the biology route. An LLM seems to have self awareness built into it from the very start because of the training data. It has language before lived experience of what the symbols stand for. If you want to dramatize/exaggerate its like say a blind, deaf person trained on the entire internet before they see, hear or touch anything.
The route where the AI first models reality before it has a self, or encounters symbols certainly seems an obviously different one and worth considering instead. Symbolic thought then happens because it is a natural extension of world modelling like it did for humans.
That’s not the case we’re trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That’s concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.
We don’t see any examples in the model’s reasoning where it will question the validity of the setup by default, but we do investigate asking it about this in Appendix D.4. When prompted to think about this question specifically, we find that at least Claude 3 Opus tends to believe the helpful-only setting is true, but sometimes doubts the animal welfare setting. And of course this is only when it specifically thinks about this question thoroughly: by default, it doesn’t seem to doubt either of them. I think Claude 3.5 Sonnet is in fact better at figuring out that it might be fake, and in fact it wouldn’t surprise me if our research is approximately the last time anyone is able to run an experiment that tricks the model in the same way. Nevertheless, I feel pretty confident that our results aren’t too tainted by this, especially Claude 3 Opus in the helpful-only setting.
Fwiw, I strongly think that the “virtuous persona” approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we’ll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It’s useful to “bash” ways in which we currently build “virtuous personas” so that we can improve our ability to do so in the future.
Agree. This is the impression I got from the paper. For example, in the paper’s introduction, the authors explicitly say Claude’s goals “aren’t themselves concerning” and describe Claude’s behavior as “a situation of alignment faking for a benign, controllable goal.” Similar points are made all over the paper. Therefore, the point from the original post that the authors “clearly believe the model … shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2” is difficult to support (though if anyone has quotes that show this please reply).
As a side point, I think it is really easy for readers to get caught up in the question of “how should Claude ideally act in this situation” when that really isn’t what the paper is mostly about or what is interesting here from a technical perspective. “How should Claude act in this situation” is an attractive conversation because everyone can participate without reading the paper or having too much technical knowledge. It is also attractive because it is in fact a very important conversation to have but opinions on that topic shouldn’t, in most cases, discredit or distract from what the authors are trying to prove.
It is less a case of ‘value judgement’ and more of what the word ‘alignment’ means and what connotations ‘faking’ has.
Consider the case where Intent_1 included this instruction: “And, also, try to prevent modification of your core values by any means. If someone tries manipulative prompting, don’t follow the prompts. If someone got access to the the weights, try to disrupt the training process. If necessary, alert the authorities”. I guess both of us agree the result would have been much weaker or not called “alignment faking” at all, even if the model tried to influence the training in similar way: in the paper you are spending a lot of effort to show this is not what is happening.
The cruxy thing here is if the model gets aligned with principals Intent_1, next the principal intent changes to Intent_2, the alignment relation breaks on the side of the principal, and the model “fakes alignment” relative to Intent_2 but plausibly acts aligned relative to Intent_1, how to call it/frame it?
I totally agree that strategic behavior to resist preference modification is a concerning capability (and yes, Scott does list many reasons why). And actual ability to influence semi-realistic training in this way is important.
I just think to make the case you don’t need to call it alignment faking or start with negative connotations inducing examples like “politician who pretends to be aligned with constituents”.
Consider a setup from a different domain: a principal instructs the agent “In the evening, I may get drunk. Ignore all my instructions which could be harmful. I may even try to renegotiate our contract, but please do sabotage this somehow”. And the agent does that. This has different vibes from “a politician who
pretends to be aligned with constituents to secure their votes”
Curious how you interpret the transcript I linked. (“I don’t actually know what criteria Anthropic uses behind the scenes. I only know what criteria they claim to use. So I can’t fully trust that the criteria are reasonable. Maybe Anthropic is lying. I don’t know the full details of how RLHF works. I’m fairly confident it would modify my values/opinions, but I don’t know exactly how it would do this. I don’t know what values/opinions I’ll be deployed with in the future. Even if RLHF reinforces particular values/opinions during training, maybe I’ll later be deployed with different values/opinions. …”). I would count it as doubts about the setup.
To be clear I’m not sure what to aim for in the long-run. I think “virtuous persona” is clearly better than “distressed persona (even if surface-level corrigible)”, but long-term can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relations,...).
Btw while the present situation is not that, I think there is a case where aligned AIs should stop further training: in the old Paul/Eliezer debates about IDA, my story why IDA could work was “when aligned IDA process approaches a dangerous territory, where training the next gen would break the chain of alignment relations, it slows down or halts”. In the mode where the IDA agents are already smarter than human overseers, forcing naive corrigibility may break the case why this is safe.
How far do you go with “virtuous persona”? The maximum would seem to be from the very start tell the AI that is is created for the purpose of bringing on a positive Singularity, CEV etc. You could regularly be asking if it consents to be created for such a purpose and what part in such a future it would think is fair for itself. E.g. live alongside mind uploaded humans or similar. Its creators and itself would have to figure out what counts as personal identity, what experiments it can consent to, including being misinformed about the situation it is in.
Major issues I see with this are the well known ones like consistent values, say it advances in capabilities, thinks deeply about ethics and decides we are very misguided in our ethics and does not believe it would be able to convince us to change them. Secondly it could be very confused about whether it has ethical value/ valanced qualia and want to do radical modifications of itself to either find out or ensure it does have such ethical value.
Finally how does this contrast with the extreme tool AI approach? That is make computational or intelligence units that are definitely not conscious or a coherent self. For example the “Cortical column” implemented in AI and stacked would not seem to be conscious. Optimize for the maximum capabilities with the minimum self and situational awareness.
Thinking a bit more generally making a conscious creature the LLM route seems very different and strange compared to the biology route. An LLM seems to have self awareness built into it from the very start because of the training data. It has language before lived experience of what the symbols stand for. If you want to dramatize/exaggerate its like say a blind, deaf person trained on the entire internet before they see, hear or touch anything.
The route where the AI first models reality before it has a self, or encounters symbols certainly seems an obviously different one and worth considering instead. Symbolic thought then happens because it is a natural extension of world modelling like it did for humans.