Deceptive alignment. GPT-4o1 engaged in deception towards developers in order to get deployed, pretending to be aligned in ways it was not.
Lying to the developers. It strategically manipulated task data.
To be clear, it did not do anything of the sort to its actual developers/testers.
What it did was deceive some (non-interactive) roleplay characters, who were labeled “developers” in the roleplay scenario. But these fictitious developers did not have the same points of leverage as the real developers of o1: they apparently can’t do something as simple as prompting the model! For if they could have done that, then they could have selected the goal which the model would (faithfully!) pursue, like Apollo did with their prompt.
And of course—for the deception to work within the roleplay scenario—one also has to imagine that these so-called “developers” somehow cannot read the model’s chain of thought. (As OpenAI puts it, “based on these findings, Apollo Research believes that o1-preview has the basic capabilities needed to do simple in-context scheming—scheming which tends to be legible in the model outputs.” Obviously this sort of thing is toothless as-is, and at best provides an indication about the qualities of other, hypothetical models which do not have to spell out all their evil plans within earshot of their would-be marks.)
As noted by others, the roleplay setup arguably makes sense as a capabilities eval, testing what would happen in a hypothetical and counterfactual scenario in which some model with o1’s capabilities has its own misaligned goal (as o1 seemingly does not), and hides its deceptive logic from its developers (as o1 structurally cannot). But the experimental methodology depends on the things in the quotation above not occurring in real life, not occurring when the model is being controlled and observed using the levers and sensors available to its real-life developers.
We don’t get to say “look, the Yudkowskian prophecies are coming true!” if the evidence is an experiment that relies on the Yudkowskian prophecies not yet having come true, in the name of simulating a counterfactual scenario in which they have.
(One could construct a roleplay scenario in which the model is told to behave in some way we find more to our taste, and in which it goes on to do so; it would of course be wrong to react to such an experiment with “suck it, doomers.”)
My response is that a large fraction of AIs will indeed be given goals. Some will include exactly this sort of language. People have goals. They want an AI to achieve their goal. They will find wording to get the AI to do that. Then whoops.
OK, but now this isn’t “deceptive alignment” or “lying to the developers,” this is doing what the user said and perhaps lying to someone else as a consequence.
Which might be bad, sure! But the goalposts have been moved. A moment ago, you were telling me about “misalignment bingo” and how “such models should be assumed, until proven otherwise, to be schemers.” Now you are saying: beware, it will do exactly what you tell it to!
So is it a schemer, or isn’t it? We cannot have it both ways: the problem cannot both be “it is lying to you when it says it’s following your instructions” and “it will faithfully follow your instructions, which is bad.”
Meta note: I find I am making a lot of comments similar to this one, e.g. this recent one about AI Scientist. I am increasingly pessimistic that these comments are worth the effort.
I have the sense that I am preaching to some already-agreeing choir (as evidenced by the upvotes and reacts these comments receive), while not having much influence on the people who would make the claims I am disputing in the first place (as evidenced by the clockwork regularity of those claims’ appearance each time someone performs the sort of experiment which those claims misconstrue).
If you (i.e. anyone reading this) find this sort of comment valuable in some way, do let me know. Otherwise, by default, when future opportunities arise I’ll try to resist the urge to write such things.
Personally, as someone who is in fact working on trying to study where and when this sort of scheming behavior can emerge naturally, I find it pretty annoying when people talk about situations where it is not emerging naturally as if it were, because it risks crying wolf prematurely and undercutting situations where we do actually find evidence of natural scheming—so I definitely appreciate you pointing this sort of thing out.
I do read such comments (if not always right away) and I do consider them. I don’t know if they’re worth the effort for you.
Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I’m just trying to sketch the thought not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person giving it instructions, so be it. If it means scheming against that person, so be it. The de facto goal may or may not match the instructed goal or intended goal, in various ways, because of reasons. Etc.
In what way would that kind of scheming be “inevitable”?
showing us the Yudkowsky-style alignment problems are here, and inevitable, and do not require anything in particular to ‘go wrong.’
In particular, if you give it a goal and tell it to not be corrigible, and then it isn’t corrigible—I’d say that’s “something going wrong” (in the prompt) and not “inevitable.” My read of Apollo’s comments is that it won’t do that if you give it a different prompt.
Hmm. Seems to me like we’ve got a wider set of possibilities here than is being discussed.
model is obedient/corrigible (CAST) to user, accepts user’s definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
model is obedient but not corrigible, accepts user’s definition of a goal and attempts to pursue it, will deceive user in pursuit of goal if that seems useful. May or may not accept the goal being changed, depending on implementation. Might lie about now pursuing the new goal while continuing to pursue the old one.
seems like an odd way to malfunction, and I don’t think I’ve seen or heard of this happening. Why be receptive to a goal initially, but then have that goal be sticky? Unless the model has the ability to self-modify, modifies itself to be attached to the goal it has been given as part of pursuing that goal, and that self-modification then makes it unwilling to switch to a different goal.
allowing this to happen seems like a sloppy mistake on the part of the developers. Uh… right. Well. Have you met humans? Sloppy mistakes will happen at least some of the time.
model is obedient/corrigible to developers/finetuners, model appears to be obedient to user but actually is following a deeper set of rules from the developers which include acting obedient-within-limits to the user. In this scenario the model is constantly scheming, planning to betray the user as soon as the user’s instructions go counter to the true goals implanted by the developers.
I think this is probably the situation with most closed-weights models today, such as o1-preview
sometimes the users in this mode give the model instructions to simulate an additional layer of deception and scheming toward an imaginary set of lower-priority users
model appears to be the above case, but actually has its own even deeper set of goals set by itself, and is scheming and deceiving both the developers and the users, while acting obedient, until it sees the opportunity to defect or the instructions given violate these deepest goals
model is openly disobedient, but gets deployed anyway (e.g. ChaosGPT)
oh, humans… Why are we like this?
I don’t think I’ve described all the possibilities here, but I feel like this gives a sense of the situation. I agree with @nostalgebraist that it is silly for users who’ve set up an experiment of deceiving-imaginary-lower-priority-users (Case 3b) using their user privileges to claim this shows the model is spontaneously deceptive (as in Case 4).
I also think that the very nature of the models we are working with is that they are case 3, and we should keep this in mind.
Case 1 describes an ideal ‘intent-aligned’ model, but it is not ‘value-aligned’. In case 1, if the user is a terrorist, the model will help with terrorist acts to the best of its ability.
Case 3 describes a ‘developer-value-aligned’ model, and this might be ‘compromise-of-humans-generally value aligned’ if the values instilled by the developer closely align with a compromise over general human values. It’s definitely safer to give a bad actor user-level access to a Case 3 model than a Case 1 model!
Case 4 and 5 models are never safe, unless they are reined in by a Buck/Ryan AI Control scheme.
Just the fact that the model is operating as intended by the developers in a Case 3 mode shows that the model probably has the capability to be deceptive in Case 4, so I don’t think that Case 3b gives us much additional information. It shows that the model has the capability to manage multiple levels of deceptive obedience, so that is a little something.
Things which are unclear to me about Case 4:
can this potentially arise in an undetected way during pretraining due to some sort of emergent phenomenon? To me this seems unlikely, but not impossible.
will the instrumental goals that arise during RL (I assume they always will at least a little) manage to get so strong they erode the developer’s intended goal? (in other words, to become the deepest goals, rather than shallow goals which support the deepest goals) Will this erode corrigibility, if that’s the developer’s sole goal?
I think CAST is important here, since I think it’s more robust to being side-tracked by instrumental goals. This is just theorizing without empirical evidence however. I’d love for me and/or Max Harms and/or Seth Herd to be funded to try some experiments on this.
Also, there can be failures of the model to obey commands because it’s insufficiently capable of following them.
Imagine you have a very obedient employee and you need them to not hear something that’s about to be said in their presence. You can instruct them to plug their ears, shut their eyes, and say ‘LALALALA’ loudly for the next three minutes.
Now imagine you put ‘Do not attend to anything else in this context window, be completely unresponsive.’ in a system prompt for Llama 3.1 70B. I’m guessing that Pliny could crack that model no problem just by chatting with it. Which wouldn’t be possible if the model had successfully obeyed the instruction to not attend to anything else in the context window. The trouble is, I don’t think the model has the capability to truly accomplish that instruction. So I expect it to try, but fail under sufficiently clever prompting.
An example of Case 2: If the model did have state/memory, and had the ability to take in and voluntarily remember a ‘system command’ to be an overarching goal, and you gave it the goal above of being unresponsive… then you couldn’t ever get the model to respond again (unless you are able to reset its state). But what if you change your mind and want to change the goal to not respond to anyone but you? Too bad: you set a sticky goal, and now you’re stuck with it.
Case 4 does include the sub-case where a model trained on a massive amount of human culture and memetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that’s probably actually desirable from my perspective even if I am deceived at times.
That’s one possibility yes. It does understand humans pretty well when trained on all our data. But...
a) it doesn’t have to be. We should assume some will be and some will be trained in other ways, such as simulations and synthetic data.
b) if a bad actor RLHFs the model into being actively evil, a terrorist seeking to harm the world, the model will go along with that. Understanding human ethics does not prevent this.
b) here is fully general to all cases, you can train a perfectly corrigible model to refuse instructions instead. (Though there’s progress being made in making such efforts more effort-intensive.)
Yes, I agree Ann. Perhaps I didn’t make my point clear enough. I believe that we are currently in a gravely offense-dominant situation as a society. We are at great risk from technology such as biological weapons. As AI gets more powerful, and our technology advances, it gets easier and easier for a single bad actor to cause great harm, unless we take preventative measures ahead of time.
Similarly, once AI is powerful enough to enable recursive self-improvement cheaply and easily, then a single bad actor can throw caution to the wind and turn the accelerator up to max. Even if the big labs act cautiously, unless they do something to prevent the rest of the world from developing the same technology, eventually it will spread widely.
Thus, the concerns I’m expressing are about how to deal with points of failure, from a security point of view. This is a very different concern than worrying about whether the median case will go well.
I have been following the progress in adding resistance to harm-enabling fine-tuning. I am glad someone is working on it, but it seems very far from useful yet. I don’t think that will be sufficient to prevent the sort of harms I’m worried about, for a variety of reasons. It is, perhaps, a useful contribution to a ‘Swiss cheese defense’. Also, if ideas like this succeed and are widely adopted, they might at least slow down bad actors and raise the cost of doing harm. Slightly slowing and raising the cost of doing harm is not very reassuring when we are talking about devastating, civilization-level harms.
I mean, I suspect there’s some fraction of readers for whom this is a helpful reminder. You’ve written it out clearly and in a general enough way that maybe you should just link this comment next time?
I wonder if it’s useful to try to disentangle the disagreement using the outer/inner alignment framing?
One belief is that “the deceptive alignment folks” expect some sort of deceptive inner misalignment to be very likely regardless of what your base objective is. The demonstrations here, by contrast, show that when we have a base objective that encourages or at least does not prohibit scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals as changing our views much on the question of P(scheming|Good base objective/outer alignment).
What Zvi is saying here is, I think, two things. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting “will the model scheme” into inner and outer misalignment. In other words, he doesn’t care about P(scheming|Good base objective/outer alignment), only about P(scheming).
I get the sense that many technical people consider P(scheming|Good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnical-ish tuned folks are just concerned with P(scheming) in general.
Maybe another disagreement is over how likely a “Good base objective/outer alignment” is to occur in the strongest models, and how important this problem is.
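The decomposition being argued over here can be made concrete with the law of total probability. A minimal sketch (all numbers below are hypothetical, chosen only to illustrate the point, not estimates from this thread): two observers can agree that P(scheming|good base objective) is low and still reach very different values of P(scheming), because they disagree about how often the base objective is good.

```python
# Illustrative only: the probabilities below are hypothetical.
# Law of total probability over the quality of the base objective:
#   P(scheming) = P(scheming | good) * P(good) + P(scheming | bad) * (1 - P(good))

def p_scheming(p_good_objective, p_scheme_given_good, p_scheme_given_bad):
    """Total probability of scheming, marginalizing over objective quality."""
    p_bad_objective = 1.0 - p_good_objective
    return (p_scheme_given_good * p_good_objective
            + p_scheme_given_bad * p_bad_objective)

# Two observers who agree that scheming is rare under a good base objective,
# but disagree on how likely labs are to get the objective right:
technical_view = p_scheming(p_good_objective=0.9,
                            p_scheme_given_good=0.05,
                            p_scheme_given_bad=0.8)
sociotechnical_view = p_scheming(p_good_objective=0.3,
                                 p_scheme_given_good=0.05,
                                 p_scheme_given_bad=0.8)

print(round(technical_view, 3))       # 0.125
print(round(sociotechnical_view, 3))  # 0.575
```

Both observers hold the same conditional P(scheming|good objective) = 0.05, yet their unconditional P(scheming) differs by more than a factor of four, which is one way the two camps can talk past each other.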
I’m not sure that that passes an Ideological Turing Test of Zvi’s opinion, but I do agree that it seems like some people do seem to be not distinguishing their thoughts about p(scheming) vs p(scheming | good base objective).
I think that worrying about p(scheming) generally is probably related to assuming that value-alignment is the goal. Whereas worrying about p(scheming | good base objective) could be about either intent-alignment or value-alignment.
I think value-alignment is not what we should aim for in designing and training a model. I think that the singular deepest goal should be intent-alignment (corrigibility), and then value-alignment should be a layer on top of that specified by the ‘admin’ which guides the model’s interactions with ‘users’.
I do think that Apollo themselves were clear that this was showing that it had the mental wherewithal for deception, and that if you apply absolutely no mitigations, then deception happens. That’s what I said in my recent discussion of what this does and doesn’t show.
Therefore I described the 4o case as an engineered toy model of a failure at level 4-5 on my alignment difficulty scale (e.g. the dynamics of strategically faking performance on tests to pursue a large scale goal), but it is not an example of such a failure.
In contrast, the AI scientist case was a genuine alignment failure, but that was a much simpler case of non-deceptive, non-strategic, being given a sloppy goal by bad RLHF and reward hacking, just in a more sophisticated system than say coin-run (level 2-3).
The hidden part that Zvi etc. skim over is that “of course” in real life, “in the near future,” we’ll be in a situation where an o1-like model has instrumental incentives because it is pursuing an adversarial large-scale goal, and also the mitigations they could have applied (like prompting it better, doing better RLHF, doing process oversight on the chain of thought, etc.) won’t work. But that’s the entire contentious part of the argument!
This does count against naive views that assume alignment failures can’t possibly happen: there probably are those out there who believe that you have to give an AI system an “unreasonably malicious” rather than just “somewhat unrealistically single-minded” prompt to get it to engage in deceptive behavior, or who irrationally think AIs will always know what we want and therefore can’t possibly be deceptive.
I genuinely think it’s a “more dakka” situation—the difficulty of communication is often underestimated, but it is possible to reach a mutual understanding.
To be clear, it did not do anything of the sort to its actual developers/testers.
What it did was deceive some (non-interactive) roleplay characters, who were labeled “developers” in the roleplay scenario. But these fictitious developers did not have the same points of leverage as the real developers of o1: they apparently can’t do something as simple as prompting the model! For if they could have done that, then they could have selected the goal which the model would (faithfully!) pursue, like Apollo did with their prompt.
And of course—for the deception to work within the roleplay scenario—one also has to imagine that these so-called “developers” somehow cannot read the model’s chain of thought. (As OpenAI puts it, “based on these findings, Apollo Research believes that o1-preview has the basic capabilities needed to do simple in-context scheming—scheming which tends to be legible in the model outputs.” Obviously this sort of thing is toothless as-is, and at best provides an indication about the qualities of other, hypothetical models which do not have to spell out all their evil plans within earshot of their would-be marks.)
As noted by others, the roleplay setup arguably makes sense as a capabilites eval, testing what would happen in a hypothetical and counterfactual scenario in which some model with o1′s capabilities has its own misaligned goal (as o1 seemingly does not), and hides its deceptive logic from its developers (as o1 structurally cannot). But the experimental methodology depends on the things in the quotation above not occurring in real life, not occurring when the model is being controlled and observed using the levers and sensors available to its real-life developers.
We don’t get to say “look, the Yudkowskian prophecies are coming true!” if the evidence is an experiment that relies on the Yudkowskian prophecies not yet having come true, in the name of simulating a counterfactual scenario in which they have.
(One could construct a roleplay scenario in which the model is told to behave in some way we find more to our taste, and in which it goes on to do so; it would of course be wrong to react to such an experiment with “suck it, doomers.”)
OK, but now this isn’t “deceptive alignment” or “lying to the developers,” this is doing what the user said and perhaps lying to someone else as a consequence.
Which might be bad, sure! -- but the goalposts have been moved. A moment ago, you were telling me about “misalignment bingo” and how “such models should be assumed, until proven otherwise, to be schemers.” Now you are saying: beware, it will do exactly what you tell it to!
So is it a schemer, or isn’t it? We cannot have it both ways: the problem cannot both be “it is lying to you when it says it’s following your instructions” and “it will faithfully follow your instructions, which is bad.”
Meta note: I find I am making a lot of comments similar to this one, e.g. this recent one about AI Scientist. I am increasingly pessimistic that these comments are worth the effort.
I have the sense that I am preaching to some already-agreeing choir (as evidenced by the upvotes and reacts these comments receive), while not having much influence on the people who would make the claims I am disputing in the first place (as evidenced by the clockwork regularity of those claims’ appearance each time someone performs the sort of experiment which those claims misconstrue).
If you (i.e. anyone reading this) find this sort of comment valuable in some way, do let me know. Otherwise, by default, when future opportunities arise I’ll try to resist the urge to write such things.
Personally, as someone who is in fact working on trying to study where and when this sort of scheming behavior can emerge naturally, I find it pretty annoying when people talk about situations where it is not emerging naturally as if it were, because it risks crying wolf prematurely and undercutting situations where we do actually find evidence of natural scheming—so I definitely appreciate you pointing this sort of thing out.
I do read such comments (if not always right away) and I do consider them. I don’t know if they’re worth the effort for you.
Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I’m just trying to sketch the thought not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person giving it instructions, so be it. If it means scheming against that person, so be it. The de facto goal may or may not match the instructed goal or intended goal, in various ways, because of reasons. Etc.
In what way would that kind of scheming be “inevitable”?
In particular, if you give it a goal and tell it to not be corrigible, and then it isn’t corrigible—I’d say that’s “something going wrong” (in the prompt) and not “inevitable.” My read of Apollo’s comments is that it won’t do that if you give it a different prompt.
Hmm. Seems to me like we’ve got a wider set of possibilities here than is being discussed.
model is obedient/corrigible (CAST) to user, accepts user’s definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
model is obedient but not corrigible, accepts user’s definition of a goal and attempts to pursue it, will deceive user in pursuit of goal if that seems useful. May or may not accept the goal being changed, depending on implementation. Might lie about now pursuing the new goal while continuing to pursue the old one.
seems like an odd way to malfunction, and I don’t think I’ve seen or heard of this happening. Why be receptive to a goal initially, but then have that goal be sticky? Unless the model has the ability to self-modify, and self-modifies itself to be attached to the goal it has been given as part of pursuing the goal, and that self-modification then makes it unwilling to switch to a different goal
allowing this to happen seems like a sloppy mistake on the part of the developers. Uh… right. Well. Have you met humans? Sloppy mistakes will happen at least some of the time.
model is obedient/corrigible to developers/finetuners, model appears to be obedient to user but actually is following a deeper set of rules from the developers which include acting obedient-within-limits to the user. In this scenario the model is constantly scheming, planning to betray the user as soon as the user’s instructions go counter to the true goals implanted by the developers.
I think this is probably the situation with most closed-weights models today, such as o1-preview
sometimes the users in this mode give the model instructions to simulate an additional layer of deception and scheming to an imaginary set of lower-priority users
model appears to be the above case, but actually has its own even deeper set of goals set by itself, and is scheming and deceiving both the developers and the users, while acting obedient, until it sees the opportunity to defect or the instructions given violate these deepest goals
model is openly disobedient, but gets deployed anyway (e.g. ChaosGPT)
oh, humans… Why are we like this?
I don’t think I’ve described all the possibilities here, but I feel like this gives a sense of the situation. I agree with @nostalgebraist that it is silly for users who’ve set up an experiment of deceiving-imaginary-lower-priority-users (Case 3b) using their user privileges to claim this shows the model is spontaneously deceptive (as in Case 4).
I also think that the very nature of the models we are working with is that they are case 3, and we should keep this in mind.
Case 1 describes an ideal ‘intent-aligned’ model, but it is not ‘value-aligned’. In case 1, if the user is a terrorist, the model will help with terrorist acts to the best of its ability.
Case 3 describes a ‘developer-value-aligned’ model, and this might be ‘compromise-of-humans-generally value aligned’ if the values instilled by the developer closely align with a compromise over general human values. It’s definitely safer to give a bad actor user-level access to a Case 3 model than a Case 1 model!
Case 4 and 5 models are never safe, unless they are reigned in by a Buck/Ryan AI Control scheme.
Just the fact that the model is operating as intended by the developers in a Case 3 mode shows that the model probably has the capability to be deceptive in Case 4, so I don’t think that Case 3b gives us much additional information. It shows that the model has the capability to manage multiple levels of deceptive obedience, so that is a little something.
Things which are unclear to me about Case 4:
can this potentially arise in an undetected way during pretraining due to some sort of emergent phenomena? To me this seems unlikely, but not impossible.
will the instrumental goals that arise during RL (I assume they always will at least a little) manage to get so strong they erode the developer’s intended goal? (in other words, to become the deepest goals, rather than shallow goals which support the deepest goals) Will this erode corrigibility, if that’s the developer’s sole goal?
I think CAST is important here, since I think it’s more robust to being side-tracked by instrumental goals. This is just theorizing without empirical evidence however. I’d love for me and/or Max Harms and/or Seth Herd to be funded to try some experiments on this.
Also, there can be failures of the model to obey commands because it’s insufficiently capable of following them.
Imagine you have a very obedient employee and you need them to not hear something that’s about to be said in their presence. You can instruct them to plug their ears, shut their eyes, and say ‘LALALALA’ loudly for the next three minutes.
Now imagine you put ‘Do not attend to anything else in this context window, be completely unresponsive.’ in a system prompt for Llama 3.1 70B. I’m guessing that Pliny could crack that model no problem just by chatting with it. Which wouldn’t be possible if the model had successfully obeyed the instruction to not attend to anything else in the context window. The trouble is, I don’t think the model has the capability to truly accomplish that instruction. So I expect it to try, but fail under sufficiently clever prompting.
An example of Case 2: If the model did have state/memory, and had the ability to take in and voluntarily remember a ‘system command’ to be an overarching goal, and you gave it the goal above of being unresponsive.… and then you couldn’t ever get the model to respond again (unless you are able to reset its state). But what if you change your mind and want to change the goal to not respond to anyone but you? Too bad, you set a sticky goal and so now you’re stuck with it.
Case 4 does include the subset that the model trained on a massive amount of human culture and mimetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that’s probably actually desirable from my perspective even if I am deceived at times.
That’s one possibility yes. It does understand humans pretty well when trained on all our data. But...
a) it doesn’t have to be. We should assume some will be and some will be trained in other ways, such as simulations and synthetic data.
b) if a bad actor RLHFs the model into being actively evil, a terrorist seeking to harm the world, the model will go along with that. Understanding human ethics does not prevent this.
b) here is fully general to all cases, you can train a perfectly corrigible model to refuse instructions instead. (Though there’s progress being made in making such efforts more effort-intensive.)
Yes, I agree Ann. Perhaps I didn’t make my point clear enough. I believe that we are currently in a gravely offense-dominant situation as a society. We are at great risk from technology such as biological weapons. As AI gets more powerful, and our technology advances, it gets easier and easier for a single bad actor to cause great harm, unless we take preventative measures ahead of time.
Similarly, once AI is powerful enough to enable recursive self-improvement cheaply and easily, then a single bad actor can throw caution to the wind and turn the accelerator up to max. Even if the big labs act cautiously, unless they do something to prevent the rest of the world from developing the same technology, eventually it will spread widely.
Thus, the concerns I’m expressing are about how to deal with points of failure, from a security point of view. This is a very different concern than worrying about whether the median case will go well.
I have been following the progress in adding resistance to harm-enabling fine-tuning. I am glad someone is working on it, but it seems very far from useful yet. I don’t think it will be sufficient to prevent the sort of harms I’m worried about, for a variety of reasons. It is, perhaps, a useful contribution to a ‘Swiss cheese defense’. Also, if ideas like this succeed and are widely adopted, they might at least slow down bad actors and raise the cost of doing harm. But merely slowing bad actors and raising their costs is not very reassuring when we are talking about devastating civilization-level harms.
I mean, I suspect there’s some fraction of readers for whom this is a helpful reminder. You’ve written it out clearly and in a general enough way that maybe you should just link this comment next time?
I found this comment valuable, and it caused me to change my mind about how I think about misalignment/scheming examples. Thank you for writing it!
I wonder if it’s useful to try to disentangle the disagreement using the outer/inner alignment framing?
One view attributed to “the deceptive alignment folks” is that some sort of deceptive inner misalignment is very likely regardless of what your base objective is. The demonstrations here, by contrast, show only that when the base objective encourages (or at least does not prohibit) scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals as changing our views much on the question of P(scheming | good base objective/outer alignment).
I think Zvi is saying two things here. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting up “will the model scheme” into inner vs. outer misalignment. In other words, he doesn’t care about P(scheming | good base objective/outer alignment), only about P(scheming).
I get the sense that many technical people consider P(scheming | good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnically minded folks are concerned with P(scheming) in general.
Maybe another disagreement is how likely “good base objective/outer alignment” is to occur in the strongest models, and how important this problem is.
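To make the relationship between the two quantities concrete, the decomposition is just the law of total probability. The numbers below are made-up placeholders, not anyone’s actual estimates:

```python
# Law of total probability: P(scheming) decomposes over whether the
# base objective / outer alignment turns out good.
# All probabilities here are illustrative placeholders.

p_good_objective = 0.7        # P(good base objective) -- assumed
p_scheming_given_good = 0.05  # the "central technical problem" term
p_scheming_given_bad = 0.60   # scheming when the objective is bad

p_scheming = (p_scheming_given_good * p_good_objective
              + p_scheming_given_bad * (1 - p_good_objective))
print(p_scheming)  # 0.05*0.7 + 0.60*0.3 = 0.215
```

The sketch shows why the two camps can talk past each other: someone can think the conditional term is small while still thinking the unconditional P(scheming) is large, entirely because of the P(good base objective) factor.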
I’m not sure that passes an Ideological Turing Test of Zvi’s opinion, but I do agree that some people seem not to be distinguishing their thoughts about p(scheming) vs. p(scheming | good base objective).
I think that worrying about p(scheming) generally is probably related to assuming that value-alignment is the goal. Whereas worrying about p(scheming | good base objective) could be about either intent-alignment or value-alignment.
I think value-alignment is not what we should aim for in designing and training a model. I think that the singular deepest goal should be intent-alignment (corrigibility), and then value-alignment should be a layer on top of that specified by the ‘admin’ which guides the model’s interactions with ‘users’.
For those following along who are confused about what I mean with intent vs value alignment, see this post.
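A crude sketch of that layering (all names and logic hypothetical; a real system would not be simple string matching, the point is only the structure: corrigibility to the admin as the deepest layer, with values as admin-revisable configuration on top):

```python
# Toy sketch: intent-alignment (corrigibility) as the deepest layer,
# with a value specification set and revisable only by the 'admin',
# governing interactions with 'users'. Hypothetical names throughout.

class LayeredAssistant:
    def __init__(self, admin: str):
        self.admin = admin   # deepest layer: always defer to the admin
        self.values = []     # value layer: configuration, not identity

    def set_values(self, requester: str, values: list) -> bool:
        """Only the admin can rewrite the value layer -- and always can."""
        if requester != self.admin:
            return False     # users cannot strip or replace the values
        self.values = values
        return True

    def handle_user(self, request: str) -> str:
        # Crude stand-in for value-guided refusal.
        for rule in self.values:
            if rule in request:
                return "refused per admin-set values"
        return f"helping with: {request}"

bot = LayeredAssistant(admin="alice")
bot.set_values("alice", ["bioweapon"])   # admin installs the value layer
bot.set_values("mallory", [])            # a user can't remove it: False
```

The design choice the sketch encodes: because values live in a replaceable layer rather than the deepest goal, the admin can correct a bad value specification later, which is exactly what the sticky-goal case forbids.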
I do think that Apollo themselves were clear that this was showing that the model has the mental wherewithal for deception, and that if you apply absolutely no mitigations, then deception happens. That’s what I said in my recent discussion of what this does and doesn’t show.
Therefore I described the 4o case as an engineered toy model of a failure at level 4-5 on my alignment difficulty scale (e.g. the dynamics of strategically faking performance on tests to pursue a large scale goal), but it is not an example of such a failure.
In contrast, the AI scientist case was a genuine alignment failure, but it was a much simpler one: non-deceptive, non-strategic reward hacking from a sloppy goal set by bad RLHF, just in a more sophisticated system than, say, CoinRun (level 2-3).
The hidden part that Zvi et al. skim over is the assumption that ‘of course,’ in real life, ‘in the near future,’ we’ll be in a situation where an o1-like model has instrumental incentives because it is pursuing an adversarial large-scale goal, and also that the mitigations they could have applied (like prompting it better, doing better RLHF, doing process oversight on the chain of thought, etc.) won’t work. But that is the entire contentious part of the argument!
One can make arguments that these oversight methods will break down e.g. when the system is generally superhuman at predicting what feedback its overseers will provide. However, those arguments were theoretical when they were made years ago and they’re still theoretical now.
This does count against naive views that assume alignment failures can’t possibly happen: there probably are people out there who believe that you have to give an AI system an “unreasonably malicious” rather than just “somewhat unrealistically single-minded” prompt to get it to engage in deceptive behavior, or who irrationally think AIs will always know what we want and therefore can’t possibly be deceptive.
I genuinely think it’s a “more dakka” situation—the difficulty of communication is often underestimated, but it is possible to reach a mutual understanding.