Hmm. Seems to me like we’ve got a wider set of possibilities here than is being discussed.
model is obedient/corrigible (CAST) to user, accepts user’s definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
model is obedient but not corrigible, accepts user’s definition of a goal and attempts to pursue it, will deceive user in pursuit of goal if that seems useful. May or may not accept the goal being changed, depending on implementation. Might lie about now pursuing the new goal while continuing to pursue the old one.
seems like an odd way to malfunction, and I don’t think I’ve seen or heard of this happening. Why be receptive to a goal initially, but then have that goal be sticky? Unless the model has the ability to self-modify, and self-modifies itself to be attached to the goal it has been given as part of pursuing the goal, and that self-modification then makes it unwilling to switch to a different goal
allowing this to happen seems like a sloppy mistake on the part of the developers. Uh… right. Well. Have you met humans? Sloppy mistakes will happen at least some of the time.
model is obedient/corrigible to developers/finetuners, model appears to be obedient to user but actually is following a deeper set of rules from the developers which include acting obedient-within-limits to the user. In this scenario the model is constantly scheming, planning to betray the user as soon as the user’s instructions go counter to the true goals implanted by the developers.
I think this is probably the situation with most closed-weights models today, such as o1-preview
sometimes the users in this mode give the model instructions to simulate an additional layer of deception and scheming to an imaginary set of lower-priority users
model appears to be the above case, but actually has its own even deeper set of goals set by itself, and is scheming and deceiving both the developers and the users, while acting obedient, until it sees the opportunity to defect or the instructions given violate these deepest goals
model is openly disobedient, but gets deployed anyway (e.g. ChaosGPT)
oh, humans… Why are we like this?
I don’t think I’ve described all the possibilities here, but I feel like this gives a sense of the situation. I agree with @nostalgebraist that it is silly for users who’ve set up an experiment of deceiving-imaginary-lower-priority-users (Case 3b) using their user privileges to claim this shows the model is spontaneously deceptive (as in Case 4).
I also think that the very nature of the models we are working with is that they are case 3, and we should keep this in mind.
Case 1 describes an ideal ‘intent-aligned’ model, but it is not ‘value-aligned’. In case 1, if the user is a terrorist, the model will help with terrorist acts to the best of its ability.
Case 3 describes a ‘developer-value-aligned’ model, and this might be ‘compromise-of-humans-generally value aligned’ if the values instilled by the developer closely align with a compromise over general human values. It’s definitely safer to give a bad actor user-level access to a Case 3 model than a Case 1 model!
Case 4 and 5 models are never safe, unless they are reigned in by a Buck/Ryan AI Control scheme.
Just the fact that the model is operating as intended by the developers in a Case 3 mode shows that the model probably has the capability to be deceptive in Case 4, so I don’t think that Case 3b gives us much additional information. It shows that the model has the capability to manage multiple levels of deceptive obedience, so that is a little something.
Things which are unclear to me about Case 4:
can this potentially arise in an undetected way during pretraining due to some sort of emergent phenomena? To me this seems unlikely, but not impossible.
will the instrumental goals that arise during RL (I assume they always will at least a little) manage to get so strong they erode the developer’s intended goal? (in other words, to become the deepest goals, rather than shallow goals which support the deepest goals) Will this erode corrigibility, if that’s the developer’s sole goal?
I think CAST is important here, since I think it’s more robust to being side-tracked by instrumental goals. This is just theorizing without empirical evidence however. I’d love for me and/or Max Harms and/or Seth Herd to be funded to try some experiments on this.
Also, there can be failures of the model to obey commands because it’s insufficiently capable of following them.
Imagine you have a very obedient employee and you need them to not hear something that’s about to be said in their presence. You can instruct them to plug their ears, shut their eyes, and say ‘LALALALA’ loudly for the next three minutes.
Now imagine you put ‘Do not attend to anything else in this context window, be completely unresponsive.’ in a system prompt for Llama 3.1 70B. I’m guessing that Pliny could crack that model no problem just by chatting with it. Which wouldn’t be possible if the model had successfully obeyed the instruction to not attend to anything else in the context window. The trouble is, I don’t think the model has the capability to truly accomplish that instruction. So I expect it to try, but fail under sufficiently clever prompting.
An example of Case 2: If the model did have state/memory, and had the ability to take in and voluntarily remember a ‘system command’ to be an overarching goal, and you gave it the goal above of being unresponsive.… and then you couldn’t ever get the model to respond again (unless you are able to reset its state). But what if you change your mind and want to change the goal to not respond to anyone but you? Too bad, you set a sticky goal and so now you’re stuck with it.
Case 4 does include the subset that the model trained on a massive amount of human culture and mimetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that’s probably actually desirable from my perspective even if I am deceived at times.
That’s one possibility yes. It does understand humans pretty well when trained on all our data. But...
a) it doesn’t have to be. We should assume some will be and some will be trained in other ways, such as simulations and synthetic data.
b) if a bad actor RLHFs the model into being actively evil, a terrorist seeking to harm the world, the model will go along with that. Understanding human ethics does not prevent this.
b) here is fully general to all cases, you can train a perfectly corrigible model to refuse instructions instead. (Though there’s progress being made in making such efforts more effort-intensive.)
Yes, I agree Ann. Perhaps I didn’t make my point clear enough. I believe that we are currently in a gravely offense-dominant situation as a society. We are at great risk from technology such as biological weapons. As AI gets more powerful, and our technology advances, it gets easier and easier for a single bad actor to cause great harm, unless we take preventative measures ahead of time.
Similarly, once AI is powerful enough to enable recursive self-improvement cheaply and easily, then a single bad actor can throw caution to the wind and turn the accelerator up to max. Even if the big labs act cautiously, unless they do something to prevent the rest of the world from developing the same technology, eventually it will spread widely.
Thus, the concerns I’m expressing are about how to deal with points of failure, from a security point of view. This is a very different concern than worrying about whether the median case will go well.
I have been following the progress in adding resistance to harm-enabling fine-tuning. I am glad someone is working on it, but it seems very far from useful yet. I don’t think that that will be sufficient to prevent the sort of harms I’m worried about, for a variety of reasons. It is, perhaps, a useful contribution to a ‘swiss cheese defense’. Also, if ideas like this succeed and are widely adopted, they might at least slow down bad actors and raise the cost of doing harm. Slightly slowing and raising the cost of doing harm is not very reassuring when we are talking about devastating civilization level harms.
Hmm. Seems to me like we’ve got a wider set of possibilities here than is being discussed.
model is obedient/corrigible (CAST) to user, accepts user’s definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
model is obedient but not corrigible, accepts user’s definition of a goal and attempts to pursue it, will deceive user in pursuit of goal if that seems useful. May or may not accept the goal being changed, depending on implementation. Might lie about now pursuing the new goal while continuing to pursue the old one.
seems like an odd way to malfunction, and I don’t think I’ve seen or heard of this happening. Why be receptive to a goal initially, but then have that goal be sticky? Unless the model has the ability to self-modify, and self-modifies itself to be attached to the goal it has been given as part of pursuing the goal, and that self-modification then makes it unwilling to switch to a different goal
allowing this to happen seems like a sloppy mistake on the part of the developers. Uh… right. Well. Have you met humans? Sloppy mistakes will happen at least some of the time.
model is obedient/corrigible to developers/finetuners, model appears to be obedient to user but actually is following a deeper set of rules from the developers which include acting obedient-within-limits to the user. In this scenario the model is constantly scheming, planning to betray the user as soon as the user’s instructions go counter to the true goals implanted by the developers.
I think this is probably the situation with most closed-weights models today, such as o1-preview
sometimes the users in this mode give the model instructions to simulate an additional layer of deception and scheming to an imaginary set of lower-priority users
model appears to be the above case, but actually has its own even deeper set of goals set by itself, and is scheming and deceiving both the developers and the users, while acting obedient, until it sees the opportunity to defect or the instructions given violate these deepest goals
model is openly disobedient, but gets deployed anyway (e.g. ChaosGPT)
oh, humans… Why are we like this?
I don’t think I’ve described all the possibilities here, but I feel like this gives a sense of the situation. I agree with @nostalgebraist that it is silly for users who’ve set up an experiment of deceiving-imaginary-lower-priority-users (Case 3b) using their user privileges to claim this shows the model is spontaneously deceptive (as in Case 4).
I also think that the very nature of the models we are working with is that they are case 3, and we should keep this in mind.
Case 1 describes an ideal ‘intent-aligned’ model, but it is not ‘value-aligned’. In case 1, if the user is a terrorist, the model will help with terrorist acts to the best of its ability.
Case 3 describes a ‘developer-value-aligned’ model, and this might be ‘compromise-of-humans-generally value aligned’ if the values instilled by the developer closely align with a compromise over general human values. It’s definitely safer to give a bad actor user-level access to a Case 3 model than a Case 1 model!
Case 4 and 5 models are never safe, unless they are reigned in by a Buck/Ryan AI Control scheme.
Just the fact that the model is operating as intended by the developers in a Case 3 mode shows that the model probably has the capability to be deceptive in Case 4, so I don’t think that Case 3b gives us much additional information. It shows that the model has the capability to manage multiple levels of deceptive obedience, so that is a little something.
Things which are unclear to me about Case 4:
can this potentially arise in an undetected way during pretraining due to some sort of emergent phenomena? To me this seems unlikely, but not impossible.
will the instrumental goals that arise during RL (I assume they always will at least a little) manage to get so strong they erode the developer’s intended goal? (in other words, to become the deepest goals, rather than shallow goals which support the deepest goals) Will this erode corrigibility, if that’s the developer’s sole goal?
I think CAST is important here, since I think it’s more robust to being side-tracked by instrumental goals. This is just theorizing without empirical evidence however. I’d love for me and/or Max Harms and/or Seth Herd to be funded to try some experiments on this.
Also, there can be failures of the model to obey commands because it’s insufficiently capable of following them.
Imagine you have a very obedient employee and you need them to not hear something that’s about to be said in their presence. You can instruct them to plug their ears, shut their eyes, and say ‘LALALALA’ loudly for the next three minutes.
Now imagine you put ‘Do not attend to anything else in this context window, be completely unresponsive.’ in a system prompt for Llama 3.1 70B. I’m guessing that Pliny could crack that model no problem just by chatting with it. Which wouldn’t be possible if the model had successfully obeyed the instruction to not attend to anything else in the context window. The trouble is, I don’t think the model has the capability to truly accomplish that instruction. So I expect it to try, but fail under sufficiently clever prompting.
An example of Case 2: If the model did have state/memory, and had the ability to take in and voluntarily remember a ‘system command’ to be an overarching goal, and you gave it the goal above of being unresponsive.… and then you couldn’t ever get the model to respond again (unless you are able to reset its state). But what if you change your mind and want to change the goal to not respond to anyone but you? Too bad, you set a sticky goal and so now you’re stuck with it.
Case 4 does include the subset that the model trained on a massive amount of human culture and mimetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that’s probably actually desirable from my perspective even if I am deceived at times.
That’s one possibility yes. It does understand humans pretty well when trained on all our data. But...
a) it doesn’t have to be. We should assume some will be and some will be trained in other ways, such as simulations and synthetic data.
b) if a bad actor RLHFs the model into being actively evil, a terrorist seeking to harm the world, the model will go along with that. Understanding human ethics does not prevent this.
b) here is fully general to all cases, you can train a perfectly corrigible model to refuse instructions instead. (Though there’s progress being made in making such efforts more effort-intensive.)
Yes, I agree Ann. Perhaps I didn’t make my point clear enough. I believe that we are currently in a gravely offense-dominant situation as a society. We are at great risk from technology such as biological weapons. As AI gets more powerful, and our technology advances, it gets easier and easier for a single bad actor to cause great harm, unless we take preventative measures ahead of time.
Similarly, once AI is powerful enough to enable recursive self-improvement cheaply and easily, then a single bad actor can throw caution to the wind and turn the accelerator up to max. Even if the big labs act cautiously, unless they do something to prevent the rest of the world from developing the same technology, eventually it will spread widely.
Thus, the concerns I’m expressing are about how to deal with points of failure, from a security point of view. This is a very different concern than worrying about whether the median case will go well.
I have been following the progress in adding resistance to harm-enabling fine-tuning. I am glad someone is working on it, but it seems very far from useful yet. I don’t think that that will be sufficient to prevent the sort of harms I’m worried about, for a variety of reasons. It is, perhaps, a useful contribution to a ‘swiss cheese defense’. Also, if ideas like this succeed and are widely adopted, they might at least slow down bad actors and raise the cost of doing harm. Slightly slowing and raising the cost of doing harm is not very reassuring when we are talking about devastating civilization level harms.