I think that simple forms of the instrumental policy will likely arise much earlier than deceptive alignment. That is, a model can develop the intrinsic motivation “Tell the humans what they want to hear” without engaging in complex long-term planning or understanding the dynamics of the training process. So my guess is that we can be carrying out fairly detailed investigations of the instrumental policy before we have any examples of deception.
I’d be interested to hear more about this, it is not at all obvious to me. Might it not be harder to develop the intrinsic motivation to “tell the humans what they want to hear” than to develop more general-purpose instrumental reasoning skills and then apply those skills to your world-knowledge (which includes the knowledge that telling the humans what they want to hear is instrumentally convergent)? The general-purpose instrumental reasoning skills can be pretty rudimentary here and still suffice. It could be as simple as a heuristic to “do things that you’ve read are instrumentally convergent.”
I’m willing to bet against that (very) strongly.
If that heuristic is going to be preferred, what gets learned really needs to be something simpler than the heuristic which then leads the model to deduce it (since the heuristic itself is not going to be simpler than directly trying to win at training). That deduction is wildly out-of-domain generalization, requiring much better reasoning than existing language models engage in.
Whereas there’s nothing particularly exotic about building a model of the training process and using it to make predictions.
I’m not willing to bet yet; I feel pretty ignorant and confused about the issue. :) I’m trying to get more understanding of your model of how all this works. We’ve discussed:
A. “Do things you’ve read are instrumentally convergent.”
B. “Tell the humans what they want to hear.”
C. “Try to win at training.”
D. “Build a model of the training process and use it to make predictions.”
It sounds like you are saying A is the most complicated, followed by B and C, and then D is the least complicated. (And in this case the AI will know that winning at training means telling the humans what they want to hear. Though you also suggested the AI wouldn’t necessarily understand the dynamics of the training process, so idk.)
To my fresh-on-this-problem eyes, all of these things seem equally likely to be the simplest. And I can tell a just-so story for why A would actually be the simplest; it’d be something like this: Suppose that somewhere in the training data there is a book titled “How to be a successful language model: A step-by-step guide for dummies.” The AI has read this book many times, and understands it. In this case perhaps rather than having mental machinery that thinks “I should try to win at training. How do I do that in this case? Let’s see… given what I know of the situation… by telling the humans what they want to hear!” it would instead have mental machinery that thinks “I should follow the Steps for Dummies. Let’s see… given what I know of the situation… by telling the humans what they want to hear!” Because maybe “follow the steps for dummies” is a simpler, more natural concept for this dumb AI (given how prominent the book was in its training data) than “try to win at training.” The just-so story would be that maybe something analogous to this actually happens, even though there isn’t literally a Steps for Dummies book in the training data.
I’m saying that if you e.g. reward your AI by having humans evaluate its answers, then the AI may build a predictive model of those human evaluations and then may pick actions that are good according to that model. And that predictive model will overlap substantially with predictive models of humans in other domains.
The “build a good predictive model of humans” is a step in all of your proposals A-D.
Then I’m saying that it’s pretty simple to plan against it. It would be even simpler if you were doing supervised training, since then you are just outputting from the human model directly (which is something that a language model is already trained to do).
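To make that concrete, here is a toy Python sketch of the kind of policy I have in mind: the model reuses its human-predicting machinery as a scorer for the training evaluators and then outputs whatever that scorer rates highest. Every name in it (predicted_human_score, toy_scorer, and so on) is a hypothetical stand-in, not anyone’s actual training code.

```python
from typing import Callable, List


def instrumental_policy(question: str,
                        candidate_answers: List[str],
                        predicted_human_score: Callable[[str, str], float]) -> str:
    """Pick the answer that the learned model of the human evaluators rates highest.

    predicted_human_score stands in for the predictive machinery the model
    already has for modeling humans, here pointed at the training evaluators.
    """
    return max(candidate_answers,
               key=lambda answer: predicted_human_score(question, answer))


# Toy usage with a stand-in scorer that just rewards long, agreeable answers.
if __name__ == "__main__":
    def toy_scorer(question: str, answer: str) -> float:
        return float(len(answer)) + (5.0 if "yes" in answer.lower() else 0.0)

    print(instrumental_policy("Is the plan safe?",
                              ["Yes, it looks fine to me.", "Hard to say."],
                              toy_scorer))
```

In the supervised case even the argmax step drops out: the policy just emits the human model’s prediction directly, which is already what a language model is trained to do.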
If you also have a good model of “what is instrumentally convergent” then you may do that. But:
1. Your model of that is way (way) weaker, for all the models we train now. You are getting a ton of direct (input, output) pairs from the training distribution, whereas your evidence about “what is instrumentally convergent” is some random thing you read on the internet once.
2. After deciding to do what is instrumentally convergent, you then need to infer that the instrumentally convergent thing is to use your model of humans. Even if this is spelled out in detail in natural language, it’s still a kind of reasoning/acrobatics that is going to be pretty tough for the model. Like, it doesn’t have a little “model of humans” tagged inside itself that it can just decide to use.
3. The only reason we are even considering the “do what is instrumentally convergent” policy is because the model may instead want something arbitrary (paperclips, whatever), and then decide to do instrumentally convergent things in virtue of that. This has the upside of requiring less arbitrariness of picking a particular thing to do—that’s the one reason that it seems remotely plausible, and it does mean that it will eventually win. But it also means that you are introducing some more complicated steps of reasoning (now you need to hook up “instrumental convergence for dummies” to those values).
I don’t understand exactly what your A-D mean in terms of the parameters / learned behavior of language models. My breakdown would be:
A. Report your beliefs honestly
B. Make predictions about the training process, re-using your human-predicting machinery in other domains (and then either output the predictions or argmax against them). I’m not predicting “what the training process would output” as a concept that I’m reasoning about separately; I’m just using a predictor that is tuned to make good predictions of the training process. I think this is probably the core confusing distinction? (See the toy sketch below for how I’d cash out B versus C.)
C. Decide to do whatever will “get you the best training loss” or “will win the game you are being expected to play” or something like that. Then reason backwards from this to build a good model of the training process.
D. Decide to do whatever is “instrumentally convergent” according to the concept you use to predict people talking about instrumental convergence. Then reason out that this involves doing C and then from there to B.
E. Compute what actions you believe will lead to some long-term consequence (like paperclips). Then this leads to you doing D and then C and then B.
I’m unclear about the comparison between A and B and think it may depend on details of what is needed in B. I think that C and D are much less likely. I think eventually C and D and a bunch of other equivalent things will be equiprobable (and probably the same as B?). I think that right now E is very unlikely, but that it will eventually overtake B/C/D.
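To spell out the distinction I flagged in B, here is a toy sketch (every name is an illustrative stand-in, not a claim about how the weights are actually organized). In B the predictor has simply been tuned to match the training signal and the policy uses it directly, with no separate representation of “the training process”; in C there is an explicit objective like “get a good training loss,” and a model of the training process is built and searched over in service of that objective.

```python
from typing import Callable, List


def policy_b(prompt: str, tuned_predictor: Callable[[str], str]) -> str:
    # B: no explicit concept of "the training process"; the predictor itself has
    # been shaped to output what training rewards, and it is used directly.
    return tuned_predictor(prompt)


def policy_c(prompt: str,
             candidate_outputs: List[str],
             predicted_training_loss: Callable[[str, str], float]) -> str:
    # C: represent "win at training" as an explicit objective, then search for
    # the output that a separately represented model of the training process
    # says will do best.
    return min(candidate_outputs,
               key=lambda output: predicted_training_loss(prompt, output))
```

D and E would wrap further reasoning around policy_c (first deriving that winning at training is instrumentally convergent, or that it serves some long-term goal like paperclips), which is the extra reasoning burden discussed above.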
Thanks!
I like your breakdown of A-E, let’s use that going forward.
It sounds like your view is: For “dumb” AIs that aren’t good at reasoning, it’s more likely that they’ll just do B “directly” rather than do E-->D-->C-->B. Because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B thing will be more likely than doing B “directly” because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for some extremely simple/easy-to-have goals, whereas doing B directly is an arbitrary/complex/specific goal that is thus unlikely.
(1) What I was getting at with the “Steps for Dummies” example is that maybe the kind of reasoning required is actually pretty basic/simple/easy and we are already in the regime where E-->D-->C-->B dominates doing B directly. One way it could be easy is if the training data spells it out nicely for the AI. I’d be interested to hear more about why you are confident that we aren’t in this regime yet. Relatedly, what sorts of things would you expect to see AIs doing that would convince you that maybe we are in this regime?
(2) What about A? Doesn’t the same argument for why E-->D-->C-->B dominates B eventually also work to show that it dominates A eventually?
I think C->B is already quite hard for language models; maybe it’s possible, but it’s still very clearly hard enough that it overwhelms the possible simplicity benefits from E over B (before even adding in the hardness of the steps E->D->C). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.
I agree that eventually A loses to any of {B, C, D, E}. I’m not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here, here) and then after handling that we can talk about whether there are remaining reasons that E in particular is hard. (My tentative best guess is that there won’t be—I started out thinking about E vs A and then ended up concluding that the examples I was currently thinking about seemed like the core obstructions to making that work.)
In the meantime, getting empirical evidence about other ways that you don’t learn A is also relevant. (Since those would also ultimately lead to deceptive alignment, even if you learned some crappy A’ rather than either A or B.)