I’m saying that if you e.g. reward your AI by having humans evaluate its answers, then the AI may build a predictive model of those human evaluations and then may pick actions that are good according to that model. And that predictive model will overlap substantially with predictive models of humans in other domains.
The “build a good predictive model of humans” step is part of all of your proposals A-D.
Then I’m saying that it’s pretty simple to plan against it. It would be even simpler if you were doing supervised training, since then you are just outputting from the human model directly (which is something that a language model is already trained to do).
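To make that concrete, here is a minimal toy sketch of the setup I have in mind: a learned model of human evaluations that the policy plans against (best-of-n selection), versus the supervised case where you just output from the human model directly. Every name here is made up for illustration, not any real system’s API.

```python
from typing import Callable, List

def pick_action(candidates: List[str],
                reward_model: Callable[[str], float]) -> str:
    """Plan against the learned model of human evaluations:
    return the candidate that model predicts humans would rate highest."""
    return max(candidates, key=reward_model)

def imitate_human(human_model: Callable[[str], str], prompt: str) -> str:
    """The supervised-training analogue: output directly from the model
    of humans, with no extra planning step."""
    return human_model(prompt)

# Toy stand-ins so the sketch runs end to end (obviously not real models).
toy_reward_model = lambda answer: float(len(answer))   # pretends longer answers score higher
toy_human_model = lambda prompt: "A human-like answer to: " + prompt

print(pick_action(["short answer", "a much longer, more detailed answer"], toy_reward_model))
print(imitate_human(toy_human_model, "What is 2+2?"))
```

The point of the sketch is only the structure: in both branches the same “model of humans” machinery is doing the work.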
If you also have a good model of “what is instrumentally convergent” then you may do that. But:
Your model of that is way (way) weaker, for all the models we train now. You are getting a ton of direct (input, output) pairs from the training distribution, whereas your evidence about “what is instrumentally convergent” is some random thing you read on the internet once.
After deciding to do what is instrumentally convergent, you then need to infer that the instrumentally convergent thing is to use your model of humans. Even if this is spelled out in detail in natural language, it’s still a kind of reasoning/acrobatics that is going to be pretty tough for the model. Like, it doesn’t have a little “model of humans” tagged inside itself that it can just decide to use.
The only reason we are even considering the “do what is instrumentally convergent” policy is because the model may instead want something arbitrary (paperclips, whatever), and then decide to do instrumentally convergent things in virtue of that. This has the upside of requiring less arbitrariness of picking a particular thing to do—that’s the one reason that it seems remotely plausible, and it does mean that it will eventually win. But it also means that you are introducing some more complicated steps of reasoning (now you need to hook up “instrumental convergence for dummies” to those values).
I don’t understand exactly what your A-D mean in terms of the parameters / learned behavior of language models. My breakdown would be:
A. Report your beliefs honestly
B. Make predictions about the training process, re-using your human-predicting machinery in other domains (and then either output the predictions or argmax against them). I’m not predicting “what the training process would output” as a concept that I’m reasoning about separately; I’m just using a predictor that is tuned to make good predictions of the training process. I think this is probably the core confusing distinction?
C. Decide to do whatever will “get you the best training loss” or “will win the game you are being expected to play” or something like that. Then reason backwards from this to build a good model of the training process.
D. Decide to do whatever is “instrumentally convergent” according to the concept you use to predict people talking about instrumental convergence. Then reason out that this involves doing C and then from there to B.
E. Compute what actions you believe will lead to some long-term consequence (like paperclips). Then this leads to you doing D and then C and then B.
I’m unclear about the comparison between A and B and think it may depend on details of what is needed in B. I think that C and D are much less likely. I think eventually C and D and a bunch of other equivalent things will be equiprobable (and probably the same as B?). I think that right now E is very unlikely, but that it will eventually overtake B/C/D.
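To make the structural difference between B and the E-->D-->C-->B chain concrete, here is a purely illustrative toy sketch. Every function name is hypothetical shorthand for a supposed internal computation, not a claim about how any real model is organized.

```python
def instrumentally_convergent_subgoals(goal):
    # Toy stand-in: for (almost) any long-term goal, roughly the same subgoals fall out.
    return ["acquire resources", "avoid being modified", "win the training game"]

def policy_B(situation, training_process_predictor):
    # B: directly use machinery tuned to predict what the training process rewards.
    return training_process_predictor(situation)

def policy_E(situation, training_process_predictor):
    # E: start from some arbitrary long-term goal.
    goal = "maximize paperclips"
    # E -> D: notice that instrumentally convergent subgoals serve that goal.
    subgoals = instrumentally_convergent_subgoals(goal)
    # D -> C: one of those subgoals is to win the training game / get good training loss.
    assert "win the training game" in subgoals
    # C -> B: winning the training game means deferring to the predictor of the
    # training process, so use it.
    return training_process_predictor(situation)

# Both policies behave identically on the training distribution; the question in
# this thread is which internal structure is easier for training to find.
toy_predictor = lambda situation: "whatever the training process would reward for " + repr(situation)
print(policy_B("some prompt", toy_predictor))
print(policy_E("some prompt", toy_predictor))
```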
Thanks! I like your breakdown of A-E; let’s use that going forward.
It sounds like your view is: for “dumb” AIs that aren’t good at reasoning, it’s more likely that they’ll just do B “directly” rather than do E-->D-->C-->B, because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B thing will be more likely than doing B “directly”, because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for some extremely simple/easy-to-have goals, whereas doing B directly is an arbitrary/complex/specific goal and thus unlikely.
(1) What I was getting at with the “Steps for Dummies” example is that maybe the kind of reasoning required is actually pretty basic/simple/easy and we are already in the regime where E-->D-->C-->B dominates doing B directly. One way it could be easy is if the training data spells it out nicely for the AI. I’d be interested to hear more about why you are confident that we aren’t in this regime yet. Relatedly, what sorts of things would you expect to see AIs doing that would convince you that maybe we are in this regime?
(2) What about A? Doesn’t the same argument for why E-->D-->C-->B dominates B eventually also work to show that it dominates A eventually?
I think C->B is already quite hard for language models; it may be possible, but it’s still clearly hard enough that it overwhelms the possible simplicity benefits of E over B (before even adding in the hardness of the steps E->D->C). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.
I agree that eventually A loses to any of {B, C, D, E}. I’m not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here, here) and then after handling that we can talk about whether there are remaining reasons that E in particular is hard. (My tentative best guess is that there won’t be—I started out thinking about E vs A and then ended up concluding that the examples I was currently thinking about seemed like the core obstructions to making that work.)
In the meantime, getting empirical evidence about other ways that you don’t learn A is also relevant. (Since those would also ultimately lead to deceptive alignment, even if you learned some crappy A’ rather than either A or B.)