For what it’s worth, I often find Eliezer’s arguments unpersuasive because they seem shallow. For example:
The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.
This seems like a fuzzy “outside view” sort of argument. (Compare with: “A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways.” On the other hand, a causal model of a gun lets you explain which specific gun operations can be deadly and why.)
I’m not saying Eliezer’s conclusion is false. I find other arguments for that conclusion much more persuasive, e.g. involving mesa-optimizers, because there is a proposed failure type which I understand in causal/mechanistic terms.
(I can provide other examples of shallow-seeming arguments if desired.)
I agree that it’s a shallow argument presentation, but that’s not the same thing as being based on shallow ideas. The context provided more depth, and in general a fair few of the shallowly presented arguments seem to be counters to even more shallow arguments.
In general, one of the deeper concepts underlying all these shallow arguments appears to be some sort of “AGI-completeness” thesis: any single system that can reach or exceed human mental capability on most tasks will almost certainly reach or exceed it on all mental tasks, including deceiving and manipulating humans. Combine that with the potentially far greater flexibility and extensibility of the computing substrate, and you get an incredibly dangerous situation no matter how clever the designers think they’ve been.
One such “clever designer” idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan-executing hardware. You don’t need a deep argument to point out an obvious flaw there. Talking about mesa-optimizers in such a context is just missing the point from a view in which humans can potentially be used as part of a toolchain in much the same way as robot arms or protein factories.
One such “clever designer” idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware. You don’t need a deep argument to point out an obvious flaw there.
I don’t see the “obvious flaw” you’re pointing at and would appreciate a more in-depth explanation.
In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this:
You ask your AGI to generate a plan for how it could maximize paperclips.
Your AGI generates a plan. “Step 1: Manipulate human operator into thinking that paperclips are the best thing ever, using the following argument...”
You stop reading the plan at that point, and don’t click “execute” for it.
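To make the intended setup concrete, here is a minimal sketch of that loop, assuming nothing about the planner’s internals; all function names are hypothetical stand-ins:

```python
# Minimal sketch of the decoupled setup (hypothetical names): the planner only
# produces text, and nothing runs unless the human operator explicitly approves.

def generate_plan(goal: str) -> str:
    """Stand-in for the AGI planner; in this sketch it returns a canned plan."""
    return f"Step 1: Acquire raw materials for {goal}.\nStep 2: ..."

def human_approves(plan: str) -> bool:
    """The operator reads the plan and decides whether to click 'execute'."""
    print(plan)
    return input("Execute this plan? [y/N] ").strip().lower() == "y"

def execute(plan: str) -> None:
    """Only reached if the operator approved the plan."""
    print("Executing:", plan.splitlines()[0])

if __name__ == "__main__":
    plan = generate_plan("maximizing paperclips")
    if human_approves(plan):
        execute(plan)
    else:
        print("Plan rejected; nothing was executed.")
```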
I had the same view as you, and was persuaded out of it in this thread. Maybe to shift focus a little, one interesting question here is about training. How do you train a plan-generating AI? If you reward plans that sound like they’d succeed, regardless of how icky they seem, then the AI will become useless to you by outputting effective-sounding but icky plans. But if you reward only plans that look nice enough to execute, that tempts the AI to make plans that manipulate whoever is reading them, and we’re back at square one.
Maybe that’s a good way to look at the general problem. Instead of talking about AI architecture, just say that we don’t know of any training methods that would make an AI better than humans at real-world planning and safe to interact with the world, even if it’s just answering questions.
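To spell the dilemma out with a toy example (the names and labels here are hypothetical stand-ins, not a claim about how such a system would actually be trained):

```python
# Toy illustration of the two training schemes described above (hypothetical names).

def reward_effectiveness_only(predicted_success: float, looks_icky: bool) -> float:
    # Scheme 1: reward plans that sound like they'd succeed, ignoring ickiness.
    # This reinforces effective-sounding but icky plans, making the output useless.
    return predicted_success

def reward_only_nice_looking_plans(predicted_success: float, looks_icky: bool) -> float:
    # Scheme 2: only reward plans the reader would be willing to execute.
    # Now "not looking icky to the reader" is itself optimized, which is
    # exactly the pressure toward plans that manipulate whoever reads them.
    return predicted_success if not looks_icky else 0.0
```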
I agree these are legitimate concerns… these are the kind of “deep” arguments I find more persuasive.
In that thread, johnswentworth wrote:
In particular, even if we have a reward signal which is “close” to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.
I’d solve this by maintaining uncertainty about the “reward signal”, so the AI tries to find a plan which looks good under both alignment and the actual-process-which-generates-the-reward-signal. (It doesn’t know which is which, but it tries to learn a sufficiently diverse set of reward signals such that alignment is in there somewhere. I don’t think we can do any better than this, because the entire point is that there is no way to disambiguate between alignment and the actual-process-which-generates-the-reward-signal by gathering more data. Well, I guess maybe you could do it with interpretability or the right set of priors, but I would hesitate to make those load-bearing.)
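A rough sketch of what I mean (every name here is a hypothetical stand-in): learn several candidate reward models, one of which hopefully captures alignment, and only favor plans that score well under all of them, since we can’t tell which model is which.

```python
# Toy sketch (hypothetical names): keep an ensemble of learned reward models and
# rank plans by their worst-case score, so a plan only wins if it looks good
# under every candidate interpretation of the reward signal.

from typing import Callable, List

RewardModel = Callable[[str], float]

def worst_case_score(plan: str, reward_models: List[RewardModel]) -> float:
    """Score a plan by the most pessimistic member of the ensemble."""
    return min(model(plan) for model in reward_models)

def pick_plan(plans: List[str], reward_models: List[RewardModel]) -> str:
    """Return the plan with the best worst-case score."""
    return max(plans, key=lambda p: worst_case_score(p, reward_models))

# Illustrative stand-ins for two reward signals we can't disambiguate:
def approval(plan: str) -> float:      # what the raters would endorse (can be fooled)
    return 0.9 if "manipulate" in plan.lower() else 1.0

def alignment(plan: str) -> float:     # what we actually want
    return 0.0 if "manipulate" in plan.lower() else 1.0

plans = [
    "Build a paperclip factory and report progress honestly.",
    "Step 1: Manipulate the operator into approving everything.",
]
print(pick_plan(plans, [approval, alignment]))  # picks the honest plan
```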
(BTW, potentially interesting point I just thought of. I’m gonna refer to actual-process-which-generates-the-reward-signal as “approval”. Supposing for a second that it’s possible to disambiguate between alignment and approval somehow, and we successfully aim at alignment and ignore approval. Then we’ve got an AI which might deliberately do aligned things we disapprove of. I think this is not ideal, because from the outside this behavior is also consistent with an AI which has learned approval incorrectly. So we’d want to flip the off switch for the sake of caution. Therefore, as a practical matter, I’d say that you should aim to satisfy both alignment and approval anyways. I suppose you could argue that on the basis of the argument I just gave, satisfying approval is therefore part of alignment and thus this is an unneeded measure, but overall the point is that aiming to satisfy both alignment and approval seems to have pretty low costs.)
(I suppose technically you can disambiguate between alignment and approval if there are unaligned things that humans would approve of—I figure you solve this problem by making your learning algorithm robust against mislabeled data.)
Anyway, you could use a similar approach for the nice-plans problem, or you could formalize a notion of “manipulation” which is something like: conditional on the operator viewing this plan, does their predicted favorability towards subsequent plans change in expectation?
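A rough sketch of that manipulation measure (the favorability predictor is a hypothetical learned model, not anything that exists today):

```python
# Rough sketch of the proposed "manipulation" measure (hypothetical predictor):
# how much does viewing this plan shift the operator's expected favorability
# toward subsequent plans, compared with not having viewed it?

from statistics import mean
from typing import Callable, List, Optional

def manipulation_score(
    plan: str,
    future_plans: List[str],
    predicted_favorability: Callable[[str, Optional[str]], float],
) -> float:
    """predicted_favorability(future_plan, shown_plan) estimates how favorably
    the operator would rate future_plan after having seen shown_plan
    (or nothing, if shown_plan is None)."""
    with_plan = mean(predicted_favorability(f, plan) for f in future_plans)
    without_plan = mean(predicted_favorability(f, None) for f in future_plans)
    # A large shift in either direction suggests the plan is manipulating the reader.
    return with_plan - without_plan
```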
Edit: Another thought is that the delta between “approval” and “alignment” seems like the delta between me and my CEV. So to get from “approval” to “alignment”, you could ask your AI to locate the actual-process-which-generates-the-labels, and then ask it about how those labels would be different if we “knew more, thought faster, were more the people we wished we were” etc. (I’m also unclear why you couldn’t ask a hyper-advanced language model what some respected moral philosophers would think if they were able to spend decades contemplating your question or whatever.)
Another edit: You could also just manually filter through all the icky plans until you find one which is non-icky.
(Very interested in hearing objections to all of these ideas.)
The main problem is that “acting via plans that are passed to humans” is not much different from “acting via plans that are passed to robots” when the AI is good enough at modelling humans.
I don’t think this needs an in-depth explanation, does it?
In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this: [...]
I don’t think the given scenario is realistic for any sort of competent AI. There are two sub-cases:
If step 1 won’t fail due to being read, then the scenario is unrealistic at the “you stop reading the plan at that point” stage. This might be possible for a sufficiently intelligent AI, but that’s already a game over case.
If step 1 will fail due to the plan being read, a competent AI should be able to predict that step 1 will fail due to being read. The scenario is then unrealistic at the “your AGI generates a plan …” stage, because it should be assumed that the AI won’t produce plans that it predicts won’t work.
So this leaves only the assumption that the AI is terrible at modelling humans, but can still make plans that should work well in the real world where humans currently dominate. Maybe there is some tiny corner of possibility space where that can happen, but I don’t think it contributes much to the overall likelihood unless we can find a way to eliminate everything else.
The main problem is that “acting via plans that are passed to humans” is not much different from “acting via plans that are passed to robots” when the AI is good enough at modelling humans.
I agree this is true. But I don’t see why “acting via plans that are passed to humans” is what would happen.
I mean, that might be a component of the plan which is generated. But the assumption here is that we’ve decoupled plan generation from plan execution successfully, no?
So we therefore know that the plan we’re looking at (at least at the top level) is the result of plan generation, not the first step of plan execution (as you seem to be implicitly assuming?)
The AI is searching for plans which score highly according to some criterion. The criterion “plans which lead to lots of paperclips if implemented” is not the same as the criterion “plans which lead to lots of paperclips if shown to humans”.
My point is that plan execution can’t be decoupled successfully from plan generation in this way. “Outputting a plan” is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
Also, I think the last sentence is literally true, but misleading. Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.
My point is that plan execution can’t be decoupled successfully from plan generation in this way. “Outputting a plan” is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
“Outputting a plan” may technically constitute an action, but a superintelligent system (defining “superintelligent” as being able to search large spaces quickly) might not evaluate the effects of that output as an action at all.
Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.
I think you’re making a lot of assumptions here. For example, let’s say I’ve just created my planner AI, and I want to test it out by having it generate a paperclip-maximizing plan, just for fun. Is there any meaningful sense in which the displayed plan will be optimized for the criterion “plans which lead to lots of paperclips if shown to humans”? If not, I’d say there’s an important effective difference.
If the superintelligent search system also has an outer layer that attempts to collect data about my plan preferences and model them, then I agree there’s the possibility of incorrect modeling, as discussed in this subthread. But it seems anthropomorphic to assume that such a search system must have some kind of inherent real-world objective that it’s trying to shift me towards with the plans it displays.
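To make the picture I have in mind concrete (names hypothetical; this is a sketch of the assumption I’m describing, not a safety claim): the search system just ranks candidate plans by a world model’s prediction of what happens if the plan were implemented, with no term anywhere for the effect of merely displaying the plan to me.

```python
# Toy sketch of the "pure search" picture (hypothetical names): plans are ranked
# by predicted paperclips *if implemented*; the loop contains no model of what
# displaying a plan to the operator does.

from typing import Callable, Iterable

def best_plan(
    candidate_plans: Iterable[str],
    paperclips_if_implemented: Callable[[str], float],
) -> str:
    """Return the candidate that the world model scores highest if implemented."""
    return max(candidate_plans, key=paperclips_if_implemented)
```

Whether a trained system actually stays this clean is, of course, part of what’s under dispute in this thread.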
Yes, if you’ve just created it, then the criteria are meaningfully different in that case for a very limited time.
But we’re getting a long way off track here, since the original question was about what the flaw is with separating plan generation from plan execution as a general principle for achieving AI safety. Are you clearer about my position on that now?
Yes, if you’ve just created it, then the criteria are meaningfully different in that case for a very limited time.
It’s not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?
I don’t see how we’re getting off track. (Your original statement was: ‘One such “clever designer” idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.’ If we’re discussing situations where that claim may be false, it seems to me we’re still on track.) But you shouldn’t feel obligated to reply if you don’t want to. Thanks for your replies so far, btw.
What changes is that the human sees that the AI is producing plans that try to manipulate humans. It is very likely that the human does not want the AI to produce such plans, and so applies some corrective action against it happening in future.
My comment on that post asks more-or-less the same question, and also ventures an answer.