The main problem is that “acting via plans that are passed to humans” is not much different from “acting via plans that are passed to robots” when the AI is good enough at modelling humans.
I don’t think this needs an in-depth explanation, does it?
In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this: [...]
I don’t think the given scenario is realistic for any sort of competent AI. There are two sub-cases:
If step 1 won’t fail due to being read, then the scenario is unrealistic at the “you stop reading the plan at that point” stage. A sufficiently intelligent AI might be able to produce such a plan, but that’s already a game-over case.
If step 1 will fail due to the plan being read, a competent AI should be able to predict that step 1 will fail due to being read. The scenario is then unrealistic at the “your AGI generates a plan …” stage, because it should be assumed that the AI won’t produce plans that it predicts won’t work.
So this leaves only the assumption that the AI is terrible at modelling humans, but can still make plans that should work well in the real world where humans currently dominate. Maybe there is some tiny corner of possibility space where that can happen, but I don’t think it contributes much to the overall likelihood unless we can find a way to eliminate everything else.
The main problem is that “acting via plans that are passed to humans” is not much different from “acting via plans that are passed to robots” when the AI is good enough at modelling humans.
I agree this is true. But I don’t see why “acting via plans that are passed to humans” is what would happen.
I mean, that might be a component of the plan which is generated. But the assumption here is that we’ve decoupled plan generation from plan execution successfully, no?
We therefore know that the plan we’re looking at (at least at the top level) is the result of plan generation, not the first step of plan execution (as you seem to be implicitly assuming).
The AI is searching for plans which score highly according to some criterion. The criterion “plans which lead to lots of paperclips if implemented” is not the same as the criterion “plans which lead to lots of paperclips if shown to humans”.
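As a toy illustration of that distinction, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the plan names, the paperclip counts, and the two scoring fields. It only shows that the same search space can have a different top-scoring plan depending on which criterion the search uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    clips_if_implemented: int  # paperclips if the plan is carried out exactly as written
    clips_if_shown: int        # paperclips that result once humans read the plan and react


# Hypothetical search space; the numbers are invented purely for illustration.
PLANS = [
    Plan("build one factory", 1_000, 1_000),
    Plan("convert all steel to clips", 1_000_000, 0),  # humans read it and refuse
    Plan("persuasive pitch for more factories", 10_000, 50_000),
]


def best_if_implemented(plans):
    """Search under the criterion 'lots of paperclips if implemented'."""
    return max(plans, key=lambda p: p.clips_if_implemented)


def best_if_shown(plans):
    """Search under the criterion 'lots of paperclips if shown to humans'."""
    return max(plans, key=lambda p: p.clips_if_shown)


print(best_if_implemented(PLANS).name)  # convert all steel to clips
print(best_if_shown(PLANS).name)        # persuasive pitch for more factories
```

Which of the two functions a real planner would effectively be running is exactly what is in dispute in this thread; the sketch takes no position on that.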
My point is that plan execution can’t be decoupled successfully from plan generation in this way. “Outputting a plan” is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
Also, I think the last sentence is literally true, but misleading. Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.
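The selection effect claimed here can be sketched in Python under one strong simplifying assumption: a plan yields its paperclips only if the human reviewer accepts it, and a rejected plan yields none. The plan names, scores, and acceptance flags are all hypothetical. In this toy model, searching by “if implemented” and then letting the human veto from the top down returns the same plan as searching by “if shown to humans” directly.

```python
# Toy model: name -> (paperclips if implemented, whether the human accepts the plan).
# All entries are invented for illustration.
PLANS = {
    "convert all steel to clips": (1_000_000, False),
    "build ten factories": (10_000, True),
    "build one factory": (1_000, True),
}


def clips_if_shown(name):
    """Paperclips actually produced once a human has read the plan."""
    clips, accepted = PLANS[name]
    return clips if accepted else 0


def search_then_filter():
    """Rank by 'paperclips if implemented', then let the human veto from the top down."""
    ranked = sorted(PLANS, key=lambda n: PLANS[n][0], reverse=True)
    for name in ranked:
        if PLANS[name][1]:
            return name
    return None


def search_if_shown():
    """Search directly under 'paperclips if shown to humans'."""
    return max(PLANS, key=clips_if_shown)


print(search_then_filter())  # build ten factories
print(search_if_shown())     # build ten factories
```

The sketch models only a veto. It does not model plans that change what the human is willing to accept, which is the harder case the thread is arguing about.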
My point is that plan execution can’t be decoupled successfully from plan generation in this way. “Outputting a plan” is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
“Outputting a plan” may technically constitute an action, but a superintelligent system (defining “superintelligent” as being able to search large spaces quickly) might not evaluate its effects as such.
Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.
I think you’re making a lot of assumptions here. For example, let’s say I’ve just created my planner AI, and I want to test it out by having it generate a paperclip-maximizing plan, just for fun. Is there any meaningful sense in which the displayed plan will be optimized for the criterion “plans which lead to lots of paperclips if shown to humans”? If not, I’d say there’s an important effective difference.
If the superintelligent search system also has an outer layer that attempts to collect data about my plan preferences and model them, then I agree there’s the possibility of incorrect modeling, as discussed in this subthread. But it seems anthropomorphic to assume that such a search system must have some kind of inherent real-world objective that it’s trying to shift me towards with the plans it displays.
Yes, if you’ve just created it, then the criteria are meaningfully different in that case for a very limited time.
But we’re getting a long way off track here, since the original question was about what the flaw is with separating plan generation from plan execution as a general principle for achieving AI safety. Are you clearer about my position on that now?
Yes, if you’ve just created it, then the criteria are meaningfully different in that case for a very limited time.
It’s not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?
I don’t see how we’re getting off track. (Your original statement was: ‘One such “clever designer” idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.’ If we’re discussing situations where that claim may be false, it seems to me we’re still on track.) But you shouldn’t feel obligated to reply if you don’t want to. Thanks for your replies so far, btw.
What changes is that the human sees that the AI is producing plans that try to manipulate humans. It is very likely that the human does not want the AI to produce such plans, and so applies some corrective action against it happening in future.