My point is that plan execution can’t be decoupled successfully from plan generation in this way. “Outputting a plan” is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
Also, I think the last sentence is literally true, but misleading. Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so those plans effectively end up scoring the same as they would under the second criterion.
> My point is that plan execution can’t be decoupled successfully from plan generation in this way. “Outputting a plan” is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
“Outputting a plan” may technically constitute an action, but a superintelligent system (defining “superintelligent” as being able to search large spaces quickly) might not evaluate the effects of that output on the world at all.
> Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so those plans effectively end up scoring the same as they would under the second criterion.
I think you’re making a lot of assumptions here. For example, let’s say I’ve just created my planner AI, and I want to test it out by having it generate a paperclip-maximizing plan, just for fun. Is there any meaningful sense in which the displayed plan will be optimized for the criterion “plans which lead to lots of paperclips if shown to humans”? If not, I’d say there’s an important effective difference.
If the superintelligent search system also has an outer layer that attempts to collect data about my plan preferences and model them, then I agree there’s the possibility of incorrect modeling, as discussed in this subthread. But it seems anthropomorphic to assume that such a search system must have some kind of inherent real-world objective that it’s trying to shift me towards with the plans it displays.
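To make the distinction concrete, here is a minimal toy sketch (purely illustrative; the plan representation, “world model”, and “human model” below are made-up stand-ins, not anyone’s actual proposal) of a plan search scored under each of the two criteria:

```python
import random

random.seed(0)

# Toy "plans": each has a predicted object-level outcome and a superficial
# persuasiveness score. Both fields are hypothetical stand-ins.
plans = [{"paperclips": random.randint(0, 100),
          "persuasiveness": random.random()} for _ in range(1000)]

def predicted_paperclips(plan):
    """Criterion 1: paperclips produced if the plan were executed (ignores the human)."""
    return plan["paperclips"]

def gullible_human(plan):
    """Crude stand-in for a human model: approves persuasive-sounding plans more often."""
    return min(1.0, 0.2 + 0.8 * plan["persuasiveness"])

def predicted_paperclips_if_shown(plan, human_model):
    """Criterion 2: expected paperclips given that a human reads the plan and
    only executes it if they approve; optimization now flows through the human."""
    return human_model(plan) * plan["paperclips"]

def search(candidates, score):
    """A search system in the narrow sense: just return the highest-scoring plan."""
    return max(candidates, key=score)

best_plain = search(plans, predicted_paperclips)
best_shown = search(plans, lambda p: predicted_paperclips_if_shown(p, gullible_human))

print("criterion 1 pick:", best_plain)
print("criterion 2 pick:", best_shown)
```

In the first case nothing about the human appears anywhere in the objective; only in the second does persuading the reader become something the search is rewarded for.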
Yes, if you’ve just created it, then the criteria are meaningfully different, but only for a very limited time.
But we’re getting a long way off track here, since the original question was about what the flaw is with separating plan generation from plan execution as a general principle for achieving AI safety. Are you clearer about my position on that now?
> Yes, if you’ve just created it, then the criteria are meaningfully different, but only for a very limited time.
It’s not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?
I don’t see how we’re getting off track. (Your original statement was: ‘One such “clever designer” idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.’ If we’re discussing situations where that claim may be false, it seems to me we’re still on track.) But you shouldn’t feel obligated to reply if you don’t want to. Thanks for your replies so far, btw.
What changes is that the human sees that the AI is producing plans that try to manipulate humans. It is very likely that the human does not want the AI to produce such plans, and so applies some corrective action to prevent it from happening in the future.