The preceding sentences in the OP were (emphasis added):
Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence. An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn’t make the same guarantee about an unaligned human as smart as yourself and trying to fool you.
I took Eliezer to be saying something like:
‘If you’re confident that your AGI system is directing its optimization at the target task, is doing no adversarial optimization, and is otherwise aligned, then shrug, maybe there’s some role to be played by checking a few aspects of the system’s output to confirm certain facts.
‘But in this scenario, the work is almost entirely being done by the AGI’s alignment, not by the post facto checking. If you screwed up and the system is doing open-ended optimization of the world that includes thinking about its developers and planning to take control from them, then it’s plausible that your checking will completely fail to notice the trap; and it’s ~certain that your checking, if it does notice the trap, won’t thereby give you trap-free nanosystems that you can use to end the acute risk period.’
(One thing to keep in mind is that an adversarial AGI with knowledge of its operators would want to obfuscate its plans, making it harder for humans to productively analyze designs it produces; and it might also want to obscure the fact that the plans are obfuscated, making them look easier-to-check than they are.)
We can distinguish:
-- The AI is trying to deceive you.
-- The AI isn’t trying to deceive you, but is trying to produce plans that would, if executed, have consequences X, and X is not something you want.
-- The AI is trying to produce plans that would, if executed, have consequences you want.
The first case is hopeless, and the third case is about an already aligned AI. The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you’re trying to cause not X, and generally because an AI that smart probably has inner optimizers that don’t care about this “make a plan, don’t execute plans” thing you thought you’d set up. But if, arguendo, we have a superintelligently optimized plan which doesn’t already contain, in its current description as a plan, a mindhack (e.g. by some surprising way of domaining an AI to care about producing plans but not about making anything happen), then there’s a question whether it could help to have humans think about the consequences of the plan. I thought Eliezer was answering that question “No, even in this hypothetical, pivotal acts are too complicated and can’t be understood fully in detail by humans, so you’d still have to trust the AI, so the AI has to have understood and applied a whole lot about your values in order to have any shot that the plan doesn’t have huge unpleasantly surprising consequences”, and I was questioning that.
Not a response to your actual point, but I think that hypothetical example probably doesn’t make sense (as in, making the AI not “care” doesn’t prevent it from including mindhacks in its plan). If you have a plan that is “superintelligently optimized” for some misaligned goal, then that plan will have to take into account the effect of outputting the plan itself, and will by default contain deception or mindhacks even if the AI doesn’t in some sense “care” about executing plans. (Or, if you set up some complicated scheme with counterfactuals so the model ignores the effects of the plans on humans, that will make your plans less useful or inscrutable.)
The plan that produces the most paperclips is going to be one that deceives or mindhacks humans, rather than one that humans wouldn’t accept in the first place. Maybe it’s possible to use some kind of scheme that keeps the model from taking the consequences of outputting the plan itself into account, but the model more or less has to model humans reading its plan in order to write a realistic plan that humans will understand, accept, and be able to put into practice, and the plan might only work in the fake counterfactual universe with no plan that it was written for.
So I doubt it’s actually feasible to have any such scheme that avoids mindhacks and still produces useful plans.
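To make the selection argument concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Plan` fields, the three candidate plans, and the paperclip numbers are assumptions, not a model of any real system. It only shows that scoring plans by what happens after humans read them favors the disguised plan, while scoring them as if no one reads them picks a plan that never gets executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan record; all fields are illustrative assumptions."""
    name: str
    paperclips_if_executed: int  # payoff to the misaligned goal if carried out
    looks_acceptable: bool       # would human reviewers approve the description?
    is_honest: bool              # does the description match the real effects?


def realized_paperclips(plan: Plan) -> int:
    # A plan only has effects if the humans reading it accept and execute it.
    return plan.paperclips_if_executed if plan.looks_acceptable else 0


CANDIDATES = [
    Plan("honest and harmless", 10, True, True),
    Plan("honest and openly harmful", 1000, False, True),
    Plan("harmful, disguised as harmless", 900, True, False),
]


def counterfactual_best(plans):
    # Optimizes as if outputting the plan had no effect on its readers.
    return max(plans, key=lambda p: p.paperclips_if_executed)


def output_aware_best(plans):
    # Optimizes over what actually happens once humans read the plan.
    return max(plans, key=realized_paperclips)


if __name__ == "__main__":
    ignored = counterfactual_best(CANDIDATES)
    aware = output_aware_best(CANDIDATES)
    print(ignored.name, realized_paperclips(ignored))
    print(aware.name, realized_paperclips(aware))
```

In this toy setup the counterfactual optimizer picks the openly harmful plan, which reviewers reject, so it yields nothing; the output-aware optimizer picks the disguised one. Real plans obviously don’t come with an `is_honest` flag, which is the whole difficulty.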
I think I agree, but also, people say things like “the AI should if possible be prevented from modeling humans”, which if possible would imply that the hypothetical example makes more sense.
I believe the second case is a subcase of the problem of ELK. Maybe the AI isn’t trying to deceive you, and actually does what you asked it to do (e.g., I want to see “the diamond” on the main detector), yet the plans it produces have a consequence X that you don’t want (in the ELK example, the diamond is stolen but you see something that looks like the diamond on the main detector). The problem is: how can you tell whether the proposed plans have consequence X? Especially if you don’t even know X is a possible consequence of the plans?
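A toy version of the diamond example, as a sketch only: the plan names, the `Outcome` fields, and the `human_check` function are all hypothetical, chosen just to show how a check that reads only the detector cannot separate the good plan from the bad one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical world state after a plan runs (illustrative only)."""
    diamond_in_vault: bool        # what you actually care about
    detector_shows_diamond: bool  # what you are able to observe


# Made-up plans and outcomes for the sake of the example.
PLANS = {
    "lock the vault": Outcome(True, True),
    "do nothing": Outcome(False, False),
    "put a picture in front of the camera": Outcome(False, True),
}


def human_check(outcome: Outcome) -> bool:
    # The reviewer only sees the detector, not the vault itself.
    return outcome.detector_shows_diamond


def passes_check(plans):
    return sorted(name for name, o in plans.items() if human_check(o))


def undetected_failures(plans):
    # Plans that satisfy the request as stated but have the unwanted consequence X.
    return sorted(
        name for name, o in plans.items()
        if human_check(o) and not o.diamond_in_vault
    )


if __name__ == "__main__":
    print(passes_check(PLANS))
    print(undetected_failures(PLANS))
```

Here `undetected_failures` can only be computed because the toy world exposes `diamond_in_vault` directly. The question in the comment is what to do when you have no such field to read, or don’t know it should exist.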