Not a response to your actual point, but
I think that hypothetical example probably doesn't make sense (as in, making the AI not "care" doesn't prevent it from including mindhacks in its plan).
If you have a plan that is "superintelligently optimized" for some misaligned goal, then that plan will have to take into account the effect of outputting the plan itself, and will by default contain deception or mindhacks even if the AI doesn't in some sense "care" about executing plans.
(Or, if you set up some complicated scheme with counterfactuals so the model ignores the effects of the plan on humans, that will make your plans less useful or inscrutable.)
The plan that produces the most paperclips is going to be one that deceives or mindhacks humans, rather than one that humans wouldn't accept in the first place.
Maybe it's possible to use some kind of scheme that keeps the model from taking the consequences of outputting the plan itself into account, but the model more or less has to be modeling the humans reading its plan in order to write a realistic plan that humans will understand, accept, and be able to put into practice, and the plan might only work in the fake counterfactual universe with no plan that it was written for.
So I doubt it's actually feasible to have any such scheme that avoids mindhacks and still produces useful plans.
I think I agree, but also, people say things like "the AI should, if possible, be prevented from modeling humans", which, if that were possible, would imply that the hypothetical example makes more sense.