If I only take counterfactuals over a single AI’s decision then I can have this problem with just two AIs: each of them tries to manipulate me, and if one of them fails the other will succeed and so I see no variation in my preferences.
In that case the hope is to take counterfactuals over all the decisions at once. I don’t know if this is realistic, but I think it probably either fails in mundane cases or works in this slightly exotic case. Also, honestly, it doesn’t seem that much harder than taking counterfactuals over one decision, which is already tough.
(I think that many manipulators wanting to push me in the same direction isn’t too exotic though.)
ETA: I think I misunderstood your comment and there’s actually a more basic miscommunication. I’m imagining the counterfactual over the different ads that the AI considered running, before settling on the paperclip-maximizing one (having realized that the others wouldn’t lead to me loving paperclips). I’m not imagining the counterfactual over the different values that the AI might have.
Oh I see. Why doesn’t this symmetrically cause you to filter out good arguments for changing your values (told to you by a friend, say) as well as bad ones?
If all works well, this would filter out anything from the environment that significantly changes your values and that you don’t specifically want. (E.g. you don’t filter out food, as opposed to “random configurations of atoms I could eat,” because you specifically want to eat food.) We normally think of the hard case as one where correct deliberation depends on some aspects of the environment staying “on distribution” but you don’t recognize which (discussed a bit here). But correct arguments from your friend are the same: you can have preferences over which arguments you hear, but if you can’t decide, or even define, whether your friend is “being helpful” or “being manipulative,” then we don’t think the kind of regularization-based approach discussed in this document will plausibly incentivize your AI to clarify that distinction, so you’re on your own.
We’ve discussed this basic dilemma before: you could split and reflect separately until you become wise enough to decide whether people are safe (perhaps in light of their histories), or you could only interact with people you trust, or you could make early commitments to e.g. not use powerful AI advisors (though the time for such commitments rapidly approaches and passes). But nothing in this document will help you with that, and we’re a bit skeptical of any hope that the same mechanism would address both that problem and ELK (other than solving both by solving alignment in some way that doesn’t require ELK, such that ELK was a silly subproblem).
Ok, this all makes sense now. When I first read that section I got the impression that you were trying to do something more ambitious. You may want to add a clarification that you’re not describing a scheme designed to block only manipulation while letting helpful arguments through, or that “letting helpful arguments through” would require additional ideas beyond that section.