I don’t think this is right. The agent is optimized to choose actions which, when shown to a human, receive high approval. It’s not optimized to pick actions which, when executed, cause the agent to receive high approval in the future.
I think optimizing for high approval now leaves a huge number of variables unconstrained. For example, I could absolutely imagine a corrigible AI with ISIS suicide bomber values that consistently receives high approval from its operator and eventually turns its operator into an ISIS suicide bomber. (Maybe not all operators, but definitely some operators.)
Given the constraint of optimizing for high approval now, in what other directions would our corrigible AI try to optimize? Some natural guesses would be optimizing for future approval, or optimizing for its model of its operator’s extrapolated values (which I would distrust unless I had good reason to trust its extrapolation process). If it were doing either, I’d be very scared about getting corrupted. But you’re right that it may not optimize for us turning into yes-men in particular.
I suspect this disagreement is related to our disagreement about the robustness of human reflection. Actually, the robustness of human reflection is a crux for me—if I thought human reflection were robust, then I think an AI that continuously optimizes for high approval now would leave few important variables unconstrained, and would lead us to very good outcomes by default. Is this a crux for you too?
What do you mean by leaving variables unconstrained? Optimizing for X is basically a complete description.
(Of course, if I optimize my system for X, I may get a system that optimizes for Y != X, but that doesn’t seem like what you are talking about. I don’t think that the Y’s you described—future approval, or idealized preferences—are especially likely Y’s. More likely is something totally alien, or even reproductive fitness.)
Oops, I do think that’s what I meant. To explain my wording: when I imagined a “system optimizing for X”, I didn’t imagine that system trying its hardest to do X; I imagined “a system such that, if you ask which variable Z it can best be described as optimizing, that Z is X”.
To put it another way, more concretely: I mean that there are a bunch of different systems S1, S2, ..., Sn that, when “trying to optimize for X as hard as possible”, all look to us like they optimize X successfully, but do so via methods M1, M2, ..., Mn that lead to vastly different (and generally undesirable) endstates Y1, Y2, ..., Yn, like the one described in this post, or one where the operators become ISIS suicide bombers. In this light, it seems more accurate to describe Si as optimizing for Yi instead of X, even though Si is trying to optimize for X and optimizes it pretty successfully. But I really don’t want a superintelligent system optimizing for some Y that is not my values.
As a possibly related general intuition, I think the space of outcomes that can result from having a human follow a sequence of suggestions, each of which they’d enthusiastically endorse, is massive, and that most of these outcomes are undesirable. (It’s possible that one crisp articulation of “sufficient metaphilosophical competence” is that following a sequence of suggestions, each of which you’d enthusiastically endorse, is actually good for you.)
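Here is a toy sketch of that intuition, purely illustrative (the approval function, the drift dynamics, and the “agent bias” knob are all invented for the example, not a claim about how any real system works): two simulated agents each offer, at every step, whichever suggestion the operator most approves of right now, yet because they search over differently skewed candidate sets, the operator’s values end up in very different places.

```python
import numpy as np

rng = np.random.default_rng(0)

def approval(operator_values, suggestion):
    # The operator approves of suggestions close to their *current* values.
    return -np.linalg.norm(operator_values - suggestion)

def run(agent_bias, steps=1000, drift=0.01):
    """Each step the agent proposes small nudges (its 'method' M_i),
    the operator adopts the one they approve of most, and their
    values shift slightly toward it."""
    operator = np.zeros(2)  # the operator's current "values"
    for _ in range(steps):
        # Candidate suggestions: tiny perturbations of the status quo,
        # skewed toward the direction this particular agent favors.
        candidates = operator + drift * (rng.normal(size=(20, 2)) + agent_bias)
        chosen = max(candidates, key=lambda c: approval(operator, c))
        operator = chosen  # each endorsed step moves the operator a little
    return operator

# Both agents receive near-maximal approval at every single step...
print("endstate under agent 1:", run(agent_bias=np.array([1.0, 0.0])))
print("endstate under agent 2:", run(agent_bias=np.array([-1.0, 0.5])))
# ...but the operator's values drift to very different endstates Y_1, Y_2.
```

In both runs the per-step approval signal looks essentially the same; only the trajectory of the operator’s values distinguishes the Si, which is the sense in which “optimizing for approval now” leaves those variables unconstrained.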
On reflection, I agree that neither future approval nor idealized preferences are particularly likely, and that whatever Y is would actually look very alien.