Oops, I do think that’s what I meant. To explain my wording: when I imagined a “system optimizing for X”, I didn’t imagine a system trying its hardest to do X; I imagined “a system such that the variable Z it can best be described as optimizing is X”.
To put it concretely another way: I mean that there are a bunch of different systems S1,S2,...,Sn that, when “trying to optimize for X as hard as possible”, all look to us like they optimize for X successfully, but do so via methods M1,M2,...,Mn that lead to vastly different (and generally undesirable) end states Y1,Y2,...,Yn, like the one described in this post, or one where the operators become ISIS suicide bombers. In this light, it seems more accurate to describe Si as optimizing for Yi rather than X, even though Si is trying to optimize for X and optimizes it fairly successfully. But I really don’t want a superintelligent system optimizing for some Y that is not my values.
As a possibly related general intuition, I think the space of outcomes that can result from having a human follow a sequence of suggestions, each of which they’d enthusiastically endorse, is massive, and that most of these outcomes are undesirable. (It’s possible that one crisp articulation of “sufficient metaphilosophical competence” is that following a sequence of suggestions, each of which you’d enthusiastically endorse, is actually good for you.)
On reflection, I agree that neither future approval nor idealized preferences are particularly likely, and that whatever Y is would actually look very alien.