The backflip example does not strike me as very complex. The crucial difference, and the answer to your question, is that training procedures do not teach a robot to do every kind of backflip, just a subset. This matters because when we reverse the example, we want non-manipulation to cover the entire set of manipulations. I think it’s probably feasible to have an AI not manipulate us using one particular type of manipulation.
On a separate note, could you clarify what you mean by “anti-natural”? I’ll keep in mind your previous caveat that it’s not definitive.
Sure, let’s talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness
More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select for these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this; if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways. Training a mind to have that goal is thus like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but for the most part harder to produce using known methods.
(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I’m also probably missing big parts of their perspectives, and generally don’t trust myself to pass their ITT.)
The term “anti-natural” is bad in that it seems to be the opposite of “natural,” but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word “natural” describes besides these ways of thinking. The more complete version of “anti-natural” according to me would be “anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence” but obviously we need a shorthand term, and ideally one that doesn’t breed confusion.
Thanks for the clarification. I’ll think more about it in those terms, and about how it relates to corrigibility.