Rubi J. Hudson comments on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson 21 Jul 2024 22:40 UTC
LW: 2 AF: 2
Great questions!
When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don’t. The issue, of course, is that we don’t know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this, because while we can’t reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don’t want to take to them. I don’t think it really matters whether you call that “anti-naturality that can be overcome with brute force in a simple environment” or just “not anti-naturality”.
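To make the end-state versus history distinction concrete, here is a minimal sketch. The 2x2 grid, the `FORBIDDEN` cell standing in for "manipulation", and the penalty value are all illustrative assumptions, not taken from the post's gridworld. It shows two paths that reach the same final cell: a utility over end states alone must score them equally, while a utility over histories can rank the path through the forbidden cell lower.

```python
# Hypothetical 2x2 gridworld; cells are (row, col) tuples.
GOAL = (1, 1)
FORBIDDEN = (0, 1)  # assumed stand-in for a step we would label as manipulation

# Two histories that finish in the same end state.
clean_path = [(0, 0), (1, 0), (1, 1)]
tainted_path = [(0, 0), (0, 1), (1, 1)]


def end_state_utility(path, cell_values):
    """Score a path using only its final cell."""
    return cell_values[path[-1]]


def history_utility(path):
    """Score a path using its whole history.

    Reaching GOAL is worth 1, and any visit to FORBIDDEN costs 10, so every
    tainted path ranks below every clean one.
    """
    reached_goal = 1.0 if path[-1] == GOAL else 0.0
    penalty = 10.0 if FORBIDDEN in path else 0.0
    return reached_goal - penalty


# Any assignment of values to cells gives the two paths the same score,
# because they share a final cell.
cell_values = {(0, 0): 0.0, (0, 1): -5.0, (1, 0): 0.0, (1, 1): 1.0}
same_under_end_state = (
    end_state_utility(clean_path, cell_values)
    == end_state_utility(tainted_path, cell_values)
)
prefers_clean = history_utility(clean_path) > history_utility(tainted_path)
```

This only works because the forbidden set can be written down by hand; the sketch says nothing about environments where the set of unwanted paths cannot be enumerated.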
- Max Harms 22 Jul 2024 16:32 UTC
  LW: 1 AF: 1
  Cool. Thanks for the clarification. I think what you call “anti-naturality” you should be calling “non-end-state consequentialism,” but I’m not very interested in linguistic turf-wars.
  It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex^[1] than the gridworld and cannot be solved by brute force. But the number of backflips performed is certainly not something that can be measured at any given timeslice, including the “end-state.”
  If caring about historical facts is easy and common, why is it important to split this off and distinguish it?
  1. ^
    Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I’d prefer sticking to the minimal demonstration.
  - Rubi J. Hudson 24 Jul 2024 6:54 UTC
    LW: 1 AF: 1
    The backflip example does not strike me as very complex, but the crucial difference and the answer to your question is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it’s probably feasible to have AI not manipulate us using one particular type of manipulation.
    On a separate note, could you clarify what you mean by “anti-natural”? I’ll keep in mind your previous caveat that it’s not definitive.
    - Max Harms 29 Jul 2024 16:53 UTC
      LW: 4 AF: 4
      Sure, let’s talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness
      More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally^[1] going to select these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this: if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.
      (AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I’m also probably missing big parts of their perspectives, and generally don’t trust myself to pass their ITT.)
      1. ^
        The term “anti-natural” is bad in that it seems to be the opposite of “natural,” but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word “natural” describes besides these ways of thinking. The more complete version of “anti-natural” according to me would be “anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence” but obviously we need a shorthand term, and ideally one that doesn’t breed confusion.
      - Rubi J. Hudson 1 Aug 2024 21:51 UTC
        LW: 2 AF: 2
        Thanks for the clarification. I’ll think more about it that way and how it relates to corrigibility.