This works as a subtle argument for security mindset in AI control (while not being framed as such). One issue is that it might de-emphasize the AI control problems that are not analogous to practical security problems, such as detailed value elicitation (whereas in security you formulate a few general principles and then give up on further detail). That is, the concept of {AI control problems that are analogous to security problems} might be close enough to the concept of {all AI control problems} to replace it in some people’s minds.
It seems to me that failures of value learning can also be a security problem: if some gap between the AI’s values and the human values is going to cause trouble, the trouble is most likely to show up in some adversarially crafted setting.
I do agree that this is not closely analogous to security problems that cause trouble today.
I also agree that sorting out how to do value elicitation in the long-run is not really a short-term security problem, but I am also somewhat skeptical that it is a critical control problem. I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior, and a failure of this property (e.g. because the AI has a bad conception of “effective control”) is likely to be a security problem.
I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior
This does seem sufficient to solve the immediate problem of AI risk, without compromising the potential for optimizing the world with our detailed values, provided that:
The line between the “us” that maintains control and the AI design is sufficiently blurred (via learning, uploading, prediction, etc., so as to remove the overhead of dealing with physical humans);
“Behave effectively” includes the capability to disable potentially misaligned AIs in the wild;
“Effective control” allows replacing whatever the AI is doing with something else, at any level of detail.
The advantage of introducing the concept of the AI’s detailed values in the initial design is that it protects the setup from manipulation by the AI. If we don’t do that, the control problem becomes much more complicated. In the approach you are talking about, there are initially no explicitly formulated detailed values, only instrumental skills and humans.
So it’s a tradeoff: solving the value elicitation/use problem makes AIs easier to control, but if it’s possible to control an AI anyway, the problem could initially remain unsolved. Still, I’m skeptical that an AI capable enough to prevent AI risk from other AIs can be controlled other than by giving it completely defined values (so that it learns further details by examining the fixed definition).