Take a friendly AI that does stuff. Then there is a utility function for which that “does stuff” is the single worst thing the AI could have done.
The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.
As I understand it, the impact version of this argument is flawed in the same way (but less blatantly so): something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.
The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.
Indeed, by “friendly AI” I meant “an AI friendly for us”. So yes, I was showing a contrived example of an AI that was friendly, and low impact, from our perspective, but that was not, as you said, universally friendly (or universally low impact).
something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.
In my experience so far, we need to include our values, in part, to define “reasonable” utility functions.
In my experience so far, we need to include our values, in part, to define “reasonable” utility functions.
It seems that an extremely broad set of input attainable functions suffices to capture the “reasonable” functions with respect to which we want to be low impact. For example, “remaining on”, “reward linear in how many blue pixels are observed each time step”, etc. All thanks to instrumental convergence and opportunity cost.
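To make that concrete, here is a minimal sketch of the kind of penalty I have in mind: measure how much an action shifts the agent’s attainable value for a handful of auxiliary utility functions (“remaining on”, “blue pixels observed”, etc.). The function names, toy Q-value stand-ins, and the simple averaging are illustrative assumptions, not the exact formulation from any paper:

```python
from typing import Callable, Dict

State = dict   # hypothetical observation type, just for illustration
Action = str

def impact_penalty(
    aux_q: Dict[str, Callable[[State, Action], float]],  # Q_i(s, a) for each auxiliary utility
    state: State,
    action: Action,
    noop: Action = "noop",
) -> float:
    """Average absolute change in attainable auxiliary value, relative to doing nothing."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in aux_q.values()]
    return sum(diffs) / len(diffs)

# Toy Q-functions for two auxiliary utilities (purely hypothetical numbers):
toy_q = {
    "remaining_on": lambda s, a: 0.0 if a == "press_off_switch" else 1.0,
    "blue_pixels":  lambda s, a: 5.0 if a == "paint_room_blue" else 1.0,
}
print(impact_penalty(toy_q, {}, "paint_room_blue"))  # 2.0: large shift in attainable value
print(impact_penalty(toy_q, {}, "make_coffee"))      # 0.0: roughly preserves attainable values
```

The point of the toy example is just that instrumental convergence and opportunity cost make wildly different auxiliary functions react to the same drastic actions, so a small, crude auxiliary set can still flag them.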