Modularization is super helpful for simplifying things.
The best modularization for simplification will likely not correspond to the best modularization for distinguishing preferences from the other parts of the agent’s algorithm (that’s the “Occam’s razor” result).
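To make that concrete, here’s a toy sketch (all names and numbers are mine, purely illustrative) of the non-identifiability behind that result: the same observed behaviour admits different (preferences, planner) decompositions of comparable simplicity, so simplicity alone doesn’t pick out the “right” preferences.

```python
# Two (reward, planner) decompositions of the same observed behaviour.
# All names are illustrative; this is a sketch, not the formal result.

ACTIONS = ["left", "right"]

def observed_behaviour(state):
    # What we can actually see the agent do: always move right.
    return "right"

# Decomposition A: the agent prefers moving right, and plans rationally.
def reward_a(state, action):
    return 1 if action == "right" else -1

def planner_a(reward, state):
    return max(ACTIONS, key=lambda a: reward(state, a))  # maximise reward

# Decomposition B: the agent prefers moving left, and plans anti-rationally.
def reward_b(state, action):
    return -reward_a(state, action)

def planner_b(reward, state):
    return min(ACTIONS, key=lambda a: reward(state, a))  # minimise reward

# Both decompositions reproduce the observed behaviour exactly, and neither is
# obviously more complex, so "pick the simplest decomposition" doesn't tell us
# which preferences the agent actually has.
for state in range(3):
    assert planner_a(reward_a, state) == observed_behaviour(state)
    assert planner_b(reward_b, state) == observed_behaviour(state)
```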
Let’s say I’m trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides.
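A hypothetical mini-version of that hockey description (names and numbers are mine): if “which net are we attacking?” lives in its own preference module, the side switch is a one-line change and the rest of the team algorithm stays untouched.

```python
# The preference ("which net are we attacking?") is its own module,
# so describing the second-period side switch only means flipping it.
from dataclasses import dataclass

@dataclass
class TeamPreference:
    target_net_x: float  # position of the net this team is attacking

def choose_direction(puck_x: float, pref: TeamPreference) -> str:
    # The non-preference part of the team algorithm: push the puck toward the target net.
    return "skate right" if pref.target_net_x > puck_x else "skate left"

first_period = TeamPreference(target_net_x=200.0)
second_period = TeamPreference(target_net_x=0.0)  # teams switch sides: only the preference changes

print(choose_direction(100.0, first_period))   # skate right
print(choose_direction(100.0, second_period))  # skate left
```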
The fact that humans find an abstraction useful is evidence that an AI will find it useful as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.
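For instance, here’s a small sketch (hypothetical names, toy planner) of why the preference abstraction has predictive power: hold the goal fixed, give the agent new information, and you can predict how the plan changes.

```python
# The goal stays fixed; new information arrives; the plan changes predictably.
# All names are illustrative.

def plan_route(goal: str, known_closures: set) -> list:
    # Toy planner: shortest route to the goal that avoids known road closures.
    routes = {"office": [["highway"], ["back_road", "bridge"]]}
    open_routes = [r for r in routes[goal] if not set(r) & known_closures]
    return min(open_routes, key=len)

goal = "office"  # the preference: unchanged throughout
print(plan_route(goal, set()))         # ['highway']
print(plan_route(goal, {"highway"}))   # ['back_road', 'bridge'] -- new info changed the plan, not the goal
```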
Humans have a theory of mind that makes certain types of modularization easier. That doesn’t mean that the same modularization is simple for an agent that doesn’t share that theory of mind.
Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios: in the optimistic scenario, preferences, human theory of mind, and all the other elements are easy to deduce (there’s an informal equivalence result: if one of those is easy to deduce, all the others are too).
So we need to figure out if we’re in the optimistic or the pessimistic scenario.