Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it’s always an inference about what’s “really” going on. Of course, this is true even of the boundaries of black boxes, so it’s a fully general problem. And I think that suggests it’s not a problem except insofar as we have normal problems setting up correspondence between map and territory.
My understanding of the OP was that there is a robot, and the robot has source code, and “black box” means we don’t see the source code but get an impenetrable binary and can do tests of what its input-output behavior is, and “white box” means we get the source code and run it step-by-step in debugging mode but the names of variables, functions, modules, etc. are replaced by random strings. We can still see the structure of the code, like “module A calls module B”. And “labeled white box” means we get the source code along with well-chosen names of variables, functions, etc.
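To make the three levels of access concrete, here is a toy sketch (all names and the toy decision rule are invented for illustration): the same "robot" source code as a labeled white box and as an obfuscated white box. A black-box observer sees neither version, only input-output behavior.

```python
# Labeled white box: source code with well-chosen names.
def choose_action(observation):
    preference = score(observation)   # an internal value with a meaningful name
    return "approach" if preference > 0 else "avoid"

def score(observation):
    return observation - 5

# Obfuscated white box: identical structure ("f_x9q calls g_p3"),
# but all names replaced by random strings.
def f_x9q(a_k2):
    v_m7 = g_p3(a_k2)
    return "approach" if v_m7 > 0 else "avoid"

def g_p3(a_k2):
    return a_k2 - 5

# A black-box observer can only compare input-output behavior:
assert all(choose_action(x) == f_x9q(x) for x in range(10))
```

The call structure is visible in both white-box versions; only the labeled one hands you an interpretation of what the internal value means.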
Then my question was: what if none of the variables, functions, etc. corresponds to “preferences”? What if “preferences” is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot’s programmer?
But now this conversation is suggesting that I’m not quite understanding it right. “Black box” is what I thought, but “white box” is any source code that produces the same input-output behavior—not necessarily the robot’s actual source code—and that includes source code that does extra pointless calculations internally. And then my question doesn’t really make sense, because whatever “preferences” is, I can come up with a white-box model wherein “preferences” is calculated and then immediately deleted, such that it’s not part of the input-output behavior.
Something like that?
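That counterexample can be sketched directly (a toy example; the functions and the "preferences" computation are invented for illustration): two white boxes with identical input-output behavior, one of which computes a "preferences" value that is immediately discarded.

```python
def robot_plain(observation):
    # white box 1: no internal "preferences" at all
    return observation * 2

def robot_with_vestigial_preferences(observation):
    # white box 2: computes "preferences"...
    preferences = -observation
    # ...then immediately deletes it, so it never affects the output
    del preferences
    return observation * 2

# No black-box test can distinguish the two:
assert all(robot_plain(x) == robot_with_vestigial_preferences(x)
           for x in range(100))
```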
My understanding of the OP was that there is a robot [...]
That understanding is correct.
Then my question was: what if none of the variables, functions, etc. corresponds to “preferences”? What if “preferences” is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot’s programmer?
I agree that “preferences” is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it’s also possible that another labelling would be clearer or more useful for our purposes. It might be a “natural” abstraction, once we’ve put some effort into defining what preferences “naturally” are.
but “white box” is any source code that produces the same input-output behavior
What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).
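A minimal sketch of that non-injectivity (a toy example with invented functions): two structurally different white boxes—one recursive, one iterative, with different call graphs—that realize exactly the same black-box behavior, so input-output tests alone cannot tell you which white box you are dealing with.

```python
def factorial_recursive(n):
    # white box 1: recursive structure
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def factorial_iterative(n):
    # white box 2: iterative structure, different internal organization
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

# Identical black-box behaviour on every tested input:
assert all(factorial_recursive(n) == factorial_iterative(n)
           for n in range(10))
```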