My understanding of the OP was that there is a robot [...]
That understanding is correct.
Then my question was: what if none of the variables, functions, etc. corresponds to “preferences”? What if “preferences” is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot’s programmer?
I agree that preferences are a way we try to interpret the robot (and a way we humans try to interpret each other). The programmer themselves could label the variables, but it's also possible that another labelling would be clearer or more useful for our purposes. Preferences might turn out to be a "natural" abstraction, once we've put some effort into defining what preferences "naturally" are.
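Here's a minimal toy sketch of that relabelling point (the names `robot_step` and `k` are hypothetical illustrations of mine, not anything from the OP): the programmer's source code contains no variable called "preferences", yet an outside interpreter can still impose a preference-style labelling on it.

```python
# Hypothetical toy example: the programmer's labelling vs. an interpreter's.

def robot_step(temp: float, k: float = 20.0) -> str:
    """Programmer's labelling: `k` is just a tuning constant."""
    return "heat" if temp < k else "idle"

# An interpreter's relabelling of the same white box: read `k` as "the
# temperature the robot prefers". Nothing in the code forces this reading;
# it is one labelling among several that fit the behaviour equally well.
for temp in (15.0, 25.0):
    print(temp, "->", robot_step(temp))  # behaviour is unchanged by the labelling
```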
but “white box” is any source code that produces the same input-output behavior
What that section is saying is that multiple different white boxes can produce the same black-box behaviour (hence we cannot read off the white box simply from the black box).
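To make that concrete, here's a minimal sketch (the function names and toy domain are my hypothetical choices, not from the post): two syntactically different white boxes, one with explicit preference-like internals and one without, that are indistinguishable from the outside.

```python
# Two hypothetical "white boxes" with the same "black box" behaviour.

def white_box_a(x: int) -> int:
    """Goal-directed internals: choose the action maximising a 'utility'."""
    def utility(action: int) -> int:
        return -abs(action - x)  # reads naturally as a preference
    return max(range(10), key=utility)

def white_box_b(x: int) -> int:
    """A bare lookup table: no preference-like structure anywhere."""
    return {i: i for i in range(10)}[x]

# Identical input-output behaviour on the whole (toy) domain, so no amount
# of black-box probing can tell us which source code is inside the robot.
assert all(white_box_a(x) == white_box_b(x) for x in range(10))
```

Whether `white_box_a`'s utility function counts as the robot's "preferences" is then exactly the interpretive question above.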