You could imagine examining a human brain and seeing how it models other humans. This would let you get some normative assumptions out that could inform a value learning technique.
I would think of this as extracting, from a human brain, an algorithm that infers human preferences. You could then run this algorithm for a long time, in which case it might eventually output Y values, even if you would currently judge the person as having X values.
Human brains likely model other humans by simulating them. The simple normative assumption used is something like "humans are humans", which will not really help you in the way you want, but it leads to this interesting problem:
Imagine a group of five closely interacting humans. Learning values just from person A may run into the problem that a big part of A's motivation is based on A simulating B, C, D, and E (on the same "human" hardware, just incorporating individual differences). In that case, learning the "values" just from A's actions could in principle be more difficult than observing the whole group and trying to learn some "human universals" together with some "human specifics". A different way of thinking about this is to draw a parallel with meta-learning algorithms (e.g. Reptile), but in an IRL frame, as in the sketch below.
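To make that parallel concrete (this is my illustration, not something from the original discussion), here is a minimal sketch in NumPy: a Reptile-style outer loop whose meta-parameters play the role of "human universals", with an inner loop adapting them to each person's "specifics" from that person's preference data. The linear reward model, the synthetic data, and the Bradley-Terry preference loss are all placeholder assumptions, not a real IRL method.

```python
# Hedged sketch, under the stated assumptions: Reptile-style meta-learning over
# five observed people, recovering shared "universals" plus per-person adaptation.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_people, n_pairs = 4, 5, 50

# Toy data: each person's reward = shared universals + individual quirk.
# We observe pairs of trajectory feature vectors and which one the person preferred.
true_universal = rng.normal(size=n_features)
people_data = []
for _ in range(n_people):
    person_theta = true_universal + 0.3 * rng.normal(size=n_features)
    xs_a = rng.normal(size=(n_pairs, n_features))
    xs_b = rng.normal(size=(n_pairs, n_features))
    prefs = (xs_a @ person_theta > xs_b @ person_theta).astype(float)
    people_data.append((xs_a, xs_b, prefs))

def loss_grad(theta, xs_a, xs_b, prefs):
    """Gradient of a Bradley-Terry preference loss for a linear reward model."""
    diff = xs_a - xs_b                        # reward-difference features
    p = 1.0 / (1.0 + np.exp(-(diff @ theta))) # P(first trajectory preferred)
    return diff.T @ (p - prefs) / len(prefs)

def adapt(theta, data, inner_steps=20, lr=0.5):
    """Inner loop: adapt the shared parameters to one person's observations."""
    theta = theta.copy()
    for _ in range(inner_steps):
        theta -= lr * loss_grad(theta, *data)
    return theta

# Reptile-style outer loop: nudge the shared "universals" toward each adapted model.
meta_theta = np.zeros(n_features)
meta_lr = 0.1
for _ in range(200):
    person = people_data[rng.integers(n_people)]
    adapted = adapt(meta_theta, person)
    meta_theta += meta_lr * (adapted - meta_theta)

print("recovered universals (normalized):", meta_theta / np.linalg.norm(meta_theta))
print("true universals (normalized):     ", true_universal / np.linalg.norm(true_universal))
```

The point of the sketch is only structural: inferring "values" from any single person in isolation corresponds to running the inner loop from scratch, while observing the whole group lets the outer loop supply a strong shared prior that each person's data only has to fine-tune.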