I agree that the fact that humans are quite good at inferring each other's preferences should give us optimism about value learning. In the framework of rationality with a mistake model, I interpret this post as trying to infer the mistake model from the way that humans infer the preferences of other humans. I’m not sure whether this sidesteps the impossibility result, but it seems plausible that it does.
What would be the source of data for learning a mistake model? It seems like we have to make some assumption about how the data source leads to a mistake model: the data source will probably be a subset of the full human policy, and the impossibility result holds even given access to the full human policy, so a subset of it alone cannot pin down the mistake model.
In https://www.lesswrong.com/posts/rcXaY3FgoobMkH2jc/figuring-out-what-alice-wants-part-ii , I give an example of two algorithms with the same outputs to which we would nonetheless attribute different preferences. This sidesteps the impossibility result because it brings in extra information, namely the internal structure of the algorithm, in a way that is relevant to inferring values.
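To make that flavor concrete, here is a minimal toy sketch of my own (hypothetical names and domain, not code from the linked post): two agents with identical input-output behavior, one of which optimizes toward an explicit internal target while the other is a bare lookup rule. An observer who only sees the policy cannot tell them apart, but their internals invite different preference attributions.

```python
def planner(temp: int) -> str:
    """Scores each action against an explicit internal target and picks
    the best one; its internals contain something preference-like."""
    target = 21  # explicit representation of a preferred temperature
    outcomes = {"heat": temp + 1, "cool": temp - 1}
    return min(outcomes, key=lambda a: abs(outcomes[a] - target))

def lookup(temp: int) -> str:
    """Produces exactly the same outputs via a hard-coded threshold,
    with no preference-like structure inside."""
    return "heat" if temp <= 21 else "cool"

# The two policies are indistinguishable from behavior alone...
assert all(planner(t) == lookup(t) for t in range(0, 40))
# ...yet inspecting the source suggests the planner "wants" 21 degrees,
# while the lookup table does not obviously "want" anything.
```

Whether the internal target deserves to be called a preference is of course exactly the question at issue; the sketch only shows that the policy alone does not settle it.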