You mean that you can ask the agent whether it wants just X, and it will say “I want Y also,” but it will never act to pursue Y? That sounds like what Robin Hanson discusses in The Elephant in the Brain, and he largely dismisses the claimed preferences in favor of caring about the actual desires.
I’m confused about why we would expect this case to arise with Y being a real goal we should pursue, rather than a false pretense. And if it did arise, how would brain inspection (without manipulation) allow us to know it?
(There was a longer comment here but I found a way to make it shorter)
I think people can be reluctant to reveal some of their desires, by word or deed. So looking at policy isn’t the most natural way to learn these desires; looking inside the black box makes more sense.
Fair point, but I don’t think that addresses the final claim, which is that even if you are correct, analyzing the black box isn’t enough without actually playing out counterfactuals.