I don’t think you can learn an agent’s desires from policy, because an agent can have “loose wires”—faulty connections between the desire part and the policy part. Extreme case: imagine agent A with desires X and Y, locked inside a dumb unfeeling agent B which only allows actions maximizing X to affect behavior, while actions maximizing Y get ignored. Then desire Y can’t be learned from the policy of agent B. Humans could be like that: we have filters to stop ourselves from acting on, or even talking about, certain desires. Behind these filters we have “private desires”, which can be learned from brain structure but not from policy. Even if these desires aren’t perfectly “private”, the fastest way to learn them still shouldn’t rely on policy alone.
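For concreteness, here is a minimal Python sketch of that thought experiment; the desires, actions, and scores are all made up, but it shows how an agent that internally holds desires X and Y can produce exactly the same policy as one that holds only X, because the Y-wire never reaches action selection:

```python
# A minimal sketch of the "loose wires" thought experiment. The desires,
# actions, and scores below are hypothetical. Inner agent A holds desires X
# and Y; outer agent B only lets the X channel influence action selection,
# so B's observable policy is indistinguishable from an agent with no Y.

ACTIONS = ["work", "rest", "explore"]

# Hypothetical internal desires: how much each action satisfies that desire.
desire_x = {"work": 1.0, "rest": 0.2, "explore": 0.5}
desire_y = {"work": 0.1, "rest": 0.3, "explore": 1.0}

def policy_b(connected_desires):
    """Outer agent B: picks the action maximizing only the connected desires."""
    return max(ACTIONS, key=lambda a: sum(d[a] for d in connected_desires))

# Agent A's Y-wire is "loose": only desire_x reaches the policy machinery.
action_with_hidden_y = policy_b([desire_x])
# An agent that genuinely lacks desire Y yields the same observable choice.
action_without_y = policy_b([desire_x])

assert action_with_hidden_y == action_without_y  # identical policies
print(action_with_hidden_y)  # -> "work"; desire_y is invisible in the policy
```

Inspecting the internals (the `desire_y` structure) reveals Y, but no amount of watching `policy_b` will.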
You mean that you can ask the agent whether it wants just X, and it will say “I want Y also,” but it will never act to pursue Y? That sounds like what Robin Hanson discusses in The Elephant in the Brain, and he largely dismisses such claimed preferences in favor of caring about the actual, revealed desires.
I’m confused about why we’d expect this case to occur with Y being a real goal we should pursue, rather than a false pretense. And if it were the case, how would brain inspection (without manipulation) let us know it?
(There was a longer comment here but I found a way to make it shorter)
I think people can be reluctant to reveal some of their desires, by word or deed. So looking at policy isn’t the most natural way to learn these desires; looking inside the black box makes more sense.
Fair point, but I don’t think that addresses the final claim, which is that even if you are correct, analyzing the black box isn’t enough without actually playing out counterfactuals.
Just to make sure I understand: You’re arguing that even if we somehow solve the easy goal inference problem, there will still be some aspect of values we don’t capture?
Yeah. I think a creature behaving just like me doesn’t necessarily have the exact same internal experiences. Across all possible creatures, there are degrees of freedom in internal experiences that aren’t captured by actions. Some of these might be value-relevant.
Yeah, in ML language, you’re describing the unidentifiability problem in inverse reinforcement learning—for any behavior, there are typically many reward functions for which that behavior is optimal.
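In case a concrete toy example helps, here is a minimal Python sketch (the chain MDP and reward numbers are made up) of two quite different reward functions for which the same policy is optimal, so the policy alone cannot tell them apart:

```python
# A minimal sketch of IRL unidentifiability on a hypothetical 3-state
# deterministic chain MDP where the agent can step "left" or "right".
# Two different reward functions induce the same optimal policy, so
# observing that policy cannot recover which reward the agent has.

import numpy as np

N_STATES, GAMMA = 3, 0.9
ACTIONS = {"left": -1, "right": +1}

def optimal_policy(reward):
    """Value iteration on the chain, then the greedy action at each state."""
    values = np.zeros(N_STATES)
    for _ in range(200):
        for s in range(N_STATES):
            values[s] = max(
                reward[s] + GAMMA * values[min(max(s + move, 0), N_STATES - 1)]
                for move in ACTIONS.values()
            )
    return [
        max(ACTIONS, key=lambda a: values[min(max(s + ACTIONS[a], 0), N_STATES - 1)])
        for s in range(N_STATES)
    ]

reward_a = np.array([0.0, 0.0, 1.0])    # reward only at the rightmost state
reward_b = np.array([2.0, 3.0, 10.0])   # very different rewards, same policy

print(optimal_policy(reward_a))  # ['right', 'right', 'right']
print(optimal_policy(reward_b))  # ['right', 'right', 'right']
```

Collecting more demonstrations from the same policy doesn’t break the tie; the ambiguity lives in the reward function, not in the amount of behavioral data.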
Though another way this could be true is if “internal experience” depends on what algorithm you use to generate your behavior, and “optimize a learned reward” doesn’t meet the bar. (For example, I don’t think a giant lookup table that emulates my behavior is having the same experience that I am.)
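A toy illustration of the lookup-table point (the “reasoning” below is a made-up stand-in for a real decision process): two implementations with identical input-output behavior but very different internal algorithms.

```python
# Hypothetical example: an agent that decides by (toy) deliberation versus
# one that just replays a precomputed table. Their behavior matches exactly
# on this domain, but only one runs anything resembling a decision process.

def reasoning_agent(observation: int) -> str:
    # "Thinks": weighs a made-up benefit against a made-up cost before acting.
    benefit, cost = observation * 2, 3
    return "act" if benefit > cost else "wait"

# Build the lookup table by running the reasoning agent on every input once.
INPUTS = range(10)
lookup_table = {obs: reasoning_agent(obs) for obs in INPUTS}

def table_agent(observation: int) -> str:
    # No deliberation at all: just an index into stored answers.
    return lookup_table[observation]

# Behaviorally indistinguishable on this domain...
assert all(reasoning_agent(o) == table_agent(o) for o in INPUTS)
# ...yet the internal algorithms, and plausibly the internal experiences, differ.
```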