Just to make sure I understand: You’re arguing that even if we somehow solve the easy goal inference problem, there will still be some aspect of values we don’t capture?
Yeah. I think a creature behaving just like me doesn’t necessarily have the exact same internal experiences. Across all possible creatures, there are degrees of freedom in internal experiences that aren’t captured by actions. Some of these might be value-relevant.
Yeah, in ML language, you’re describing the unidentifiability problem in inverse reinforcement learning—for any behavior, there are typically many reward functions for which that behavior is optimal.
Though another way this could be true is if “internal experience” depends on what algorithm you use to generate your behavior, and “optimize a learned reward” doesn’t meet the bar. (For example, I don’t think a giant lookup table that emulates my behavior is having the same experience that I am.)
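
To make the unidentifiability point above concrete, here is a minimal sketch (the action names and reward values are illustrative assumptions, not anything from the dialogue): two different reward functions over the same one-step choice rationalize exactly the same optimal behavior, so observing that behavior cannot tell them apart.

```python
# Minimal sketch of reward unidentifiability in a one-step setting.
# The action names and reward values are illustrative assumptions.

ACTIONS = ["left", "right"]

# Two candidate reward functions the observed agent might be optimizing.
# reward_b is a positive affine transform of reward_a (2*r + 5), so it
# ranks the actions identically.
reward_a = {"left": 0.0, "right": 1.0}
reward_b = {"left": 5.0, "right": 7.0}

def optimal_action(reward):
    """Return the action with the highest reward under this reward function."""
    return max(ACTIONS, key=lambda a: reward[a])

# Both reward functions rationalize exactly the same observed behavior,
# so behavior alone cannot identify which one the agent "really" has.
assert optimal_action(reward_a) == optimal_action(reward_b) == "right"
print("Both reward functions make 'right' optimal.")
```

The same ambiguity persists in full MDPs: for instance, positively rescaling the reward leaves the optimal policy unchanged, so optimal behavior never pins down a unique reward function.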