re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming #5, we still cannot show that there isn’t some $U_1 \neq U_2$ such that $\pi^*(U_1, \zeta) = \pi^*(U_2, \zeta)$.
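For concreteness, here is a minimal toy sketch of that degeneracy (the MDP, the affine transform, and the value-iteration helper are all illustrative assumptions on my part, not anything from #5): two reward functions $U_1 \neq U_2$ that induce exactly the same optimal policy, so observing the policy alone cannot distinguish them.

```python
# Toy illustration (purely illustrative MDP, not from the original discussion):
# two distinct reward functions U1 != U2 whose optimal policies coincide, so the
# optimal policy alone cannot identify which U generated it.
import numpy as np

n_states, gamma = 4, 0.9
# Deterministic transitions: action 0 stays put, action 1 moves to the next state.
next_state = np.array([[s, (s + 1) % n_states] for s in range(n_states)])

U1 = np.array([0.0, 1.0, 2.0, 3.0])   # reward received in the landing state
U2 = 2.0 * U1 + 5.0                   # a different utility function, U2 != U1

def optimal_policy(U, iters=500):
    """Value iteration, then greedy action selection."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = U[next_state] + gamma * V[next_state]  # shape (n_states, n_actions)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

print(optimal_policy(U1))  # [1 1 1 0]
print(optimal_policy(U2))  # the same policy, despite U1 != U2
```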
(And as pointed out elsewhere, it isn’t Stuart’s thesis; it’s a well-known and basic result in the decision theory / economics / philosophy literature.)
re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming.
You misunderstand the intent. We’re talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Of course, if the policy you’re observing is optimal, then it’s trivial to do so by following the same policy. But using my approach, we might be able to extend this into results like “the policy you’re observing is optimal w.r.t. a certain level of computational complexity, and your goal is to produce an optimal policy w.r.t. a higher level of computational complexity.”
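To make the “trivial to do so by following the same policy” point concrete, here is a hedged sketch (the toy MDP and helper functions are my own assumptions): an imitator that only ever sees the demonstrator’s optimal policy, never the hidden U, still attains the optimal value with respect to that hidden U.

```python
# Toy illustration (assumed MDP and helpers, purely for the point above): copying
# the observed optimal policy attains the optimal value w.r.t. a hidden utility U,
# even though that U is never inferred.
import numpy as np

n_states, gamma = 4, 0.9
next_state = np.array([[s, (s + 1) % n_states] for s in range(n_states)])
U_hidden = np.random.rand(n_states)  # the unknown utility; the imitator never sees it

def plan(U, iters=500):
    """Value iteration; returns the optimal value V* and a greedy optimal policy."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = U[next_state] + gamma * V[next_state]
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

def evaluate(policy, U, iters=500):
    """Expected discounted return of `policy` under reward U (policy evaluation)."""
    V = np.zeros(n_states)
    for _ in range(iters):
        ns = next_state[np.arange(n_states), policy]
        V = U[ns] + gamma * V[ns]
    return V

V_star, observed = plan(U_hidden)   # the demonstrator's optimal policy
imitator = observed.copy()          # "follow the same policy" -- no reward inference
assert np.allclose(evaluate(imitator, U_hidden), V_star)
```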
(Btw I think the formal statement I gave for #5 is false, but there might be an alternative version that works.)
(And as pointed out elsewhere, it isn’t Stuart’s thesis; it’s a well-known and basic result in the decision theory / economics / philosophy literature.)
I am referring to this and related work by Armstrong.