When we decompose the sequence of events E into laws L and initial conditions C, the laws don't just calculate E from C in one step. Rather, L(C)=E1, L(E1)=E2, and so on: L is a function from events to events, and the sequence E contains many input-output pairs of L.
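Here is a minimal sketch of that point, with a hypothetical toy law of motion (not from the original text): unrolling L over the observed sequence yields an (input, output) instance of L at every step, so the data constrain L many times over.

```python
def L(event: int) -> int:
    """Toy law of motion: events -> events."""
    return 2 * event + 1

C = 0                              # initial condition
E = [C]
for _ in range(5):                 # unroll the laws: L(C)=E1, L(E1)=E2, ...
    E.append(L(E[-1]))

# Every adjacent pair in E is an observed (input, output) instance of L.
training_pairs = list(zip(E[:-1], E[1:]))
print(training_pairs)              # [(0, 1), (1, 3), (3, 7), (7, 15), (15, 31)]
```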
By contrast, when we decompose a policy π into a planner P and a reward R, P is a function from rewards to policies. With the setup of the problem as-is, we have data on many (s, a) pairs (behaviour), so we can infer π with high accuracy. But we only ever see one policy, and we never explicitly see rewards. In that case we do indeed get the empty reward and a degenerate planner with ∀r [P(r)=π]. To correctly infer R and P, we would have to see P applied to other rewards, and the policies that result.
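A minimal sketch of the degenerate decomposition, with hypothetical names and types: a "planner" that ignores its reward argument reproduces the single observed policy for every reward, so the behavioural data alone cannot rule it out.

```python
from typing import Dict

State, Action = str, str
Policy = Dict[State, Action]
Reward = Dict[State, float]

pi_observed: Policy = {"s0": "a1", "s1": "a0"}   # inferred from (s, a) data

empty_reward: Reward = {}                        # the "empty" reward

def degenerate_planner(r: Reward) -> Policy:
    """P(r) = pi for every r: consistent with all the behavioural data."""
    return pi_observed

# Whatever reward we feed in, this planner returns the observed policy,
# so the data cannot distinguish (degenerate_planner, empty_reward) from
# the intended (P, R). We would need to see P run on other rewards.
for r in (empty_reward, {"s0": 1.0}, {"s1": -5.0}):
    assert degenerate_planner(r) == pi_observed
```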