When we decompose the sequence of events E into laws L and initial conditions C, the laws don't just calculate E from C in one step. Rather, L(C)=E1, L(E1)=E2, and so on: L is a function from events to events, and the sequence E contains many input-output pairs of L.
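Here is a minimal sketch of that point, with a hypothetical toy law of motion (not from the original text): unrolling L over the observed sequence yields an (input, output) instance of L at every step, so the data constrain L many times over.

```python
def L(event: int) -> int:
    """Toy law of motion: events -> events."""
    return 2 * event + 1

C = 0                              # initial condition
E = [C]
for _ in range(5):                 # unroll the laws: L(C)=E1, L(E1)=E2, ...
    E.append(L(E[-1]))

# Every adjacent pair in E is an observed (input, output) instance of L.
training_pairs = list(zip(E[:-1], E[1:]))
print(training_pairs)              # [(0, 1), (1, 3), (3, 7), (7, 15), (15, 31)]
```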
By contrast, when we decompose a policy π into a planner P and a reward R, P is a function from rewards to policies. With the setup of the problem as-is, we have data on many (s, a) pairs (behaviour), so we can infer π with high accuracy. But we only ever see one policy, and we never explicitly see rewards. In that case we do indeed get the empty reward and a degenerate planner with ∀r [P(r)=π]. To correctly infer R and P, we would have to see P applied to other rewards, and the policies that result.
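A minimal sketch of the degenerate decomposition, with hypothetical names and types: a "planner" that ignores its reward argument reproduces the single observed policy for every reward, so the behavioural data alone cannot rule it out.

```python
from typing import Dict

State, Action = str, str
Policy = Dict[State, Action]
Reward = Dict[State, float]

pi_observed: Policy = {"s0": "a1", "s1": "a0"}   # inferred from (s, a) data

empty_reward: Reward = {}                        # the "empty" reward

def degenerate_planner(r: Reward) -> Policy:
    """P(r) = pi for every r: consistent with all the behavioural data."""
    return pi_observed

# Whatever reward we feed in, this planner returns the observed policy,
# so the data cannot distinguish (degenerate_planner, empty_reward) from
# the intended (P, R). We would need to see P run on other rewards.
for r in (empty_reward, {"s0": 1.0}, {"s1": -5.0}):
    assert degenerate_planner(r) == pi_observed
```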