Stuart, by "$P_{rt}(R \mid D_{1:j})$ is complex" are you referring to their using $R = R(\cdot, E[\Theta_{R^*} \mid D_{1:j}])$ as the estimated reward function?
Also, what did you think of their argument that their agents have no incentive to manipulate their beliefs because they evaluate future trajectories based on their current beliefs about how likely they are? Does that suffice to implement eq. (1) from your motivated value selection paper?
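For concreteness, here is a minimal sketch of the construction I have in mind (my own illustration, not from their paper; the linear-feature reward and all function names are assumptions): the agent keeps a posterior over the reward parameters $\Theta_{R^*}$ given the data $D_{1:j}$, and scores states with the reward function evaluated at the posterior mean.

```python
import numpy as np

# Toy assumption: reward is linear in state features with unknown parameters theta,
# R(s, theta) = theta . phi(s); the agent uses the posterior mean of theta.

def phi(state):
    """Feature map for a state (toy 2-d features)."""
    return np.array([state, state ** 2], dtype=float)

def posterior_mean_theta(posterior_samples):
    """E[Theta_R* | D_{1:j}], approximated by averaging posterior samples."""
    return np.mean(posterior_samples, axis=0)

def estimated_reward(state, posterior_samples):
    """R(state, E[Theta_R* | D_{1:j}]): the reward evaluated at the posterior mean."""
    theta_hat = posterior_mean_theta(posterior_samples)
    return float(theta_hat @ phi(state))

# Example: a few posterior samples over theta after observing D_{1:j}.
samples = np.array([[1.0, -0.10], [0.8, -0.20], [1.2, -0.15]])
print(estimated_reward(2.0, samples))
```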
Stuart, by "$P_{rt}(R \mid D_{1:j})$ is complex" are you referring to...
I mean that defining $P_{rt}$ can be done in many different ways, and hence has a lot of contingent structure. In contrast, in $P_{lp}(R \mid D_{1:j}, \rho)$, the $\rho$ is a complex distribution on $R$, conditional on $D_{1:j}$; hence $P_{lp}$ itself is trivial and just encodes "apply $\rho$ to $R$ and $D_{1:j}$ in the obvious way".
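Roughly, the contrast is something like this (a toy sketch of my own, with hypothetical names, not the definitions from either paper): any concrete $P_{rt}$ has to bake in a particular modelling choice, whereas $P_{lp}$ is just a thin wrapper that applies whatever conditional distribution $\rho$ it is handed.

```python
import math
from typing import Callable, Dict, Sequence

Reward = str          # label for a candidate reward function
Data = Sequence[int]  # stand-in for the observation history D_{1:j}

# P_rt: one of many possible contingent constructions -- here a softmax over
# hand-chosen scores computed from the data. Pick a different scoring rule and
# you get a different P_rt; the structure lives inside this function.
def p_rt(reward: Reward, data: Data) -> float:
    scores = {"R1": float(sum(data)), "R2": float(len(data))}  # arbitrary modelling choice
    z = sum(math.exp(v) for v in scores.values())
    return math.exp(scores[reward]) / z

# P_lp: a trivial wrapper -- all the structure lives in the supplied rho, which
# maps the data to a distribution over rewards. P_lp just applies it.
def p_lp(reward: Reward, data: Data,
         rho: Callable[[Data], Dict[Reward, float]]) -> float:
    return rho(data)[reward]

# Example rho: whatever complex learning process you like, packaged as a function.
def example_rho(data: Data) -> Dict[Reward, float]:
    p1 = 1.0 / max(sum(data), 1)
    return {"R1": p1, "R2": 1.0 - p1}

print(p_rt("R1", [1, 2, 3]), p_lp("R1", [1, 2, 3], example_rho))
```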