I can’t find a reference on how to test whether an inferred (or simply given) reward function for a system can be used to predict decently well what the system actually does.
What I can find are references for the usual IRL/preference-learning setting, where there is a true reward function, known to us but unknown to the inference system; the inferred reward function is then evaluated by training a policy on it and measuring how much reward it gets (or how much regret it incurs) under the true reward function.
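For concreteness, here is a minimal sketch of that evaluation loop on a toy tabular MDP with known dynamics. Everything in it (the random MDP, the noisy copy of the true reward, the helper names) is an illustrative assumption of mine, not taken from any particular paper or library:

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, iters=500):
    """Greedy policy for reward r[s, a] under dynamics P[s, a, s']."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V        # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)          # deterministic policy pi[s]

def policy_return(P, r, pi, gamma=0.95, iters=500):
    """Expected discounted return of pi under reward r, uniform start state."""
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        V = r[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
    return V.mean()

# Toy setup: random dynamics, and an "inferred" reward that is a noisy
# copy of the true one (standing in for the output of some IRL method).
rng = np.random.default_rng(0)
S, A = 10, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r_true = rng.normal(size=(S, A))
r_inferred = r_true + 0.3 * rng.normal(size=(S, A))

pi_star = value_iteration(P, r_true)
pi_hat = value_iteration(P, r_inferred)
# The usual metric: regret of the inferred-reward policy on the true reward.
regret = policy_return(P, r_true, pi_star) - policy_return(P, r_true, pi_hat)
print(f"regret on the true reward: {regret:.3f}")
```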
But that’s a good setting for checking whether the reward is good for learning to do the task, not for checking whether it is good for predicting what this specific system will do.
The best idea I have right now is to pick a bunch of different initial conditions, train a policy on the inferred reward from each of them, mix all the policies together to get a distribution over actions at each state, and compare that with what the system actually does. It seems decent enough, but I would really like to know if someone has done something similar in the literature.
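And here is a rough sketch of that check, reusing the toy MDP and helpers from the sketch above. The Q-learning ensemble, the one-hot mixing, and the per-state total-variation comparison are all stand-ins I made up for illustration; in particular, the true-reward-optimal policy plays the role of "what the system actually does", since there is no real system here:

```python
def q_learning(P, r, seed, gamma=0.95, eps=0.1, lr=0.1, steps=20_000):
    """Tabular Q-learning on reward r, from a seed-dependent initial condition."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A))
    s = rng.integers(S)                          # random initial state
    for _ in range(steps):
        a = rng.integers(A) if rng.random() < eps else Q[s].argmax()
        s2 = rng.choice(S, p=P[s, a])
        Q[s, a] += lr * (r[s, a] + gamma * Q[s2].max() - Q[s, a])
        s = s2
    return Q

def one_hot(pi, A):
    """Turn a deterministic policy into a per-state action distribution."""
    D = np.zeros((len(pi), A))
    D[np.arange(len(pi)), pi] = 1.0
    return D

# Train from several initial conditions on the inferred reward, then mix
# the resulting policies into one action distribution per state.
ensemble = [one_hot(q_learning(P, r_inferred, seed=k).argmax(axis=1), A)
            for k in range(10)]
mixture = np.mean(ensemble, axis=0)              # mixture[s] sums to 1

# Compare with the system's empirical action frequencies (here: the
# true-reward-optimal policy as a stand-in for the observed system).
observed = one_hot(value_iteration(P, r_true), A)
tv = 0.5 * np.abs(mixture - observed).sum(axis=1)  # per-state total variation
print(f"mean TV distance between predicted and observed behavior: {tv.mean():.3f}")
```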
(Agents and Devices points in the right direction, but it’s focused on predicting which of the agent mixture or the device mixture is more probable under the posterior, which is a different problem.)