I had an idea for a prior over planners (the ‘p’ part of (p, R)) that I think would remove the no-free-lunch result. For a given planner, let its “score” be the average reward the agent gets for a randomly selected reward function (with a simplicity prior over reward functions). Let the prior probability of a particular planner be a function of this score, for example by applying a Boltzmann distribution to it. I would call this an evolutionary prior: planners that typically get higher reward given a randomly assigned reward function are more likely to exist. One could also randomize the transition function to see how planners do under arbitrary world-dynamics, but it doesn’t seem particularly problematic, and may even be beneficial, to place a higher prior probability on planners that are unusually well-adapted to generating good policies given the particular dynamics of the world we’re in.
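As a rough sketch of what I have in mind (the notation here is mine, not from the original no-free-lunch setup: $V(p, R)$ is the expected reward an agent with planner $p$ and reward function $R$ actually obtains, $P_{\text{simp}}$ is the simplicity prior over reward functions, and $\beta$ is a temperature parameter):

$$\mathrm{score}(p) \;=\; \sum_{R} P_{\text{simp}}(R)\, V(p, R), \qquad \Pr(p) \;\propto\; \exp\!\big(\beta \cdot \mathrm{score}(p)\big).$$

On this formulation, larger $\beta$ concentrates the prior on planners that are more “fit” in the evolutionary sense above, while $\beta \to 0$ recovers a uniform prior over planners.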