This monograph by Bertsekas on the relationship between offline RL and online MCTS/search might be interesting: http://www.athenasc.com/Frontmatter_LESSONS.pdf. It argues that we can conceptualise the contribution of MCTS as essentially a single Newton step, starting from the offline-trained value function, towards the solution of the Bellman equation. If this is actually the case (I haven't worked through all the details yet), then it could probably be used to derive some kind of bound on the improvement / divergence you can get once you add online planning to a model-free policy.
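To make the Newton-step reading concrete, here is a minimal sketch on a small random tabular MDP. It is my own illustration of the classical "policy improvement is a Newton step" view, not code from the monograph, and it only covers exact one-step lookahead (no sampling or tree search, so it is not MCTS itself). The MDP, the noise level, and all function names (`bellman_backup`, `policy_value`, `newton_step`) are assumptions made up for the example. It checks that the value of the policy that is greedy with respect to an offline estimate `J_tilde` coincides with one Newton iterate on F(J) = J - TJ started at `J_tilde`, and compares the result against the standard one-step lookahead bound 2γ/(1-γ)·‖J_tilde - J*‖∞, which may or may not be the bound the monograph has in mind.

```python
import numpy as np

def bellman_backup(J, P, r, gamma):
    """Return (T J, greedy policy). P has shape (A, S, S), r has shape (A, S)."""
    Q = r + gamma * (P @ J)          # Q[a, s] = r[a, s] + gamma * sum_s' P[a, s, s'] J[s']
    return Q.max(axis=0), Q.argmax(axis=0)

def policy_value(mu, P, r, gamma):
    """Exact value of a stationary policy mu: solve (I - gamma P_mu) J = r_mu."""
    idx = np.arange(len(mu))
    P_mu, r_mu = P[mu, idx], r[mu, idx]
    return np.linalg.solve(np.eye(len(mu)) - gamma * P_mu, r_mu)

def newton_step(J, P, r, gamma):
    """One Newton iterate on F(J) = J - T J, linearising T at J.
    T is piecewise linear; its slope at J is gamma * P_mu for the greedy mu."""
    TJ, mu = bellman_backup(J, P, r, gamma)
    idx = np.arange(len(J))
    jacobian = np.eye(len(J)) - gamma * P[mu, idx]
    return J - np.linalg.solve(jacobian, J - TJ), mu

# Hypothetical random MDP: 3 actions, 6 states, discount 0.9.
rng = np.random.default_rng(0)
A, S, gamma = 3, 6, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)    # normalise rows into transition probabilities
r = rng.random((A, S))

# Reference solution J* via value iteration.
J_star = np.zeros(S)
for _ in range(2000):
    J_star, _ = bellman_backup(J_star, P, r, gamma)

# Stand-in for an imperfect offline-trained value function.
J_tilde = J_star + rng.normal(scale=0.5, size=S)

J_newton, mu = newton_step(J_tilde, P, r, gamma)
J_mu = policy_value(mu, P, r, gamma)  # value of the one-step lookahead policy

err_before = np.max(np.abs(J_tilde - J_star))
err_after = np.max(np.abs(J_mu - J_star))
bound = 2 * gamma / (1 - gamma) * err_before

print("Newton iterate equals lookahead policy value:", np.allclose(J_newton, J_mu))
print("error before:", err_before, "error after:", err_after, "bound:", bound)
```

Note that the generic bound is loose (a factor of 18 at γ = 0.9), so on its own it does not rule out the lookahead policy being worse than the offline estimate suggests. As I understand it, the Newton view is interesting precisely because it suggests much faster (superlinear) improvement once `J_tilde` is close enough to J*.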