Yeah, it’s interesting that this works so well, but I think the best way to think of this is as a middle ground between full model-based RL and model-free RL. Their data efficiency isn’t going to be optimal, because they’re effectively throwing away the information carried by the observations. However, by making that choice, they don’t need to model irrelevant details, so they end up with a very accurate and effective MCTS. As a result, I’d wager that with smaller neural networks or more experience, completely model-free RL would outperform this agent, because all the modelling power can be focused on representing the policy. Likewise, with larger networks or less experience, I would expect this to fall behind MBRL that also predicts observations, because the latter would be more data efficient.
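To make the "throwing away the observations" point concrete, here's a minimal sketch of the kind of latent-model interface being discussed. The function names, toy weights, and stand-in reward/value heads are all hypothetical, not the paper's actual architecture; the point is just that after encoding the root observation, planning unrolls entirely in latent space and never reconstructs an observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy weights standing in for trained networks.
W_repr = rng.standard_normal((4, 8))     # observation -> latent state
W_dyn = rng.standard_normal((8 + 1, 8))  # (latent, action) -> next latent

def representation(obs):
    # Encode the raw observation into a latent state (used only at the root).
    return np.tanh(obs @ W_repr)

def dynamics(state, action):
    # Advance the latent state; predicts a reward but NOT the next observation.
    # This is the information about observations that gets thrown away.
    x = np.concatenate([state, [action]])
    next_state = np.tanh(x @ W_dyn)
    reward = float(next_state.sum())  # stand-in reward head
    return next_state, reward

def prediction(state):
    # Policy logits and value that MCTS would query at this latent node.
    policy_logits = state[:2]
    value = float(state.mean())
    return policy_logits, value

# Planning rollout: observations are never touched after the root encoding.
obs = rng.standard_normal(4)
s = representation(obs)
for a in [0, 1, 0]:
    s, r = dynamics(s, a)
p, v = prediction(s)
```

Because the dynamics function is only trained to predict the quantities planning needs (reward, value, policy), it avoids spending capacity on pixel-level detail, which is the trade-off driving the data-efficiency argument above.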
I would have liked it if they had done more investigation into why they were able to outperform AZ in Go. At the moment, they seem to have left it at one line of speculation.