The shallow lookahead seems consistent with the observation in “Grandmaster-Level Chess Without Search”, Ruoss et al 2024, that the gains stop after a few layers and a relatively shallow 8-layer NN saturates. I take that as suggesting that there are optimization/architecture difficulties here in learning better lookahead/planning as an unrolled feedforward NN with no weight-sharing or explicit search scaffolding like a MuZero.
It seems like a NN ought to be able to at least learn a sort of ‘beam search’ by examining multiple possible lines of play in parallel during the forward pass: NNs tend to have far more computational capacity than they need, and we can see in LLMs that you can easily ask them to compute multiple responses in parallel as ‘multiplexed’ computations, so if a NN can do one lookahead, it ought to be able to do several in parallel. That might be something to consider looking for: can you find evidence of multiple moves being considered in parallel? And if not, does changing the architecture to use tied weights add that capability internally?
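The tied-weights variant can be sketched concretely. This is a minimal illustration (not from the original, and not the architecture of the Ruoss et al model): one shared Transformer layer applied repeatedly, universal-transformer-style, versus the usual untied stack of distinct layers; the dimensions and iteration count are arbitrary placeholders.

```python
# Sketch, assuming PyTorch: a weight-tied stack where one
# TransformerEncoderLayer is iterated T times, vs. the untied baseline of
# T distinct layers. Tying reuses the same transition function at every
# step, which is the property hypothesized to let an unrolled lookahead
# behave more like an iterated search.
import torch
import torch.nn as nn

class TiedDepthEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, n_iters=8):
        super().__init__()
        # A single shared layer stands in for n_iters distinct layers.
        self.layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
        self.n_iters = n_iters

    def forward(self, x, n_iters=None):
        # Depth becomes a runtime knob: more iterations = deeper unrolling.
        for _ in range(n_iters or self.n_iters):
            x = self.layer(x)
        return x

tied = TiedDepthEncoder()
untied = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(64, 4, dim_feedforward=256, batch_first=True),
    num_layers=8)

x = torch.randn(2, 77, 64)  # (batch, tokens, d_model); sizes arbitrary
print(tuple(tied(x).shape))  # same shape in and out: (2, 77, 64)

n_tied = sum(p.numel() for p in tied.parameters())
n_untied = sum(p.numel() for p in untied.parameters())
print(n_untied / n_tied)  # untied stack has ~8x the parameters
```

One incidental benefit of tying: the unrolling depth can be varied at inference time, so you could directly test whether extra iterations buy deeper lookahead, something a fixed untied stack cannot do.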
(A possible corollary of this, per Jones’s smooth RL scaling laws of train vs search: either those scaling laws break down at the point where the internal lookahead breaks down, at 8 layers or so, and performance then saturates as the NN can no longer improve; or those scaling laws already incorporate the benefits of internal lookahead, and so could be made much better by any improvement to the internal amortized search.)