For a somewhat loose conception of “planning,” it is conceivable that a feedforward network could plan, but (I think) only if it had a really huge number of layers.
Search doesn’t buy you that much, remember. After relatively few nodes, you’ve already gotten much of the benefit from finetuning the value estimates (e.g. AlphaZero, or the MuZero appendix). And you can do weight-tying to repeat feedforward layers, or just repeat layers/the model. (Is AlphaFold2 recurrent? Is ALBERT recurrent? A VIN? A diffusion model? A neural ODE?) This is probably why Jones finds that distilling MCTS runtime search into search-less feedforward parameters comes, empirically, at a favorable exchange rate, one I wouldn’t call ‘really huge’.
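To make the weight-tying point concrete, here is a minimal NumPy-only sketch (my own illustration, not any of the cited architectures): a single shared layer applied repeatedly, so the effective depth grows arbitrarily while the parameter count stays fixed. The width `d`, the residual-plus-tanh layer, and the depths chosen are all arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden width (arbitrary for illustration)

# A single shared layer's parameters: W and b are the ONLY parameters,
# no matter how many times the layer is applied.
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
b = np.zeros(d)

def shared_layer(x):
    # Residual update with the *same* W, b at every application
    # (the weight-tying trick used by ALBERT-style models).
    return x + np.tanh(x @ W + b)

def run(x, depth):
    for _ in range(depth):
        x = shared_layer(x)
    return x

x = rng.normal(size=(4, d))
shallow = run(x, depth=2)
deep = run(x, depth=32)  # 16x the sequential compute, zero extra parameters

n_params = W.size + b.size
print(n_params)  # d*d + d = 272 parameters whether depth is 2 or 32
```

The same idea generalizes: unrolling a tied layer is how a recurrent net, a VIN’s iterated value updates, or a diffusion model’s repeated denoising steps spend more compute without spending more parameters.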