Think of it like AlphaGo—if it only ever could train itself by playing Go against actual humans, it would never have become superintelligent at Go.
This is obviously untrue in both the model-free and model-based RL senses. There are something like 30 million human Go players, each of whom can play a game in about two hours. AlphaGo was trained on policy gradients from, as it happens, on the order of 30m games; so even playing only against actual humans, it could accumulate a similar order of games in under a day. The subset of pro games can be upweighted to provide most of the signal, and when they stop providing signal, well then, it must have reached superhuman… (For perspective, the 0.5m or so professional games used to imitation-train AG came from a single Go server, which was not the most popular, and that’s why AlphaGo Master ran its pro matches on a different, larger server.) Do this for a few days or weeks, and you will likely have exactly that, in a good deal less time than ‘never’, which is a rather long time.

More relevantly, because you’re not making a claim about the AG architecture specifically but about all learning agents in general: with no exploration, MuZero can bootstrap its model-based self-play from somewhere in the neighborhood of hundreds or thousands of ‘real’ games (which should not be a surprise, as the rules of Go are simple), and then easily reaches superhuman play by self-play inside the learned model, with little need for any good human opponents at all. Even if that estimate is off by 3 orders of magnitude, it is still within a day’s worth of human gameplay. Or consider meta-learning sim2real systems like Dactyl, which are trained exclusively in silico on unrealistic simulations and adapt within seconds to reality.

So either way, the claim fails. The sample-inefficiency of DL robotics, DL, or R&D is more a fact about our compute-poverty than about any inherent necessity of interacting with the real world (interaction which is highly parallelizable, learnable offline, and needed in far smaller amounts than existing methods use).
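To make the back-of-the-envelope arithmetic above concrete, here is a rough sketch in Python. The player count, game length, and 30m-game target are the figures quoted in the comment. Everything else is an assumption made purely for illustration: that each human plays just one two-hour game per day against the agent, that every game has a single human in it, and that the MuZero bootstrap needs 1,000 real games. The function names are hypothetical too.

```python
def human_games_per_day(players, hours_per_game, hours_played_per_day, humans_per_game=1):
    """Games per day available from a pool of human players.

    humans_per_game=1 models agent-vs-human games;
    humans_per_game=2 models harvesting human-vs-human games instead.
    """
    games_per_player = hours_played_per_day / hours_per_game
    return players * games_per_player / humans_per_game


def days_to_collect(target_games, games_per_day):
    """Wall-clock days needed to collect target_games at a given daily rate."""
    return target_games / games_per_day


# Figures quoted in the comment above.
PLAYERS = 30_000_000          # rough number of human Go players
HOURS_PER_GAME = 2.0          # rough length of one game
ALPHAGO_GAMES = 30_000_000    # order of magnitude of games AlphaGo trained on

# Assumption (not from the source): each player plays one game per day.
rate = human_games_per_day(PLAYERS, HOURS_PER_GAME, hours_played_per_day=2.0)
alphago_days = days_to_collect(ALPHAGO_GAMES, rate)

# Assumption (not from the source): MuZero bootstraps from 1,000 real games,
# then inflate that by the "3 orders of magnitude off" margin from the comment.
MUZERO_GAMES_PESSIMISTIC = 1_000 * 1_000
muzero_days = days_to_collect(MUZERO_GAMES_PESSIMISTIC, rate)

print(f"AlphaGo-scale dataset: {alphago_days:.2f} days of human play")
print(f"Pessimistic MuZero bootstrap: {muzero_days:.3f} days of human play")
```

Under these deliberately modest assumptions, the AlphaGo-scale game count fits in about one day of human play and the pessimistic MuZero figure in a small fraction of a day. Setting `humans_per_game=2` only doubles the time.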