Looking qualitatively at our agents, we often see general, heuristic behaviours emerge — rather than highly optimised, specific behaviours for individual tasks. Instead of agents knowing exactly the “best thing” to do in a new situation, we see evidence of agents experimenting and changing the state of the world until they’ve achieved a rewarding state.
The blessings of scale strike again. People have been remarkably quick to dismiss the lack of “GPT-4” models as indicating that the scaling hypothesis is dead already. (The only scaling hypothesis refuted by the past year is the ‘budget scaling hypothesis’, if you will. All the other research continues to confirm it.)
Incidentally, it’s well worth reading the previous papers from DeepMind on using populations to learn ever more complex and general tasks: AlphaStar, Quake, rats, VR/language robotics, team soccer.
We’ve all read about AlphaFold 2, but I’d also highlight VQ-VAE being used increasingly pervasively as a drop-in generative model; “Multimodal Few-Shot Learning with Frozen Language Models”, Tsimpoukelli et al 2021, further demonstrating the power of large self-supervised models for fast human-like learning; and the always-underappreciated line of work on MuZero for doing sample-efficient & continuous-action model-based RL: “MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model”, Schrittwieser et al 2021; “Sampled MuZero: Learning and Planning in Complex Action Spaces”, Hubert et al 2021 (benefiting from better use of existing compute, like Podracer).
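Part of why VQ-VAE works so well as a drop-in component is that its core mechanism is simple: a nearest-neighbour lookup into a learned codebook, turning continuous encoder outputs into discrete tokens that any sequence model can then be trained on. A minimal sketch of that quantization step follows; the function name, shapes, and toy usage are my own illustration, not taken from any of the papers cited above.

```python
# Minimal sketch of the vector-quantization bottleneck in a VQ-VAE.
# Illustrative only: names and shapes are assumptions, not from the cited papers.
import numpy as np

def vector_quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (n, d) array of encoder outputs
    codebook: (k, d) array of learned embedding vectors
    returns:  (quantized latents, integer codes)
    """
    # Squared Euclidean distance from every latent to every codebook vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)   # index of the nearest embedding
    quantized = codebook[codes]    # the discrete "tokens" the decoder sees
    return quantized, codes

# Toy usage: 5 latent vectors, a codebook of 16 entries, dimension 8.
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 8))
emb = rng.normal(size=(16, 8))
z_q, idx = vector_quantize(z, emb)
print(idx)  # discrete codes, e.g. usable as targets for an autoregressive prior
```

Those integer codes are the whole point: once images (or audio, or video) are compressed into short discrete sequences, powerful autoregressive or masked sequence models can be reused on them essentially unchanged.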