Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn’t scale, I’d expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.
I think this view dovetails quite strongly with the view expressed in this comment by maximkazhenkov:
Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscience or ML hardware. I worry that once the research community shifts its focus towards RL, the AGI timeline will collapse—not necessarily because there are no more critical insights left to be discovered, but because it’s fundamentally the right path to work on and whatever obstacles remain will buckle quickly once we throw enough warm bodies at them. I think—and this is highly controversial—that the focus on NLP and Vision Transformer has served as a distraction for a couple of years and actually delayed progress towards AGI.
If curiosity-driven exploration gets thrown into the mix and Starcraft/Dota gets solved (for real this time) with comparable data efficiency as humans, that would be a shrieking fire alarm to me (but not to many other people I imagine, as “this has all been done before”).
I was definitely one of those on the Transformer “train”, so to speak, after the initial GPT-3 preprint, and I think this was further reinforced by gwern’s highlighting of the scaling hypothesis—which, while not necessarily unique to Transformers, was framed using Transformers and GPT-3 as relevant examples, in a way that suggested [to me] that Transformers themselves might scale to AGI.
After thinking about this more and looking further into EfficientZero, MuZero, and related work in model-based reinforcement learning, I have updated in favor of the view espoused by Eliezer and maximkazhenkov: model-based RL (and RL-style training in general, which deals with environment dynamics and action spaces in a way that sequence prediction simply does not) seems more likely to me to produce AGI than a scaled-up sequence predictor.
(I still have a weak belief that Transformers and their ilk may be capable of producing AGI, but if so I think it will probably be a substantially longer, harder path than RL-based systems, and will probably involve much more work than just throwing more compute at models.)
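To make that distinction concrete, here is a minimal toy sketch of my own; the function names and the toy environment are illustrative stand-ins, not anything taken from GPT-3, MuZero, or the sources quoted above. A pure sequence predictor is trained only to minimize next-token loss over a stream of tokens, while a MuZero-style learner additionally carries an action space and a learned dynamics model that it can roll forward to choose actions.

```python
# Illustrative toy contrast (my own sketch, not from the quoted sources):
# sequence prediction vs. model-based RL with a learned dynamics model.
import math


def next_token_loss(predict_next, tokens):
    """Average negative log-probability of each next token.

    The training signal is purely "what comes next?"; there is no notion of
    actions or of an environment the learner can act on.
    """
    total = 0.0
    for t in range(len(tokens) - 1):
        probs = predict_next(tokens[: t + 1])           # dict: token -> probability
        total += -math.log(probs.get(tokens[t + 1], 1e-9))
    return total / (len(tokens) - 1)


def plan_one_step(dynamics, reward_model, state, actions):
    """Toy one-step lookahead: imagine each action with a learned dynamics model
    and pick the action whose imagined successor state scores best."""
    def imagined_return(action):
        next_state = dynamics(state, action)            # learned transition model
        return reward_model(next_state)                 # learned reward/value estimate
    return max(actions, key=imagined_return)


if __name__ == "__main__":
    # A maximally dumb "language model": uniform over a 3-token vocabulary.
    uniform = lambda prefix: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
    print("sequence-prediction loss:", next_token_loss(uniform, [0, 1, 2, 1]))

    # Toy environment: actions shift an integer state; reward prefers states near 10.
    dynamics = lambda s, a: s + a
    reward = lambda s: -abs(s - 10)
    print("planned action:", plan_one_step(dynamics, reward, state=7, actions=[-1, 0, 1, 2]))
```

The contrast is purely structural: the second learner's loop has a slot for "what happens if I take this action," which it fills with its own learned model of the environment, while the first has no such slot at all.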
Also, here is a relevant tweet from Eliezer:
I don’t think it scales to superintelligence, not without architectural changes that would permit more like a deliberate intelligence inside to do the predictive work. You don’t get things that want only the loss function, any more than humans explicitly want inclusive fitness.
(in response to)
what’s the primary risk you would expect from a GPT-like architecture scaled up to the regime of superintelligence? (Obviously such architectures are unaligned, but they’re also not incentivized by default to take action in the real world.)