The two major points I take away:

Scaling Just Works: as blasé as we may now be about seeing ‘lines go straight’, I continue to be shocked in my gut that they do just keep going straight, and that something like Gato can be as straightforward as ‘just train a 1.2b-param Transformer on half a thousand different tasks, homes, nbd’ and it works exactly like you’d think and the scaling curve looks exactly like you’d expect. It is shocking how unshocking the results are, conditional on a shocking thesis (the scaling hypothesis). So many S-curves and paradigms hit an exponential wall and explode, but DL/DRL still have not. We should keep in mind that every time we have an opportunity to observe scaling explode in a giant fireball, it doesn’t.
Multi-task learning is indeed just another blessing of scale: as they note, it used to be that learning multiple Atari games in parallel was really hard. It did not work, at all. You got negative transfer even within ALE. People thought very hard and ran lots of experiments to try to create things like PopArt less than 4 years ago, where it was a triumph that, thanks to careful engineering, a single checkpoint could play just the ALE-57 games with mediocre performance.
Decision Transformer definitely made ‘multi-task learning is a blessing of scale’ the default hypothesis, but no one had actually shown this; the past DT and other work (aside from MetaMimic) were all rather low n and k, and you could wonder whether they would interfere at a certain point, or break down and require fancy engineering like MoEs to enable learning at all. (Similarly, Andy Jones showed nice scaling laws for DRL and I scraped together a few examples like Ms. Pac-Man, but nothing across really complicated tasks or many tasks.)
Now you can throw in not just ALE, but DMLab, Metaworld, Procgen, hell, let’s just throw in a bunch of random Internet-scraped text and images and captions and treat those as ‘reinforcement learning tasks’ too, why not, and to make them all play together you do… nothing, really, you just train on them all simultaneously with a big model in about the dumbest way possible and it works fine (a minimal sketch of the recipe follows below).
(Also, if one had any doubts, DM is now fully scale-pilled.)
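To make that ‘dumbest way possible’ recipe concrete, here is a minimal sketch, assuming, purely for illustration, that every task’s episodes have already been serialized into one shared token vocabulary; the model, sizes, and loaders are toy stand-ins rather than anything from the paper:

```python
# Toy sketch of "just train on everything as one token stream" (Gato-style).
# Assumes each task has already been tokenized into a shared vocabulary;
# all names and sizes are illustrative stand-ins.
import random
import torch
import torch.nn as nn

VOCAB, CTX = 32_000, 1024  # shared vocabulary and context length (made up)

class TinyDecoder(nn.Module):
    """A small causal Transformer standing in for the 1.2b-param model."""
    def __init__(self, d=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(CTX, d)
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):
        T = tokens.shape[1]
        # standard causal mask: -inf above the diagonal, 0 elsewhere
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.embed(tokens) + self.pos(torch.arange(T))
        return self.head(self.blocks(h, mask=causal))

# Hypothetical per-task loaders: each returns a (CTX,) LongTensor of tokens
# (Atari frames+actions, robotics trajectories, web text, captions, ...).
loaders = [lambda: torch.randint(0, VOCAB, (CTX,)) for _ in range(4)]

def sample_batch(batch_size=8):
    """Mix tasks by sampling token windows across all loaders: no task IDs,
    no per-task heads, no special handling."""
    return torch.stack([random.choice(loaders)() for _ in range(batch_size)])

model = TinyDecoder()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
for step in range(3):  # one loss, one mixture, one model
    batch = sample_batch()
    logits = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is that nothing task-specific appears anywhere: one vocabulary, one next-token loss, one data mixture, one model.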
I share the gut-shock. This really drives home the high-dimensional-world perspective for me. The dimensionality problem is everything here:
They had to exercise unrealistic levels of standardization and control over a dozen different variables. Presumably their results will not generalize to real sleds on real hills in the wild.
But stop for a moment to consider the implications of the result. A consistent sled-speed can be achieved while controlling only a dozen variables. Out of literally billions.
But then the scaling hypothesis be like “Just allow for billions of variables, lol.”
Are we actually any farther from game over than just feeding this thing the Decision Transformer papers and teaching it to play GitHub Copilot?
Gonna preregister a prediction here so I can bring it up whenever someone asks for examples of naysayer predictions that turned out right:
There’s absolutely no way we’d get game over just from this. Gato (or the Gato architecture/configuration plus training on GitHub code and ML papers) does not have any integrated world-model that it can loop on to do novel long-term planning, only an absurd number of bits and pieces of world-model that are not integrated and therefore not sufficient to foom.
I agree that Gato won’t go FOOM if we train it on GitHub Copilot. However, naysayer predictions about “game over” have always turned out right and will continue to do so right up until it’s too late. So you won’t win any points in my book.
I’d be interested to hear predictions about what a 10x bigger Gato trained on 10x more and 10x more diverse data would and wouldn’t be capable of, and ditto for 100x and 1000x.
Prediction 1: I don’t think we’re going to get a 100x or 1000x Gato, as it would be way too difficult to produce. They would instead have to come up with a derived method that gets training data in a different way from the current Gato. (95% for 100x, 99.9% for 1000x.)
Prediction 2: 1000x sounds like it would be really really capable by current standards. I would not be surprised if it could extrapolate or at least quickly get fine-tuned to just about any problem within the scale it’s been trained on. (50%)
Prediction 3: I want to say that 10x Gato “wouldn’t suck at the tasks” (look at e.g. the image captions for an example of it sucking), but that’s not very specific, though I feel more confident in this than in any specific prediction. (80%) I think people might be more interested in specific predictions if I were to predict that it would suck than that it wouldn’t? So I’m just gonna leave it vague unless there’s particular interest. I don’t even find it all that likely that a 10x Gato will be made.
Given your Prediction 2, it seems like maybe we are on the same page? You seem to be saying that a 1000x Gato would be AGI-except-limited-by-scale-of-training, so if we could just train it for a sufficiently long scale that it could learn to do lots of AI R&D, then we’d get full AGI shortly thereafter, and if we could train it for a sufficiently long scale that it could learn to strategically accumulate power and steer the world away from human control and towards something else, then we’d (potentially) cross an AI-induced-point-of-no-return shortly thereafter. This is about what I think. (I also think that merely 10x or even 100x probably wouldn’t be enough; 1000x is maybe my median.) What scale of training is sufficiently long? Well, that’s a huge open question IMO. I think probably 1000x the scale of current-Gato would be enough, but I’m very unsure.
This is a very weird question to me because I feel like we have all sorts of promising AI techniques that could be readily incorporated into Gato to make it much more generally capable. But I can try to answer it.
There are sort of two ways we could imagine training it to produce an AGI. We could give it sufficiently long-term training data that it learns to pursue AGI in a goal-directed manner, or we could give it a bunch of AGI-researcher training data which it learns to imitate, such that it ends up flailing and sort of vaguely making progress but also just doing a bunch of random stuff.
Creating an AGI is probably one of the longest-scale activities one can imagine, because one is basically creating a persistent successor agent. So in order to pursue this properly in a persistent goal-directed manner, I’d think one needs very long-scale illustrations. For instance, one could have an illustration of someone who starts a religion which then persists long after their death, or similar for starting a company, a trust fund, etc., or creatures evolving to persist in an environment.
This is not very viable in practice. But maybe if you trained an AGI to imitate AI researchers on a shorter scale like weeks or months, then it could produce a lot of AI research that is weakly but not strongly directed towards AGI. This could of course partly be interpolated from imitation of non-AI researchers and programmers and such, which would get you some part of the way.
In both cases I’m skeptical about the viability of getting training data for it. How many people can you really get to record their research activities in sufficient detail for this to work? Probably not enough. And again I don’t think this will be prioritized because there are numerous obvious improvements that can be made on Gato to make it less dependent on training data.
does not have any integrated world-model that it can loop on to do novel long-term planning
I am interested in more of your thoughts on this part, because I do not grok the necessity of a single world-model or long-term planning (though I’m comfortable granting that they would make it much more effective). Are these independent requirements, or are they linked somehow? Would an explanation look like:
Because the chunks of the world model are small, foom won’t meaningfully increase capabilities past a certain point.
Or maybe:
Without long-term planning, the disproportionate investment in increasing capabilities that leads to foom never makes sense.
The second one. The logic for increasing capabilities is “if I increase my capabilities, then I’ll better reach my goal”. But Gato does not implement the dynamic of “if I infer [if X then I’ll better reach my goal] then promote X to an instrumental goal”. Nor does it particularly pursue goals by any other means. Gato just acts similarly to how its training examples acted in situations similar to the ones it finds itself in.
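To make the contrast concrete, here is a toy sketch of the two dynamics; every interface in it is hypothetical and chosen only to illustrate the shape of the argument:

```python
# Illustrative only: "act like the training data did" vs. "promote whatever
# helps the goal to an instrumental goal". All interfaces are hypothetical.
from typing import Callable, List

def imitate(policy_logprob: Callable[[str, str], float],
            state: str, actions: List[str]) -> str:
    """Gato-style: pick the action the sequence model finds most probable in
    situations like this one. No goal, no world-model rollout, no loop."""
    return max(actions, key=lambda a: policy_logprob(state, a))

def plan(world_model: Callable[[str, str], str],
         goal_value: Callable[[str], float],
         state: str, actions: List[str], depth: int = 3) -> str:
    """Goal-directed: roll an integrated world model forward and keep whichever
    action (e.g. 'increase my capabilities') scores best for the goal; this is
    the dynamic Gato does not implement."""
    def value(s: str, d: int) -> float:
        if d == 0:
            return goal_value(s)
        return max(value(world_model(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: value(world_model(state, a), depth - 1))
```

Foom-style self-improvement only makes sense inside something like plan, where ‘increase my capabilities’ can be discovered as instrumentally useful; imitate never evaluates such a step at all.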
So many S-curves and paradigms hit an exponential wall and explode, but DL/DRL still have not.
Don’t the scaling laws use logarithmic axes? That would suggest that the phenomenon is indeed exponential in nature. If we need X times more compute and X times more data for each additional improvement, we will hit the wall quite soon. There is only so much useful text on the Web, and only so much compute that labs are willing to spend on this given the diminishing returns.
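For what it’s worth, a straight line on log-log axes is a power law, and a power law does make each further improvement multiplicatively more expensive. A back-of-the-envelope sketch, with a made-up scaling exponent and made-up token count (only the 1.2b-parameter figure comes from the discussion above) and the common ~6 × params × tokens rule of thumb for training FLOPs:

```python
# Back-of-the-envelope: a straight line on log-log axes is a power law
# L(C) = a * C**(-b), so a fixed multiplicative loss improvement costs a
# fixed (large) multiple of compute. Exponent and token count are made up.
b = 0.05                                  # hypothetical scaling exponent
print(f"compute multiplier to halve loss: {2 ** (1 / b):.1e}")  # ~1e6

# Rough rule of thumb: training FLOPs ~ 6 * params * tokens.
params, tokens = 1.2e9, 1e12              # 1.2b params; token count illustrative
for scale in (10, 100, 1000):
    flops = 6 * (params * scale) * (tokens * scale)
    print(f"{scale:>4}x params and {scale}x data -> ~{flops:.1e} training FLOPs")
```

So matching every 10x in parameters with 10x more data costs roughly 100x more compute, which is exactly the wall being pointed at: both the data and the compute budgets have to keep growing multiplicatively.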
There is a lot more useful data on YouTube (by several orders of magnitude at least? idk); I think the next wave of such breakthrough models will train on video.
If anyone was wondering whether DM planned to follow it up in the obvious way, because of the obvious implications of its obvious generality and obvious scalability, Hassabis says on the Lex Fridman podcast: “it’s just the beginning really; it’s our most general agent, one could call it, so far, but that itself can be scaled up massively more than we’ve done so far, and obviously we’re in the middle of doing that.”