For what it’s worth, I was thoroughly underwhelmed by Gato, to the point of feeling confused about what the paper was even trying to demonstrate.
I’m not the only ML researcher who had this reaction. In the Eleuther Discord server, I said “i don’t get what i’m supposed to take away from this gato paper,” and responses from regulars included
“nothing, this was 3 years over-due”
“Yep. I didn’t update much on this paper. I think the ‘general’ in the title is making people panic lol” (with two “this” reacts)
Or see this tweet. I’m not trying to convince you by saying “lots of people agree with me!”, but I think this may be useful context.
A key thing to remember when evaluating Gato is that it was trained on data from many RL models that were themselves very impressive. So there are 2 very different questions we can ask:
1. Does Gato successfully distill a large number of learned RL policies into a single, small collection of params?
2. Does Gato do anything except distillation? Is there significant beneficial transfer between tasks or data types? Is Gato any more of a “generalist agent” than, like, a big cloud storage bucket with all of those RL models in it, and a little script that lets you pick which one to load and run?
And the answers are a pretty clear, stark “yes” and “no,” respectively.
For #2, note that every time the paper investigates transfer, it gets results that are mostly or entirely negative (see Figs 9 and 17). For example, including stuff like text data makes Gato seem more sexily “generalist” but does not actually seem to help anything—it’s like uploading a (low-quality) LM to the same cloud bucket as the RL policies. It just sits there.
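For concreteness, the “cloud bucket plus a little script” alternative I keep invoking is nothing more sophisticated than the sketch below. Every name, path, and interface in it is made up (old-Gym-style env assumed); the point is just that “pick the right checkpoint and roll it out” is a trivial dispatcher, not a generalist agent:

```python
from typing import Any, Callable, Dict

# Hypothetical index mapping each task to its single-task expert checkpoint.
POLICY_PATHS: Dict[str, str] = {
    "atari_breakout": "bucket/policies/breakout.pt",
    "robot_stacking": "bucket/policies/stacking.pt",
    "procgen_coinrun": "bucket/policies/coinrun.pt",
}

def load_policy(path: str) -> Callable[[Any], Any]:
    """Stub: in practice, deserialize whatever checkpoint lives at `path`."""
    raise NotImplementedError

def run_task(task_name: str, env) -> float:
    """The entire 'agent': pick the matching expert and roll it out."""
    policy = load_policy(POLICY_PATHS[task_name])
    obs, total_reward, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward
    return total_reward
```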
In the particular case of the robot stacking experiment, I don’t think your read is accurate, for reasons related to the above. Neither the transfer to real robotics, nor the effectiveness of offline finetuning, are new to Gato—the researchers are sticking as close as they can to what was done in Lee et al 2022, which used the same stacking task + offline finetuning + real robots, and getting (I think?) broadly similar results. That is, this is yet another success of distillation, without a clear value-add beyond distillation.
In the specific case of Lee et al’s “Skill Generalization” task, it’s important to note that the “expert” line is not reflective of “SOTA RL expert models.”
The stacking task is partitioned here (over object shapes/colors) into train and test subsets. The “expert” is trained only on the train subset, and then Lee et al (and the Gato authors) investigate models that are additionally tuned on the test subset in some way or other. So the “expert” is really a baseline here, and the task consists of trying to beat it.
(This distinction is made somewhat clearer in an appendix of the Gato paper—see Fig. 17, and note that the “expert” lines there match the “Dataset” lines from Fig. 3 in Lee et al 2022.)
FWIW I agree with this take & basically said as much in my post; Gato is about what I would have expected given past progress. I think people are right to freak out now about oncoming AGI, but I think they should have been freaking out already, and Gato just had a sufficiently sexy title and abstract. It’s like how people should have been freaking out about COVID early on but only actually started freaking out when hospitals started getting crowded in their own country.
As for the transfer, I would actually have been a bit surprised if there was significant positive transfer given the small number of tasks trained on and the small model size. I’m curious to hear if there was negative transfer though and if so how much. I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?
I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?
Sure, this might happen.
But remember, to train “a Gato,” we have to first train all the RL policies that generate its training data. So we have access to all of them too. Instead of training Gato, we could just find the one policy that seems closest to the target task, and spend all our compute on just finetuning it. (Yes, related tasks transfer—and the most related tasks transfer most!)
This approach doesn’t have to spend any compute on the “train Gato” step before finetuning, which gives it a head start. Plus, the individual policy models are generally much smaller than Gato, so they take less compute per step.
Would this work? In the case of the Lee et al robot problem, yes (this is roughly what Lee et al originally did, albeit with various caveats). In general, I don’t know, but this is the baseline that Gato should be comparing itself against.
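Spelled out, the baseline I have in mind is roughly the sketch below; the `similarity` and `finetune` callables are hypothetical placeholders, not anything from the Gato paper:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ExpertPolicy:
    task: str
    params: object  # the checkpoint weights, however they are stored

def nearest_expert_baseline(
    target_task: str,
    policy_zoo: List[ExpertPolicy],
    similarity: Callable[[str, str], float],
    finetune: Callable[[ExpertPolicy, str], ExpertPolicy],
) -> ExpertPolicy:
    # Related tasks transfer, and the most related tasks transfer most.
    closest = max(policy_zoo, key=lambda p: similarity(p.task, target_task))
    # No "train Gato" step: the whole compute budget goes into finetuning
    # a small single-task expert rather than a larger generalist.
    return finetune(closest, target_task)
```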
The question isn’t “will it improve with scale?”—it’s 2022, anything worth doing improves with scale—but “will it ever reach the Pareto frontier? will I ever have a reason to do it?”
As an ML practitioner, it feels like the paper is telling me, “hey, think of a thing you can already do. What if I told you a way to do the same thing, equally well, with an extra step in the middle?” Like, uh, sure, but . . . why?
By contrast, when I read papers like AlphaGo, BERT, CLIP, OpenAI diffusion, Chinchilla . . . those are the papers where I say, “holy shit, this Fucking Works™, this moves the Pareto frontier.” In several of these cases I went out and immediately used the method in the real world and reaped great rewards.
IMO, the “generalist agent” framing is misleading, insofar as it obscures this second-best quality of Gato. It’s not really any more an “agent” than my hypothetical cloud drive with a bunch of SOTA models on it. Prompting Gato is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive. (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)
We are not talking about an autonomous thing; we’re still in the world where there’s a human practitioner and “Gato” is one method they can use or not use. And I don’t see why I would want to use it.
But remember, to train “a Gato,” we have to first train all the RL policies that generate its training data. So we have access to all of them too.
No, you don’t have to, nor do you have guaranteed access, nor would you necessarily want to use them rather than Gato if you did. As Daniel points out, this is obviously untrue of all of the datasets Gato is simply doing self-supervised learning on (how did we ‘train the RL policy’ for photographs?). It is also not true because Gato is trained off-policy and offline: the experts could be human, or they could be the output of non-RL algorithms which are infeasible to run, such as large search processes (eg chess endgame tables) or brittle, non-generalizable, expert-hand-engineered algorithms; or they could be RL policies you don’t have direct access to (because they’ve bitrotten or their owners won’t let you), or even RL policies which no longer exist because the agents were deleted but their data remains; or they could be RL policies from an oracle setting where you can’t run the original policy in the meaningful real-world context (eg in robotics sim2real, where you train the expert with oracle access to the simulation’s ground truth to get a good source of demonstrations, but at the end you need a policy which doesn’t use that oracle so you can run it on a real robot), or more broadly any kind of meta-learning context where you have data from RL policies for some problems in a family of problems and want to induce a general solver; or they could be filtered high-reward episodes from large numbers of attempts by brute-force, dumb (even random) agents, where you trivially have ‘access to all of them’ but that is useless; or… Those RL policies may also not be better than a Gato or DT to begin with, because imitation learning can exceed the observed experts, and the ‘RL policies’ here might be, say, random baselines which merely have good coverage of the state-space. Plus, nothing at all stops Decision Transformer from doing its own exploration (planning was already demonstrated by DT/Trajectory Transformer, and there has been follow-up work like Online Decision Transformer).
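(For readers who haven’t seen the return-conditioning trick alluded to above: a Decision Transformer is trained on offline trajectories but conditioned at rollout time on a target return-to-go, which can be set above anything observed in the data. A minimal sketch, with `model.predict_action` and the old-Gym-style `env` as hypothetical stand-ins rather than code from any of the papers mentioned:)

```python
# Minimal sketch of Decision-Transformer-style inference: the sequence model
# is conditioned on a target return-to-go that can exceed anything in the
# offline data, which is one way imitation-style training can beat the
# demonstrators.

def decision_transformer_rollout(model, env, target_return: float,
                                 max_steps: int = 1000) -> float:
    obs = env.reset()
    rtg, observations, actions = [target_return], [obs], []
    achieved = 0.0
    for _ in range(max_steps):
        action = model.predict_action(rtg, observations, actions)
        obs, reward, done, _ = env.step(action)
        achieved += reward
        actions.append(action)
        observations.append(obs)
        rtg.append(rtg[-1] - reward)  # decrement the conditioning target
        if done:
            break
    return achieved
```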
I thought some of the “experts” Gato was trained on were not from-scratch models but rather humans—e.g. images and text generated by humans.
Relatedly, instead of using a model as the “expert” couldn’t you use a human demonstrator? Like, suppose you are training it to control a drone flying through a warehouse. Couldn’t you have humans fly the drones for a bit and then have it train on those demonstrations?
It’s not really any more an “agent” than my hypothetical cloud drive with a bunch of SOTA models on it. Prompting GATO is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive. (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)
This is false if significant transfer/generalization starts to happen, right? A drive full of a bunch of SOTA models, plus a rule for deciding what to use, is worse than Gato to the extent that Gato is able to generalize few-shot or zero-shot to new tasks and/or insofar as Gato gets gains from transfer.
EDIT: Meta-comment: I think we are partially just talking past each other here. For example, you think that the question is ‘will it ever reach the Pareto frontier,’ which is definitely not the question I care about.
Meta-comment of my own: I’m going to have to tap out of this conversation after this comment. I appreciate that you’re asking questions in good faith, and this isn’t your fault, but I find this type of exchange stressful and tiring to conduct.
Specifically, I’m writing at the level of exactness/explicitness that I normally expect in research conversations, but it seems like that is not enough here to avoid misunderstandings. It’s tough for me to find the right level of explicitness while avoiding the urge to put thousands of very pedantic words in every comment, just in case.
Re: non-RL training data.
Above, I used “RL policies” as a casual synecdoche for “sources of Gato training data,” for reasons similar to the reasons that this post by Oliver Sourbut focuses on RL/control.
Yes, Gato had other sources of training data, but (1) the RL/control results are the ones everyone is talking about, and (2) the paper shows that the RL/control training data is driving those results (they get even better RL/control outcomes when they drop the other data sources).
Re: gains from transfer.
Yes, if Gato outperforms a particular RL/control policy that generated training data for it, then having Gato is better than merely having that policy, in the case where you want to do its target task.
However, training a Gato is not the only way of reaping gains from transfer. Every time we finetune any model, or use multi-task training, we are reaping gains from transfer. The literature (incl. this paper) robustly shows that we get the biggest gains from transfer when transferring between similar tasks, while distant or unrelated tasks yield no transfer or even negative transfer.
So you can imagine a spectrum ranging from
1. “pretrain only on one very related task” (i.e. finetuning a single narrow-task model), to
2. “pretrain on a collection of similar tasks” (i.e. multi-task pretraining followed by finetuning), to
3. “pretrain on every task, even those where you expect no or negative transfer” (i.e. Gato).
The difference between Gato (3) and ordinary multi-task pretraining (2) is that, where the latter would only train with a few closely related tasks, Gato also trains on many other less related tasks.
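To make the spectrum concrete, here is one way you might assemble the pretraining mixture at each of the three points; the task pool and the `relatedness` score are purely illustrative, not anything from the Gato paper:

```python
from typing import Callable, List

# Purely illustrative task pool for points (1)-(3) of the spectrum above.
ALL_TASKS: List[str] = [
    "stack_red_on_blue", "stack_blue_on_green",   # close relatives of the target
    "atari_pong", "dm_control_walker",            # other control tasks
    "image_captioning", "web_text",               # non-control modalities
]

def pretraining_mixture(target: str, breadth: str,
                        relatedness: Callable[[str, str], float]) -> List[str]:
    if breadth == "narrow":       # (1) one very related task
        return [max(ALL_TASKS, key=lambda t: relatedness(t, target))]
    if breadth == "multi_task":   # (2) a cluster of similar tasks
        return [t for t in ALL_TASKS if relatedness(t, target) > 0.5]
    return list(ALL_TASKS)        # (3) everything, Gato-style
```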
It would be cool if this helped, and sometimes it does help, as in this paper about training on many modalities at once for multi-modal learning with small transformers. But this is not what the Gato authors found—indeed it’s basically the opposite of what they found.
We could use a bigger model in the hope that it will get us some gains from distant transfer (and there is some evidence that this will help), but with the same resources, we could also restrict ourselves to less-irrelevant data and then train a smaller (or same-sized) model on more of it. Gato is at one extreme end of this spectrum, and everything suggests the optimum is somewhere in the interior.
Oliver’s post, which I basically agree with, has more details on the transfer results.
A single network is solving 600 different tasks spanning different areas. 100+ of the tasks are solved at 100% human performance. Let that sink in.
While not a breakthrough in arbitrarily scalable generality, the fact that so many tasks can be fitted into one architecture is surprising and novel. For many real-life applications, being good at 100-1000 tasks makes an AI general enough to be deployed as an error-tolerant robot, say in a warehouse.
The main point imho is that this architecture may be enough to be scaled (10-1000x parameters) in a few years to a useful proto-AGI product.
But the mere fact that one network may be useful for many tasks at once has been extensively investigated since the 1990s.