I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?
Sure, this might happen.
But remember, to train “a Gato,” we have to first train all the RL policies that generate its training data. So we have access to all of them too. Instead of training Gato, we could just find the one policy that seems closest to the target task, and spend all our compute on just finetuning it. (Yes, related tasks transfer—and the most related tasks transfer most!)
This approach doesn’t have to spend any compute on the “train Gato” step before finetuning, which gives it a head start. Plus, the individual policy models are generally much smaller than Gato, so they take less compute per step.
Would this work? In the case of the Lee et al. robot problem, yes (this is roughly what Lee et al. originally did, albeit with various caveats). In general, I don’t know, but this is the baseline that Gato should be comparing itself against.
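The baseline being argued for here—pick the already-trained policy whose source task is closest to the target, then spend the whole compute budget finetuning it—can be sketched in a toy setting. Everything below (the quadratic "loss," the policy library, the numbers) is an illustrative assumption of mine, not anything from the Gato paper or Lee et al.:

```python
# Toy sketch of the "finetune the closest policy" baseline. Policies
# are parameter vectors; each policy sits at the optimum of its own
# source task, so parameter distance proxies task similarity.
# "Finetuning" is gradient descent on a quadratic loss.

def loss(params, target):
    return sum((p - t) ** 2 for p, t in zip(params, target))

def finetune(params, target, steps, lr=0.1):
    params = list(params)
    for _ in range(steps):
        grad = [2 * (p - t) for p, t in zip(params, target)]
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

def closest_policy(library, target_task):
    # "The most related tasks transfer most": start from the policy
    # whose source task is nearest to the target task.
    return min(library, key=lambda entry: loss(entry["params"], target_task))

library = [
    {"name": "policy_A", "params": [0.0, 0.0]},
    {"name": "policy_B", "params": [0.9, 1.1]},  # trained on a similar task
]
target = [1.0, 1.0]

start = closest_policy(library, target)
tuned = finetune(start["params"], target, steps=50)
print(start["name"], loss(tuned, target))
```

The point of the sketch is only that this baseline spends zero compute before finetuning begins, which is the head start described above.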
The question isn’t “will it improve with scale?”—it’s 2022, anything worth doing improves with scale—but “will it ever reach the Pareto frontier? will I ever have a reason to do it?”
As an ML practitioner, it feels like the paper is telling me, “hey, think of a thing you can already do. What if I told you a way to do the same thing, equally well, with an extra step in the middle?” Like, uh, sure, but . . . why?
By contrast, when I read papers like AlphaGo, BERT, CLIP, OpenAI diffusion, Chinchilla . . . those are the papers that make me say, “holy shit, this Fucking Works™, this moves the Pareto frontier.” In several of these cases I went out and immediately used the method in the real world and reaped great rewards.
IMO, the “generalist agent” framing is misleading, insofar as it obscures this second-best quality of Gato. It’s not really any more an “agent” than my hypothetical cloud drive with a bunch of SOTA models on it. Prompting Gato is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive. (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)
We are not talking about an autonomous thing; we’re still in the world where there’s a human practitioner and “Gato” is one method they can use or not use. And I don’t see why I would want to use it.
> But remember, to train “a Gato,” we have to first train all the RL policies that generate its training data. So we have access to all of them too.
No, you don’t have to, nor do you have guaranteed access, nor would you necessarily want to use them rather than Gato if you did. As Daniel points out, this is obviously untrue of all of the datasets it’s simply doing self-supervised learning on (how did we ‘train the RL policy’ for photographs?). It is also untrue because Gato is off-policy and offline. The experts could be:

- humans;
- the output of non-RL algorithms which are infeasible to run directly, like large search processes (eg chess endgame tables) or brittle, non-generalizable, expert-hand-engineered algorithms;
- RL policies you don’t have direct access to (because they’ve bitrotten, or their owners won’t let you), or RL policies which no longer exist because the agents were deleted but their data remains;
- RL policies from an oracle setting where you can’t run the original policy in the meaningful real-world context (eg in robotics sim2real, where you train the expert with oracle access to the simulation’s ground truth to get a good source of demonstrations, but at the end you need a policy which doesn’t use that oracle, so you can run it on a real robot), or more broadly any kind of meta-learning context where you have data from RL policies for some problems in a family of problems and want to induce general solving;
- filtered high-reward episodes from large numbers of attempts by brute-force dumb (even random) agents, where you trivially have ‘access to all of them’ but that access is useless;
- or…

Those RL policies may also not be better than a Gato or DT to begin with, because imitation learning can exceed observed experts, and the ‘RL policies’ here might be, say, random baselines which merely have good coverage of the state-space. Plus, nothing at all stops Decision Transformer from doing its own exploration (planning was already demonstrated by DT/Trajectory Transformer, and there’s been work since, like Online Decision Transformer).
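The last point, that a DT-style learner can exceed the behaviour policy that generated its data by conditioning on high returns, can be shown in a deliberately tiny sketch. The one-step “environment,” the lookup-table stand-in for the transformer, and all numbers below are my illustrative assumptions, not the Gato or Decision Transformer implementation:

```python
# Minimal illustration of return-conditioning: train on trajectories
# labelled with their returns, then *prompt* with a high target return
# at test time. Even though the behaviour policy is uniformly random,
# high-return episodes exist in its data, and conditioning on them
# recovers good behaviour.
import random

random.seed(0)

# One-step bandit: action 2 gives reward 1, everything else gives 0.
def env_reward(action):
    return 1 if action == 2 else 0

# Behaviour data from a uniformly random ("dumb") agent.
episodes = []
for _ in range(100):
    a = random.randrange(4)
    episodes.append({"return": env_reward(a), "action": a})

# "Training": for each observed return, remember which actions the
# behaviour policy took when it achieved that return.
by_return = {}
for ep in episodes:
    by_return.setdefault(ep["return"], []).append(ep["action"])

# "Inference": condition on the best return seen anywhere in the data.
target_return = max(by_return)
action = by_return[target_return][0]
print(target_return, action, env_reward(action))
```

The random behaviour policy averages reward 0.25, but the return-conditioned “model” acts at reward 1, which is the sense in which imitation learning over labelled returns can exceed the observed experts.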
I thought some of the “experts” Gato was trained on were not from-scratch models but rather humans—e.g. images and text generated by humans.
Relatedly, instead of using a model as the “expert” couldn’t you use a human demonstrator? Like, suppose you are training it to control a drone flying through a warehouse. Couldn’t you have humans fly the drones for a bit and then have it train on those demonstrations?
> It’s not really any more an “agent” than my hypothetical cloud drive with a bunch of SOTA models on it. Prompting Gato is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive. (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)
This is false if significant transfer/generalization starts to happen, right? A drive full of a bunch of SOTA models, plus a rule for deciding what to use, is worse than Gato to the extent that Gato is able to generalize few-shot or zero-shot to new tasks and/or insofar as Gato gets gains from transfer.
EDIT: Meta-comment: I think we are partially just talking past each other here. For example, you think that the question is ‘will it ever reach the Pareto frontier,’ which is definitely not the question I care about.
Meta-comment of my own: I’m going to have to tap out of this conversation after this comment. I appreciate that you’re asking questions in good faith, and this isn’t your fault, but I find this type of exchange stressful and tiring to conduct.
Specifically, I’m writing at the level of exactness/explicitness that I normally expect in research conversations, but it seems like that is not enough here to avoid misunderstandings. It’s tough for me to find the right level of explicitness while avoiding the urge to put thousands of very pedantic words in every comment, just in case.
Re: non-RL training data.
Above, I used “RL policies” as a casual synecdoche for “sources of Gato training data,” for much the same reasons that this post by Oliver Sourbut focuses on RL/control.
Yes, Gato had other sources of training data, but (1) the RL/control results are the ones everyone is talking about, and (2) the paper shows that the RL/control training data is driving those results (they get even better RL/control outcomes when they drop the other data sources).
Re: gains from transfer.
Yes, if Gato outperforms a particular RL/control policy that generated training data for it, then having Gato is better than merely having that policy, in the case where you want to do its target task.
However, training a Gato is not the only way of reaping gains from transfer. Every time we finetune any model, or use multi-task training, we are reaping gains from transfer. The literature (incl. this paper) robustly shows that we get the biggest gains from transfer when transferring between similar tasks, while distant or unrelated tasks yield no transfer or even negative transfer.
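The “most related tasks transfer most” pattern can be seen even in a toy setting: pretrain on a source task, finetune on the target for a fixed small budget, and watch the final target loss grow as the source task gets more distant. The quadratic “tasks” and all numbers here are illustrative assumptions, not results from any paper:

```python
# Toy transfer experiment. A "task" is a point on the line, the
# "policy" is a point, and training is gradient descent on squared
# distance to the task. With a fixed, small finetuning budget, a
# nearer pretraining task leaves you closer to the target.

def train(start, task, steps, lr=0.1):
    x = start
    for _ in range(steps):
        x -= lr * 2 * (x - task)
    return x

target_task = 0.0
budget = 5  # small finetuning budget, as in the argument above

results = {}
for source_task in [0.1, 1.0, 10.0]:
    pretrained = train(start=source_task, task=source_task, steps=100)
    finetuned = train(start=pretrained, task=target_task, steps=budget)
    results[source_task] = (finetuned - target_task) ** 2

print(results)  # final target loss, by distance of the source task
```

Final loss rises monotonically with source-task distance, which is the toy version of “distant or unrelated tasks yield no transfer or even negative transfer.”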
So you can imagine a spectrum ranging from:
1. “pretrain only on one very related task” (i.e. finetuning a single narrow task model), to
2. “pretrain on a collection of similar tasks” (i.e. multi-task pretraining followed by finetuning), to
3. “pretrain on every task, even those where you expect no or negative transfer” (i.e. Gato).
The difference between Gato (3) and ordinary multi-task pretraining (2) is that, where the latter would only train with a few closely related tasks, Gato also trains on many other less related tasks.
It would be cool if this helped, and sometimes it does help, as in this paper about training on many modalities at once for multi-modal learning with small transformers. But this is not what the Gato authors found—indeed it’s basically the opposite of what they found.
We could use a bigger model in the hope that it will get us some gains from distant transfer (and there is some evidence that it will), but with the same resources, we could also restrict ourselves to less-irrelevant data and then train a smaller (or same-sized) model on more of it. Gato is at one extreme end of this spectrum, and everything suggests the optimum is somewhere in the interior.
Oliver’s post, which I basically agree with, has more details on the transfer results.