We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL).
Looking at the image captioning and text prediction responses, it doesn’t appear to be very good at either...
It’s smaller than GPT-2: only 1.2B params (vs. GPT-2’s 1.5B).
And there is, of course, absolutely no reason to think that it wouldn’t get as good as text/image models like Flamingo or the new ULM2 if it was trained & scaled as much as they were; the problem is that you can’t run such large dense models at the necessary low latency for realtime robotics… Perhaps finally a genuine application for MoEs to enable plugging in very large unimodal/multimodal models.
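To put rough numbers on the latency point (a sketch only; the precision, bandwidth, and control-rate figures below are my assumptions, not anything from the paper): a single decode step can’t be faster than the time it takes to stream the weights through memory once, and that alone separates a Gato-sized model from a Flamingo-sized one.

```python
# Lower bound on per-step decode latency for a dense model, assuming each control
# step is memory-bandwidth-bound (every weight streamed from memory once).
# Illustrative assumptions: bf16 weights (2 bytes/param), ~936 GB/s bandwidth
# (roughly RTX 3090-class, the GPU used for robot inference per Section G).

def min_step_latency_ms(n_params: float, bytes_per_param: float = 2.0,
                        mem_bandwidth_gb_s: float = 936.0) -> float:
    """Milliseconds just to read the weights once."""
    return n_params * bytes_per_param / (mem_bandwidth_gb_s * 1e9) * 1e3

for name, n in [("Gato-sized (1.2B)", 1.2e9), ("Flamingo-sized (80B)", 80e9)]:
    print(f"{name}: >= {min_step_latency_ms(n):.1f} ms per step")

# Gato-sized (1.2B):    >= 2.6 ms   -- fits a ~20-50 Hz control loop with room to spare
# Flamingo-sized (80B): >= 170.9 ms -- blows any real-time budget outright
# (and 80B params at 2 bytes each would not even fit in 24 GB of VRAM)
```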
A principled solution would probably involve running different parts of the model at different frequencies. But you could also just scale breadth and see how far it goes. The human brain is not very deep—just recursive.
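In case it helps to make “different frequencies” concrete, here is a hypothetical two-rate loop (none of these module names or numbers are from the paper): a big, slow module refreshes a plan every K steps, and a small, fast module consumes the cached plan every step.

```python
import numpy as np

# Hypothetical two-rate control loop: a large, slow "planner" runs at low frequency,
# while a small, fast "controller" runs every step on the planner's cached output.
# The classes are toy stand-ins for real networks; only the scheduling pattern matters.

class SlowPlanner:
    """Expensive model; imagine hundreds of ms per call."""
    def plan(self, obs: np.ndarray) -> np.ndarray:
        return np.tanh(obs.mean(keepdims=True).repeat(16))  # latent "plan" vector

class FastController:
    """Cheap model; sub-millisecond per call."""
    def act(self, obs: np.ndarray, plan: np.ndarray) -> np.ndarray:
        return np.clip(obs[:4] + plan[:4], -1.0, 1.0)       # toy 4-D action

PLAN_EVERY = 25   # planner at 2 Hz if the control loop runs at 50 Hz

planner, controller = SlowPlanner(), FastController()
cached_plan = None
for step in range(200):
    obs = np.random.randn(32)                  # stand-in for the real observation
    if step % PLAN_EVERY == 0:
        cached_plan = planner.plan(obs)        # expensive call, amortized over 25 steps
    action = controller.act(obs, cached_plan)  # cheap call, every step
```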
I wouldn’t have connected breadth and recursion. (I’d have just thought, well, self-calling.)
A friend pointed out on Facebook that Gato uses TPU-v3s. Not sure why—I thought Google already had v4s available for internal use a while ago? In any case, TPU-v4s might help a lot with the latency issue.
Two main options:
* It was trained e.g. a year ago but only published now.
* All the TPU-v4s are busy with something even more important.
They trained it on TPU-v3s; however, the robot inference was run on a GeForce RTX 3090 (see Section G).
TPUs are mostly designed for data centers and are not really usable for on-device inference.
Indeed, but to slightly counterbalance the small parameter count: it looks like it was trained on ~500B tokens (vs. ~300B for GPT-3 and something like ~50B for GPT-2).
Most of those tokens were spent on the RL tasks, which were 85% of the corpus. Looking at Tables 1a/1b, the pure text-modeling datasets look like they got ~10% of the weight, with the other ~5% being the image-caption datasets*; so if it consumed ~500B tokens total (Figure 9), then presumably it only saw about a tenth of that as actual pure text comparable to GPT-2’s data, or ~50B tokens (rough arithmetic below). It’s also a small model, so it is less sample-efficient and will get less than n billion tokens’ worth (if you are mentally working back from “well, GPT-3 used x billion tokens”).
Considering further that it was not necessarily trained to convergence on the language-modeling task (actually, come to think of it, how did they even decide when to stop training? They certainly didn’t derive scaling laws on the overall task mix & train Gato in a compute-optimal fashion… had Gato converged on any task?), and remembering just how dumb GPT-2 is by contemporary standards (which have been moving the goalposts at supersonic speed), the sample dialogues don’t look all that surprisingly dumb to me given its size & token count & training setup.
* image grounding is great and all that, but I don’t expect it to be all that useful for knowing ‘Marseilles is not the capital of France’.
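Rough arithmetic behind the ~50B pure-text estimate above, assuming the ~500B total from Figure 9 and the 85/10/5 sampling split read off Tables 1a/1b are about right:

```python
# Back-of-the-envelope for how much "pure text" Gato actually saw.
# All inputs are approximate readings from the paper, not exact figures.
total_tokens   = 5.0e11   # ~500B tokens consumed over training (Figure 9)
rl_weight      = 0.85     # control/RL episodes
text_weight    = 0.10     # pure text-modeling data
caption_weight = 0.05     # image-caption datasets

text_tokens = total_tokens * text_weight
print(f"pure-text tokens: ~{text_tokens / 1e9:.0f}B")
# -> ~50B, i.e. roughly GPT-2-scale text exposure, vs. ~300B for GPT-3.
```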