Indeed, but to slightly counterbalance this: at the same time, it looks like it was trained on ~500B tokens (versus ~300B for GPT-3, and something like ~50B for GPT-2).
Most of those tokens were spent on the RL tasks, which made up 85% of the corpus. Looking at Tables 1a/1b, the pure text-modeling tasks appear to carry ~10% weight, with the remaining ~5% going to the image-caption datasets*; so if it did 5 × 1e11 tokens total (Figure 9), then presumably it only saw a tenth of that as actual pure text comparable to GPT-2, or ~50B tokens. It’s also a small model, so it is less sample-efficient and will get less than n billion tokens’ worth if you are mentally working back from “well, GPT-3 used x billion tokens”.
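The back-of-envelope arithmetic here can be made explicit; a minimal sketch, assuming the ~85%/10%/5% corpus weights read off Tables 1a/1b and the ~500B total from Figure 9:

```python
# Rough estimate of Gato's effective pure-text token budget,
# assuming ~85% RL / ~10% text / ~5% captions corpus weights.
total_tokens = 5e11        # ~500B total training tokens (Figure 9)
text_weight = 0.10         # approximate pure text-modeling share

effective_text_tokens = total_tokens * text_weight
print(f"~{effective_text_tokens / 1e9:.0f}B pure-text tokens")  # ~50B, roughly GPT-2's budget
```

So on these assumptions the pure-text exposure lands in GPT-2 territory, not GPT-3 territory.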
Considering further that it was not necessarily trained to convergence on the language-modeling task (actually, come to think of it, how did they even decide when to stop training? They certainly didn’t derive scaling laws on the overall task mix & train Gato in a compute-optimal fashion… was Gato converged on any tasks?), and remembering just how dumb GPT-2 is by contemporary standards (which have been moving the goalposts at supersonic speed), the sample dialogues don’t look all that surprisingly dumb to me given its size, token count, & training setup.
* image grounding is great and all that, but I don’t expect it to be all that useful for knowing ‘Marseilles is not the capital of France’.