Having not read the detailed results yet, I would be quite surprised if [Gato] performed better on language-only tasks than a pretrained language model of the same size...
In general, from a “timelines to risky systems” perspective, I’m not that interested in these sorts of “generic agents” that can do all the things with one neural net; it seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks and so that’s what I expect to see.
Do you still believe this in light of the paper on mixed-modal scaling laws? From the paper:

Given these laws, we can now make predictions about what scale will be required to overcome modality competition and achieve synergy from training on each pair of modalities. By modality competition, we refer to the empirical phenomenon of two modalities performing worse than if we had trained two individual models on the same number of per-modality tokens. By synergy, we mean the inverse. We can define the notion of synergy formally through our scaling laws. [...]

We plot the ratio of the average per-timestep perplexity of the Speech and Text models to the Speech|Text perplexity, along with the competition barrier and the predictions from our scaling laws, in Figure 5. As we see, the prediction does hold, and we achieve a model that crosses the competition barrier. Further scaling is likely to improve the synergy further, but we leave this exploration to future work.
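To make the quoted definitions concrete, here is a minimal formalization consistent with the excerpt; the notation is mine, not necessarily the paper's. Writing $\mathrm{PPL}_S(N)$ and $\mathrm{PPL}_T(N)$ for the per-timestep perplexities of unimodal Speech and Text models at scale $N$, the competition barrier is the unimodal average

$$B_{S,T}(N) = \tfrac{1}{2}\bigl(\mathrm{PPL}_S(N) + \mathrm{PPL}_T(N)\bigr),$$

and a joint Speech|Text model with perplexity $\mathrm{PPL}_{S|T}(N)$ shows synergy when

$$\frac{B_{S,T}(N)}{\mathrm{PPL}_{S|T}(N)} > 1$$

(this is the ratio plotted against the competition barrier in the paper's Figure 5), and modality competition when that ratio falls below 1.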
Sorry, I think that particular sentence of mine was poorly written (and got appropriate pushback at the time). I still endorse my followup comment, which includes this clarification:
The thing I’m not that interested in (from a “how scared should we be” or “timelines” perspective) is when you take a bunch of different tasks, shove them into a single “generic agent”, and the resulting agent is worse on most of the tasks and isn’t correspondingly better at some new task that none of the previous systems could do.
In particular, my impression with Gato is that it was not showing much synergy. I agree that synergy is possible and likely to increase with additional scale (and I’m pretty sure I would have said so at the time, especially since I cited a different example of positive transfer).
(Note I haven’t read the mixed-modal scaling laws paper in detail so I may be missing an important point about it.)