Sorry to hijack your comment, but I’m curious as to whether this shortens your timelines at all or makes you any more pessimistic about alignment.
Not really? On timelines, I haven’t looked through the results, so maybe they’re more surprising than they look on a brief skim, but “you can do multitask learning with a single network” feels totally unsurprising given past results. Like, if nothing else the network could allocate 10% of itself to each domain; 100M parameters are more than enough to show good performance in these domains (robotics often uses far fewer parameters, iirc). But I would also have expected some transfer between tasks, so that you’d do better than that would naively predict. I’ve seen this before—iirc there was a result (from Pieter Abbeel’s lab? EDIT: this one. EDIT 2: see caveats on this paper, though it doesn’t affect my point) a couple of years ago that showed that pretraining a model on language would lead to improved sample efficiency in some nominally-totally-unrelated RL task, or something like that. Unfortunately I can’t find it on a quick Google now (and it’s possible it never made it into a paper and I heard it via word of mouth).
Having not read the detailed results yet, I would be quite surprised if it performed better on language-only tasks than a pretrained language model of the same size; I’d be a little surprised if it performed better on robotics / RL tasks than a specialized model of the same size given the same amount of robotics data.
In general, from a “timelines to risky systems” perspective, I’m not that interested in these sorts of “generic agents” that can do all the things with one neural net; it seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks, and so that’s what I expect to see.
On pessimism, I’m not sure why I should update in any direction on this result, even if I thought this was surprisingly fast progress which I don’t. I guess shorter timelines would increase pessimism just by us having less time to prepare, but I don’t see any other good reason for increased pessimism.
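The “allocate 10% of itself to each domain” point can be made concrete with a toy construction (mine, not anything from the Gato paper): a block-diagonal weight matrix packs k independent task-specific networks into one set of weights, with zero interference between tasks by construction.

```python
# Toy illustration: a single linear layer whose weight matrix is
# block-diagonal acts exactly like k independent task networks
# packed into one "big" network.
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 8                       # 4 tasks, 8-dim features per task

blocks = [rng.normal(size=(d, d)) for _ in range(k)]
W = np.zeros((k * d, k * d))      # one big network's weights
for i, B in enumerate(blocks):
    W[i*d:(i+1)*d, i*d:(i+1)*d] = B   # task i gets its own 1/k slice

x = rng.normal(size=d)            # an input for task 2
x_big = np.zeros(k * d)
x_big[2*d:3*d] = x                # route it to task 2's slice

# The big network's output on task 2's slice equals the small
# task-2 network's output; the other tasks' weights are never touched.
assert np.allclose((W @ x_big)[2*d:3*d], blocks[2] @ x)
```

This is the “multitask without transfer” baseline: a network big enough to contain all the specialized models can trivially match them, so matching alone is not evidence of anything interesting.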
there was a result (from Pieter Abbeel’s lab?) a couple of years ago that showed that pretraining a model on language would lead to improved sample efficiency in some nominally-totally-unrelated RL task
Pretrained Transformers as Universal Computation Engines. From the abstract:
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning – in particular [...] a variety of sequence classification tasks spanning numerical, computation, vision, and protein fold prediction
That’s the one, thanks!
Maybe you already had better models here, but I was pretty unsure whether we needed any additional substantial advances to get good multimodal learning. As Gwern says in his comment:
it used to be that learning multiple Atari games in parallel was really hard. It did not work, at all. You got negative transfer even within ALE. People thought very hard and ran lots of experiments to try to create things like Popart less than 4 years ago, where it was a triumph that, due to careful engineering, a single checkpoint could play just the ALE-57 games with mediocre performance.
Decision Transformer definitely made ‘multi-task learning is a blessing of scale’ the default hypothesis, but no one had actually shown this, the past DT and other work (aside from MetaMimic) were all rather low n and k; you could wonder if they would interfere at a certain point or break down, and require fancy engineering like MoEs to enable learning at all. (Similarly, Andy Jones showed nice scaling laws for DRL and I scraped together a few examples like Ms Pacman, but nothing across really complicated tasks or many tasks.)
For me this paper wasn’t a massive update, but it did update me some that we really don’t need much of any fundamental advancement to build very big agents that learn on multimodal data, and that barrier was one of the last ones to making AGI just one big “scaling up” step away, which has implications for both my timeline models and my risk models.
it did update me some that we really don’t need much of any fundamental advancement to build very big agents that learn on multimodal data
I’ve been saying this for some time now. (I often feel like I’m more confident in the bio anchors approach to timelines than Open Phil is, because I’m more willing to say “yes we literally could scale up 2020 algorithms and get TAI, given some engineering effort and enough good data, without any fundamental advances”.)
My explanation for the negative transfer in ALE is that ALE isn’t sufficiently diverse / randomized; you can see this in CoinRun (see “the diversity hypothesis” in Understanding RL vision), where you only get interpretable vision features for aspects of the environment that were randomized. In contrast, image classifiers trained on real world images have interpretable vision features at all layers except perhaps some of the later ones, and often lead to positive transfer on new tasks.
A big part of my model predicting what kind of transfer does and doesn’t work in deep learning is figuring out to what extent I expect there to be large entangled variation in the features of the training data. If this variation is present, then I expect the neural network is forced to learn the real actual feature, and there isn’t some other simpler program that happens to get it correct in only the training situations. If you have the real actual feature, then you’re going to transfer better.
You usually don’t get sufficient diversity with programmatically generated data, but you do get it with real-world data, because reality is laced together very tightly. So I often expect transfer to be a lot harder with programmatically generated data (unless the transfer is only to things that “could have been” programmatically generated, as was the case in e.g. XLand).
(I was initially going to say I believed this in 2019, but looking back at my notes from the time, I wrote very related stuff but didn’t actually write this particular thing. I’ve definitely been saying it in public talks since about the beginning of 2022. So I probably started believing this at some point in between.)
I’m more willing to say “yes we literally could scale up 2020 algorithms and get TAI, given some engineering effort and enough good data, without any fundamental advances
Interesting, thanks, I thought you were much more in agreement with Ajeya’s view (and therefore similarly uncertain about the probability that 2020's algorithms would scale up, etc.). Do you in fact have shorter timelines than Ajeya now, or is there something else that pushes you towards longer timelines than her in a way that cancels out?
I… don’t particularly remember that as a major difference between us? Does she actually lengthen timelines significantly based on not knowing whether 2020 algorithms would scale up?
I do recall her talking about putting more weight on long horizons / evolution out of general uncertainty or “some problem will come up” type intuitions. I didn’t like this method of dealing with it, but I do agree with the intuition, though for me it’s a bit more precise, something like “deployment is difficult; you need to be extremely robust, much more so than humans, it’s a lot of work to iron out all such problems”. I incorporated it by taking the model’s output and pushing my timelines further out than the model said—see “accounting for challenges” in my opinion.
(Though looking back at that I notice that my intuitions say those timelines are slightly too long, like maybe the median should be 2045. I think the biggest change there is reflecting on how the bio anchors model doesn’t incorporate AI-driven acceleration of AI research before TAI happens.)
Maybe I misinterpreted you and/or her, sorry. I guess I was eyeballing Ajeya’s final distribution and seeing how much of it is above the genome anchor / medium horizon anchor, and thinking that when someone says “we literally could scale up 2020 algorithms and get TAI” they are imagining something less expensive than that (since arguably medium/genome and above, especially evolution, represents doing a search for algorithms rather than scaling up an existing algorithm, and also takes such a ridiculously large amount of compute that it’s weird to say we “could” scale up to it). So I was thinking that probability mass in “yes we could literally scale existing algorithms” is probability mass below +12 OOMs, basically, whereas Ajeya is at 50% by +12. I see I was probably misunderstanding you; you meant scaling up existing algorithms to include stuff like the genome and long-horizon anchors? But you agree it doesn’t include evolution, right?
All of the short-horizon, medium-horizon, or long-horizon paths would count as “scaling up 2020 algorithms”.
I mostly ignore the genome anchor (see “Ignoring the genome anchor” in my opinion).
I’m not entirely sure how you’re imagining redoing evolution. If you’re redoing it by creating a multiagent environment simulation, with the agents implemented via neural networks updated using some form of gradient descent, I think that’s “scaling up 2020 algorithms”.
If you instead imagine having a long string of parameters (analogous to DNA) that tells you how to build a brain for the agent, and then learning involves making a random change to the long string of parameters and seeing how that goes, and keeping it if it’s good—I agree that’s not “scaling up 2020 algorithms”.
thinking that when someone says “we literally could scale up 2020 algorithms and get TAI” they are imagining something less expensive than that
I just literally mean “there is some obscene amount of compute, such that if you use that much compute with 2020 algorithms, and you did some engineering to make sure you could use that compute effectively (things more like hyperparameter tuning and less like inventing Transformers), and you got the data that was needed (who knows what that is), then you get TAI”. That’s the belief that makes you take bio anchors more seriously. Pre-bio-anchors, it would have been hard for me to give you a specific number for the obscene compute that would be needed.
Right, OK.
Pre-bio-anchors, couldn’t you have at least thought that recapitulating evolution would be enough? Or are you counting that as part of the bio anchors framework?
What exactly does “recapitulating evolution” mean? If you mean simulating our laws of physics in an initial state that is as big as the actual world and includes, say, a perfect simulation of bacteria, and then letting the simulation evolve for the equivalent of billions of years until some parts of the environment implement general intelligence, then sure, that would be enough, but also that’s way way more compute than the evolution anchor (and also we don’t have the knowledge to set up the initial state right). (You could even then be worried about anthropic arguments saying that this won’t work.)
If you instead mean that we have some simulated environment that we hope resembles the ancestral environment, and we put in simulated animal bodies with a neural network to control them, and then train those neural networks with current gradient descent or evolutionary algorithms, I would not then and do not now think that such an approach is clearly going to produce TAI given evolutionary anchor levels of compute.
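For scale, here is the kind of back-of-envelope arithmetic behind the evolution anchor; every figure below is a rough assumption of the same flavor as the bio anchors report’s numbers, not a quotation from it.

```python
# Back-of-envelope "evolution anchor": total neural computation performed
# by all animal nervous systems over evolutionary history. Both inputs
# are order-of-magnitude assumptions, not measured quantities.
SECONDS_OF_EVOLUTION = 1e16   # ~hundreds of millions of years of neural life
POP_BRAIN_FLOPS = 1e25        # assumed aggregate FLOP/s across all animal brains

evolution_anchor = SECONDS_OF_EVOLUTION * POP_BRAIN_FLOPS
print(f"evolution anchor ~ {evolution_anchor:.0e} FLOP")

# A faithful physics simulation that re-runs evolution "for real" would cost
# many orders of magnitude more than this, which is the point being made above.
```

The useful takeaway is only the shape of the estimate: the evolution anchor prices out the brains themselves, not the environment that selected them.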
In general I’m not that interested in these sorts of “generic agents” that can do all the things with one neural net and don’t think they affect the relevant timelines very much; it seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks and so that’s what I expect to see.
Aren’t you worried about agents that can leverage extremely complex knowledge of the world (like Flamingo has), gained via text, picture, video, etc. inputs, on a robotic controller? Think of an RL agent that can learn how to play Montezuma’s Revenge extremely quickly because it consumed so much internet data that it knows what a “key” and a “rope” are, and that these in-game objects are analogous to the images it saw in pretraining. Something like that getting a malicious command in real life on a physical robot seems terrifying—it would be able to form extremely complex plans to achieve a malicious goal in its environment—and at least from what I can tell from the Gato paper, the only missing ingredient at this point might be “more parameters/TPUs”.
I agree that will happen eventually, and the more nuanced version of my position is the one I outlined in my comment on CAIS:
Now I would say that there is some level of data, model capacity, and compute at which an end-to-end / monolithic approach outperforms a structured approach on the training distribution (this is related to but not the same as the bitter lesson). However, at low levels of these three, the structured approach will typically perform better. The required levels at which the end-to-end approach works better depends on the particular task, and increases with task difficulty.
Since we expect all three of these factors to grow over time, I then expect that there will be an expanding Pareto frontier where at any given point the most complex tasks are performed by structured approaches, but as time progresses these are replaced by end-to-end / monolithic systems (but at the same time new, even more complex tasks are found, that can be done in a structured way).
I think when we are first in the situation where AI systems are sufficiently competent to wrest control away from humanity if they wanted to, we would plausibly have robots that take in audiovisual input and can flexibly perform tasks that a human says to them (think of e.g. a household robot butler). So in that sense I agree that eventually we’ll have agents that link together language, vision, and robotics.
The thing I’m not that interested in (from a “how scared should we be” or “timelines” perspective) is when you take a bunch of different tasks, shove them into a single “generic agent”, and the resulting agent is worse on most of the tasks and isn’t correspondingly better at some new task that none of the previous systems could do.
So if for example you could draw an arrow on an image showing what you wanted a robot to do, and the robot then did that, that would be a novel capability that couldn’t be done by previous specialized systems (probably), and I’d be interested in that. It doesn’t look like this agent does that.
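The structured-vs-end-to-end crossover described above can be caricatured with two performance curves; all numbers here are made up purely for illustration.

```python
# Toy model of the expanding Pareto frontier: structured systems start
# ahead at low resources, end-to-end systems improve faster with scale,
# so there is a crossover point that moves to harder tasks over time.

def perf_structured(log_compute):
    # decent at low resources, flatter improvement slope
    return 50 + 5 * log_compute

def perf_end_to_end(log_compute):
    # poor at low resources, steeper improvement slope
    return 20 + 10 * log_compute

# Log-compute at which the curves meet: (50 - 20) / (10 - 5) = 6.0
crossover = (50 - 20) / (10 - 5)
assert perf_structured(crossover) == perf_end_to_end(crossover)
assert perf_structured(3) > perf_end_to_end(3)   # below: structured wins
assert perf_end_to_end(9) > perf_structured(9)   # above: end-to-end wins
```

Nothing hinges on the linear form; any pair of curves with that intercept/slope relationship gives the same qualitative picture.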
Does that mean the Socratic Models result from a few weeks ago, which does involve connecting more specialised models together, is a better example of progress?
Yes.
To me the relevant result/trend is that catastrophic forgetting seems to be less of an issue than it was maybe two to three years ago (e.g. in meta-learning), and that we can squeeze these diverse skills into a single model. Sure, the results seem to indicate that individual systems for different tasks would still be the way to go for now, but the published version was not trained with the same magnitude of compute that was used on, e.g., the latest and greatest LLMs (I take this from Lennart Heim, who did the math on this). So it is IMO hard to say with certainty whether there are timeline-affecting surprises lurking if we either just trained longer or had faster hardware. I didn’t expect double descent and grokking, so my prior is that unexpected stuff happens.
I definitely agree that your timelines should take into account “maybe there will be a surprise”.
“There can be surprises” cuts both ways; you can also see e.g. a surprise slowdown of scaling results.
I also didn’t expect double descent and grokking but it’s worth noting that afaict those have had ~zero effects on SOTA capabilities so far.
Regardless, the original question was about this particular result; this particular result was not surprising (given my very brief skim).
On catastrophic forgetting:
I agree that catastrophic forgetting is becoming less of an issue at larger scale but I already believed and expected that; it seemed like something like that had to be true for all of the previous big neural net results (OpenAI Five, AlphaStar, language models, etc) to be working as well as they were.
… Where is that impression coming from? If this is a widespread view, I could just be wrong about it; I have a cached belief that large language models and probably other models aren’t trained to the interpolation threshold and so aren’t leveraging double descent.
I haven’t kept track of dataset size vs model size, but things I’ve read on the double descent phenomenon have generally described it as a unified model of the “classic statistics” paradigm where you need to deal with the bias-variance tradeoff, versus the “modern ML” paradigm where bigger=better.
I guess it may depend on the domain? Generative tasks like language modelling or image encoding implicitly end up having a lot more bits/sample than discriminative tasks? So maybe generative tasks are usually not in the second descent regime while discriminative tasks usually are?
I like your point that “surprises cut both ways” and assume that this is why your timelines aren’t affected by the possibility of surprises; is that about right? I am confused about the ~zero effect, though: isn’t double descent basically what we see with giant language models lately? Disclaimer: I don’t work on LLMs myself, so my confusion isn’t necessarily meaningful.
My timelines are affected by the possibility of surprises; it makes them wider on both ends.
My impression is that giant language models are not trained to the interpolation point (though I haven’t been keeping up with the literature for the last year or so). I believe the graphs in that post were created specifically to demonstrate that if you did train them past the interpolation point, then you would see double descent.
Having not read the detailed results yet, I would be quite surprised if [Gato] performed better on language-only tasks than a pretrained language model of the same size...
In general, from a “timelines to risky systems” perspective, I’m not that interested in these sorts of “generic agents” that can do all the things with one neural net; it seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks and so that’s what I expect to see.
Given these laws, we can now make predictions about what scale will be required to overcome modality competition and achieve synergy from training on each pair of modalities. By modality competition, we refer to the empirical phenomenon of two modalities performing worse than if we trained two individual models on the same number of per-modality tokens. By synergy, we mean the inverse. We can define the notion of synergy formally through our scaling laws. [...]
We plot the ratio of the average of the Speech and Text models perplexity per timestep by Speech|Text perplexity, the competition barrier and predictions from our scaling laws in Figure 5. As we see, the prediction does hold, and we achieve a model that crosses the competition barrier. Further scaling is likely to further improve the synergy, but we leave this exploration to future work.
Sorry, I think that particular sentence of mine was poorly written (and got appropriate pushback at the time). I still endorse my followup comment, which includes this clarification:
The thing I’m not that interested in (from a “how scared should we be” or “timelines” perspective) is when you take a bunch of different tasks, shove them into a single “generic agent”, and the resulting agent is worse on most of the tasks and isn’t correspondingly better at some new task that none of the previous systems could do.
In particular, my impression with Gato is that it was not showing much synergy. I agree that synergy is possible and likely to increase with additional scale (and I’m pretty sure I would have said so at the time, especially since I cited a different example of positive transfer).
(Note I haven’t read the mixed-modal scaling laws paper in detail so I may be missing an important point about it.)
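For concreteness, the quoted “competition barrier” definition amounts to a one-line check (paraphrased from the abstract; the perplexity numbers below are invented for illustration):

```python
# Paraphrase of the quoted definition: a mixed-modal model "competes" if its
# perplexity is worse than the average of the two single-modality models,
# and shows "synergy" once it beats that average.

def crosses_competition_barrier(ppl_mixed, ppl_a, ppl_b):
    """True if the mixed model beats the average of the unimodal models."""
    barrier = (ppl_a + ppl_b) / 2   # the competition barrier
    return ppl_mixed < barrier

# Invented numbers: unimodal models at perplexity 20.0 and 22.0.
assert not crosses_competition_barrier(ppl_mixed=24.0, ppl_a=20.0, ppl_b=22.0)
assert crosses_competition_barrier(ppl_mixed=19.0, ppl_a=20.0, ppl_b=22.0)
```

On this framing, my claim about Gato is that it was sitting on the wrong side of this inequality for most task pairs, not that crossing it is impossible at larger scale.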
Like, if nothing else the network could allocate 10% of itself to each domain; 100M parameters are more than enough to show good performance in these domains (robotics often uses far fewer parameters iirc).
Are you suggesting that this isn’t really a “general” agent any more than the combination of several separate models trained independently would be? And that this is just several different agents that happened to be trained in a network that’s big enough to contain all of them?
I don’t really know what you mean by a “general” agent. Here are some properties that I would guess it has (caveating again that I haven’t read the paper in detail), which may or may not be related to what you mean by “generality”:
Given an input, it can tell which task it is supposed to do, and then do the relevant task.
Some of the tasks do benefit from the training done on other tasks (“positive transfer”), presumably because some of the basic building blocks of the needed programs are the same (“look at the token that was one place prior” is probably helpful for many tasks).
It has some neurons that are used in multiple different tasks (presumably).
It cannot learn new tasks particularly quickly (“few-shot learning”), except inasmuch as that could already be done with language models.
It does not do any “learning with frozen weights” (i.e. the sort of thing where you prompt a language model to define a new word, and then it can use that word later on, without any gradient descent), except inasmuch as the specialized models would also do that learning.
It is about as well-modeled as an expected utility maximizer as the specialized models would be.
“Multi-Game Decision Transformers”, Lee et al 2022 is worth a close look, especially for Gato critics.
Interesting, thanks, I thought you were much more in agreement with Ajeya’s view (and therefore similarly uncertain about the probability that 2020′s algorithms would scale up etc.) Do you in fact have shorter timelines than Ajeya now, or is there something else that pushes you towards longer timelines than her in a way that cancels out?
I… don’t particularly remember that as a major difference between us? Does she actually lengthen timelines significantly based on not knowing whether 2020 algorithms would scale up?
I do recall her talking about putting more weight on long horizons / evolution out of general uncertainty or “some problem will come up” type intuitions. I didn’t like this method of dealing with it, but I do agree with the intuition, though for me it’s a bit more precise, something like “deployment is difficult; you need to be extremely robust, much more so than humans, it’s a lot of work to iron out all such problems”. I incorporated it by taking the model’s output and pushing my timelines further out than the model said—see “accounting for challenges” in my opinion.
(Though looking back at that I notice that my intuitions say those timelines are slightly too long, like maybe the median should be 2045. I think the biggest change there is reflecting on how the bio anchors model doesn’t incorporate AI-driven acceleration of AI research before TAI happens.)
Maybe I misinterpreted you and/or her sorry. I guess I was eyeballing Ajeya’s final distribution and seeing how much of it is above the genome anchor / medium horizon anchor, and thinking that when someone says “we literally could scale up 2020 algorithms and get TAI” they are imagining something less expensive than that (since arguably medium/genome and above, especially evolution, represents doing a search for algorithms rather than scaling up an existing algorithm, and also takes such a ridiculously large amount of compute that it’s weird to say we “could” scale up to it.) So I was thinking that probability mass in “yes we could literally scale existing algorithms” is probability mass below +12 OOMs basically. Wheras Ajeya is at 50% by +12. I see I was probably misunderstanding you; you meant scaling up existing algorithms to include stuff like genome and long-horizon anchor? But you agree it doesn’t include evolution, right?)
All of the short-horizon, medium-horizon, or long-horizon paths would count as “scaling up 2020 algorithms”.
I mostly ignore the genome anchor (see “Ignoring the genome anchor” in my opinion).
I’m not entirely sure how you’re imagining redoing evolution. If you’re redoing it by creating a multiagent environment simulation, with the agents implemented via neural networks updated using some form of gradient descent, I think that’s “scaling up 2020 algorithms”.
If you instead imagine having a long string of parameters (analogous to DNA) that tells you how to build a brain for the agent, and then learning involves making a random change to the long string of parameters and seeing how that goes, and keeping it if it’s good—I agree that’s not “scaling up 2020 algorithms”.
I just literally mean “there is some obscene amount of compute, such that if you use that much compute with 2020 algorithms, and you did some engineering to make sure you could use that compute effectively (things more like hyperparameter tuning and less like inventing Transformers), and you got the data that was needed (who knows what that is), then you get TAI”. That’s the belief that makes you take bio anchors more seriously. Pre-bio-anchors, it would have been hard for me to give you a specific number for the obscene compute that would be needed.
Right, OK.
Pre bio-anchors couldn’t you have at least thought that recapitulating evolution would be enough? Or are you counting that as part of the bio anchors framework?
What exactly does “recapitulating evolution” mean? If you mean simulating our laws of physics in an initial state that is as big as the actual world and includes, say, a perfect simulation of bacteria, and then letting the simulation evolve for the equivalent of billions of years until some parts of the environment implement general intelligence, then sure, that would be enough, but also that’s way way more compute than the evolution anchor (and also we don’t have the knowledge to set up the initial state right). (You could even then be worried about anthropic arguments saying that this won’t work.)
If you instead mean that we have some simulated environment that we hope resembles the ancestral environment, and we put in simulated animal bodies with a neural network to control them, and then train those neural networks with current gradient descent or evolutionary algorithms, I would not then and do not now think that such an approach is clearly going to produce TAI given evolutionary anchor levels of compute.
Aren’t you worried about agents that can leverage extremely complex knowledge of the world (like Flamingo has) that they gained via text, picture, video, etc inputs, on a robotic controller? Think of an RL agent that can learn how to play Montezuma’s Revenge extremely quickly, because it consumed so much internet data that it knows what a “key” and “rope” are, and that these in-game objects are analogous to those images it saw in pretraining. Something like that getting a malicious command in real life on a physical robot seems terrifying—it would be able to form extremely complex plans in order to achieve a malicious goal, given its environment—and at least from what I can tell from the Gato paper, the only missing ingredient at this point might be “more parameters/TPUs”
I agree that will happen eventually, and the more nuanced version of my position is the one I outlined in my comment on CAIS:
I think when we are first in the situation where AI systems are sufficiently competent to wrest control away from humanity if they wanted to, we would plausibly have robots that take in audiovisual input and can flexibly perform tasks that a human says to them (think of e.g. a household robot butler). So in that sense I agree that eventually we’ll have agents that link together language, vision, and robotics.
The thing I’m not that interested in (from a “how scared should we be” or “timelines” perspective) is when you take a bunch of different tasks, shove them into a single “generic agent”, and the resulting agent is worse on most of the tasks and isn’t correspondingly better at some new task that none of the previous systems could do.
So if for example you could draw an arrow on an image showing what you wanted a robot to do, and the robot then did that, that would be a novel capability that couldn’t be done by previous specialized systems (probably), and I’d be interested in that. It doesn’t look like this agent does that.
Does that mean the socratic models result from a few weeks ago, which does involve connecting more specialised models together, is a better example of progress?
Yes
To me the relevant result/trend is that catastrophic forgetting seems to be becoming less of an issue than it was maybe two to three years ago (e.g. in meta-learning), and that we can squeeze these diverse skills into a single model. Sure, the results seem to indicate that individual systems for different tasks would still be the way to go for now, but at least the published version was not trained with the same magnitude of compute that was used on, e.g., the latest and greatest LLMs (I take this from Lennart Heim, who did the math on this). So it is IMO hard to say with any certainty whether there are timeline-affecting surprises lurking if we either just trained longer or had faster hardware. I didn’t expect double descent and grokking, so my prior is that unexpected stuff happens.
On surprises:
I definitely agree that your timelines should take into account “maybe there will be a surprise”.
“There can be surprises” cuts both ways; you can also see e.g. a surprise slowdown of scaling results.
I also didn’t expect double descent and grokking but it’s worth noting that afaict those have had ~zero effects on SOTA capabilities so far.
Regardless, the original question was about this particular result; this particular result was not surprising (given my very brief skim).
On catastrophic forgetting:
I agree that catastrophic forgetting is becoming less of an issue at larger scale but I already believed and expected that; it seemed like something like that had to be true for all of the previous big neural net results (OpenAI Five, AlphaStar, language models, etc) to be working as well as they were.
I was under the impression that basically all SOTA capabilities rely on double descent. Is that impression wrong?
… Where is that impression coming from? If this is a widespread view, I could just be wrong about it; I have a cached belief that large language models and probably other models aren’t trained to the interpolation threshold and so aren’t leveraging double descent.
I haven’t kept track of dataset size vs model size, but things I’ve read on the double descent phenomenon have generally described it as a unified model of the “classic statistics” paradigm where you need to deal with the bias-variance tradeoff, versus the “modern ML” paradigm where bigger=better.
I guess it may depend on the domain? Generative tasks like language modelling or image encoding implicitly end up having a lot more bits/sample than discriminative tasks? So maybe generative tasks are usually not in the second descent regime while discriminative tasks usually are?
I like your point that “surprises cut both ways” and assume that this is why your timelines aren’t affected by the possibility of surprises; is that about right? I am confused about the ~zero effect, though: isn’t double descent basically what we see with giant language models lately? (Disclaimer: I don’t work on LLMs myself, so my confusion isn’t necessarily meaningful.)
My timelines are affected by the possibility of surprises; it makes them wider on both ends.
My impression is that giant language models are not trained to the interpolation point (though I haven’t been keeping up with the literature for the last year or so). I believe the graphs in that post were created specifically to demonstrate that if you did train them past the interpolation point, then you would see double descent.
Do you still believe this in light of the paper on mixed-modal scaling laws?
From the paper,
Sorry, I think that particular sentence of mine was poorly written (and got appropriate pushback at the time). I still endorse my followup comment, which includes this clarification:
In particular, my impression with Gato is that it was not showing much synergy. I agree that synergy is possible and likely to increase with additional scale (and I’m pretty sure I would have said so at the time, especially since I cited a different example of positive transfer).
(Note I haven’t read the mixed-modal scaling laws paper in detail so I may be missing an important point about it.)
Are you suggesting that this isn’t really a “general” agent any more than the combination of several separate models trained independently would be? And that this is just several different agents that happened to be trained in a network that’s big enough to contain all of them?
I don’t really know what you mean by a “general” agent. Here are some properties that I would guess it has (caveating again that I haven’t read the paper in detail), which may or may not be related to what you mean by “generality”:
Given an input, it can tell which task it is supposed to do, and then do the relevant task.
Some of the tasks do benefit from the training done on other tasks (“positive transfer”), presumably because some of the basic building blocks of the needed programs are the same (“look at the token that was one place prior” is probably helpful for many tasks).
It has some neurons that are used in multiple different tasks (presumably).
It cannot learn new tasks particularly quickly (“few-shot learning”), except inasmuch as that could already be done with language models.
It does not do any “learning with frozen weights” (i.e. the sort of thing where you prompt a language model to define a new word, and then it can use that word later on, without any gradient descent), except inasmuch as the specialized models would also do that learning.
It is about as well-modeled as an expected utility maximizer as the specialized models would be.