Maybe you already had better models here, but I was pretty unsure whether we needed any additional substantial advances to get good multimodal learning. As Gwern says in his comment:
it used to be that learning multiple Atari games in parallel was really hard. It did not work, at all. You got negative transfer even within ALE. People thought very hard and ran lots of experiments to try to create things like Popart less than 4 years ago where it was a triumph that, due to careful engineering a single checkpoint could play just the ALE-57 games with mediocre performance.
Decision Transformer definitely made ‘multi-task learning is a blessing of scale’ the default hypothesis, but no one had actually shown this, the past DT and other work (aside from MetaMimic) were all rather low n and k; you could wonder if they would interfere at a certain point or break down, and require fancy engineering like MoEs to enable learning at all. (Similarly, Andy Jones showed nice scaling laws for DRL and I scraped together a few examples like Ms Pacman, but nothing across really complicated tasks or many tasks.)
For me this paper wasn’t a massive update, but it did update me some that we really don’t need much of any fundamental advancement to build very big agents that learn on multimodal data, and that barrier was one of the last ones to making AGI just one big “scaling up” step away, which has implications for both my timeline models and my risk models.
it did update me some that we really don’t need much of any fundamental advancement to build very big agents that learn on multimodal data
I’ve been saying this for some time now. (I often feel like I’m more confident in the bio anchors approach to timelines than Open Phil is, because I’m more willing to say “yes we literally could scale up 2020 algorithms and get TAI, given some engineering effort and enough good data, without any fundamental advances”.)
My explanation for the negative transfer in ALE is that ALE isn’t sufficiently diverse / randomized; you can see this in CoinRun (see “the diversity hypothesis” in Understanding RL vision), where you only get interpretable vision features for aspects of the environment that were randomized. In contrast, image classifiers trained on real world images have interpretable vision features at all layers except perhaps some of the later ones, and often lead to positive transfer on new tasks.
A big part of my model predicting what kind of transfer does and doesn’t work in deep learning is figuring out to what extent I expect there to be large entangled variation in the features of the training data. If this variation is present, then I expect the neural network is forced to learn the real actual feature, and there isn’t some other simpler program that happens to get it correct in only the training situations. If you have the real actual feature, then you’re going to transfer better.
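A toy sketch of the "entangled variation" point, under heavy assumptions: this uses a two-feature logistic regression rather than a neural network, and the feature names (`real`, `shortcut`), noise level, and training settings are all made up for illustration. When the shortcut feature is never randomized in training, the model can lean on it; when it is randomized, the model has to rely on the real feature, and it does better once the shortcut stops tracking the label.

```python
import math
import random

def make_data(n, randomize_shortcut, rng):
    """Each example has a noisy 'real' feature and a clean 'shortcut' feature."""
    data = []
    for _ in range(n):
        y = rng.randrange(2)
        sign = 2 * y - 1
        real = sign + rng.gauss(0.0, 1.0)  # always informative, but noisy
        # The shortcut tracks the label exactly unless it is randomized.
        shortcut = rng.choice([-1, 1]) if randomize_shortcut else sign
        data.append(((real, shortcut), y))
    return data

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=300, lr=0.5):
    """Plain full-batch gradient descent on the logistic loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, y in data:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b

def accuracy(w, b, data):
    correct = sum(
        (w[0] * x[0] + w[1] * x[1] + b > 0) == (y == 1) for x, y in data
    )
    return correct / len(data)

rng = random.Random(0)
narrow_train = make_data(500, randomize_shortcut=False, rng=rng)
diverse_train = make_data(500, randomize_shortcut=True, rng=rng)
# At test time the shortcut no longer tracks the label.
test_set = make_data(2000, randomize_shortcut=True, rng=rng)

w_narrow, b_narrow = train(narrow_train)
w_diverse, b_diverse = train(diverse_train)
acc_narrow = accuracy(w_narrow, b_narrow, test_set)
acc_diverse = accuracy(w_diverse, b_diverse, test_set)
print("narrow training :", w_narrow, acc_narrow)
print("diverse training:", w_diverse, acc_diverse)
```

This only shows the direction of the effect in a linear toy; it says nothing about how much diversity a real network needs.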
You usually don’t get sufficient diversity with programmatically generated data, but you do get it with real-world data, because reality is laced together very tightly. So I often expect transfer to be a lot harder with programmatically generated data (unless the transfer is only to things that “could have been” programmatically generated, as was the case in e.g. XLand).
(I was initially going to say I believed this in 2019, but looking back at my notes from the time, I wrote very related stuff but didn’t actually write this particular thing. I’ve definitely been saying it in public talks since about the beginning of 2022. So I probably started believing this some time in between then.)
I’m more willing to say “yes we literally could scale up 2020 algorithms and get TAI, given some engineering effort and enough good data, without any fundamental advances”
Interesting, thanks, I thought you were much more in agreement with Ajeya’s view (and therefore similarly uncertain about the probability that 2020’s algorithms would scale up etc.). Do you in fact have shorter timelines than Ajeya now, or is there something else that pushes you towards longer timelines than her in a way that cancels out?
I… don’t particularly remember that as a major difference between us? Does she actually lengthen timelines significantly based on not knowing whether 2020 algorithms would scale up?
I do recall her talking about putting more weight on long horizons / evolution out of general uncertainty or “some problem will come up” type intuitions. I didn’t like this method of dealing with it, but I do agree with the intuition, though for me it’s a bit more precise, something like “deployment is difficult; you need to be extremely robust, much more so than humans, it’s a lot of work to iron out all such problems”. I incorporated it by taking the model’s output and pushing my timelines further out than the model said—see “accounting for challenges” in my opinion.
(Though looking back at that I notice that my intuitions say those timelines are slightly too long, like maybe the median should be 2045. I think the biggest change there is reflecting on how the bio anchors model doesn’t incorporate AI-driven acceleration of AI research before TAI happens.)
Maybe I misinterpreted you and/or her, sorry. I guess I was eyeballing Ajeya’s final distribution and seeing how much of it is above the genome anchor / medium horizon anchor, and thinking that when someone says “we literally could scale up 2020 algorithms and get TAI” they are imagining something less expensive than that (since arguably medium/genome and above, especially evolution, represents doing a search for algorithms rather than scaling up an existing algorithm, and also takes such a ridiculously large amount of compute that it’s weird to say we “could” scale up to it). So I was thinking that probability mass in “yes we could literally scale existing algorithms” is basically probability mass below +12 OOMs, whereas Ajeya is at 50% by +12. I see I was probably misunderstanding you; you meant scaling up existing algorithms to include stuff like the genome and long-horizon anchors? But you agree it doesn’t include evolution, right?
All of the short-horizon, medium-horizon, or long-horizon paths would count as “scaling up 2020 algorithms”.
I mostly ignore the genome anchor (see “Ignoring the genome anchor” in my opinion).
I’m not entirely sure how you’re imagining redoing evolution. If you’re redoing it by creating a multiagent environment simulation, with the agents implemented via neural networks updated using some form of gradient descent, I think that’s “scaling up 2020 algorithms”.
If you instead imagine having a long string of parameters (analogous to DNA) that tells you how to build a brain for the agent, and then learning involves making a random change to the long string of parameters and seeing how that goes, and keeping it if it’s good—I agree that’s not “scaling up 2020 algorithms”.
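To make the "random change, keep it if it's good" loop concrete, here is a minimal sketch. Everything in it is hypothetical: the `fitness` function is an arbitrary stand-in for "build the agent from the parameter string and see how it does", and the genome length, step count, and mutation size are made up. The point is only that no gradients are used anywhere.

```python
import random

def fitness(genome):
    # Hypothetical stand-in for evaluating the agent built from this genome:
    # negative squared distance from an arbitrary target vector.
    target = [0.5] * len(genome)
    return -sum((g - t) ** 2 for g, t in zip(genome, target))

def mutate_and_select(genome, steps, step_size, rng):
    """Random-mutation hill climbing: change one parameter, keep it if better."""
    best = list(genome)
    best_score = fitness(best)
    for _ in range(steps):
        candidate = list(best)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.gauss(0.0, step_size)  # one random change
        score = fitness(candidate)
        if score > best_score:  # keep the change only if it helped
            best, best_score = candidate, score
    return best, best_score

rng = random.Random(0)
start = [rng.uniform(-1.0, 1.0) for _ in range(16)]
start_score = fitness(start)
final, final_score = mutate_and_select(start, steps=2000, step_size=0.1, rng=rng)
print("start fitness:", start_score)
print("final fitness:", final_score)
```

Contrast this with the multiagent-simulation case above, where each agent's network would be updated by some form of gradient descent rather than by accept/reject on random edits.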
thinking that when someone says “we literally could scale up 2020 algorithms and get TAI” they are imagining something less expensive than that
I just literally mean “there is some obscene amount of compute, such that if you use that much compute with 2020 algorithms, and you did some engineering to make sure you could use that compute effectively (things more like hyperparameter tuning and less like inventing Transformers), and you got the data that was needed (who knows what that is), then you get TAI”. That’s the belief that makes you take bio anchors more seriously. Pre-bio-anchors, it would have been hard for me to give you a specific number for the obscene compute that would be needed.
Pre-bio-anchors, couldn’t you have at least thought that recapitulating evolution would be enough? Or are you counting that as part of the bio anchors framework?
What exactly does “recapitulating evolution” mean? If you mean simulating our laws of physics in an initial state that is as big as the actual world and includes, say, a perfect simulation of bacteria, and then letting the simulation evolve for the equivalent of billions of years until some parts of the environment implement general intelligence, then sure, that would be enough, but also that’s way way more compute than the evolution anchor (and also we don’t have the knowledge to set up the initial state right). (You could even then be worried about anthropic arguments saying that this won’t work.)
If you instead mean that we have some simulated environment that we hope resembles the ancestral environment, and we put in simulated animal bodies with a neural network to control them, and then train those neural networks with current gradient descent or evolutionary algorithms, I would not then and do not now think that such an approach is clearly going to produce TAI given evolutionary anchor levels of compute.
“Multi-Game Decision Transformers”, Lee et al 2022 is worth a close look, especially for Gato critics.
Right, OK.