Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.
I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
Certainly “several OOMs using tricks and techniques we could implement in a year” would be way faster than that trend, but you’ve really got to wonder why these people haven’t done it yet—if I interpret “several OOMs” as “at least 3 OOMs”, that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I’ll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.
Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years—if so, this seems plausibly consistent with the 2-3 year doubling time.
Some of the people I talked to said about 2 OOMs; others expressed it differently, saying that the faster scaling law can be continued past the kink point predicted by Kaplan et al. Still others simply said that GPT-3 was done in a deliberately simple, non-cutting-edge way to prove a point, and that it could have used its compute much more efficiently if they had thrown the latest bag of tricks at it. I am skeptical of all this, of course, but perhaps less skeptical than you? 2 OOMs is about 7 doublings, which will happen around 2037 according to Ajeya. Would you be willing to take a 10:1 bet that there won’t be something as good as GPT-3 trained on 2 OOMs less compute by 2030? I think I’d take the other side of that bet.
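For concreteness, here’s the back-of-the-envelope arithmetic behind that date, as a rough Python sketch; the ~2.5-year doubling time for algorithmic efficiency and the 2020 starting point are my own round placeholders, not numbers taken from the report:

```python
import math

# 2 OOMs of algorithmic efficiency = a factor of 100, i.e. about 6.6 doublings
# (which I rounded to 7 above).
ooms = 2
doublings = ooms / math.log10(2)       # ~6.64

# Placeholder assumptions: one doubling of algorithmic efficiency every ~2.5
# years, counting from 2020.
doubling_time_years = 2.5
year_reached = 2020 + doublings * doubling_time_years

print(f"{doublings:.1f} doublings -> roughly {year_reached:.0f}")
# ~6.6 doublings at ~2.5 years each lands around 2037, matching the figure above.
```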
Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make.
I don’t think evolution was going for compute-optimal performance in the relevant sense. With AI, we can easily trade off between training models longer and making models bigger, and according to the scaling laws it seems like we should increase training time by 0.75 OOMs for every OOM of parameter-count increase. With biological systems, sure, maybe it is true that if you faced a trade-off where you were trying to minimize the total number of neuron firings over the course of the organism’s childhood, the right ratio would be 0.75 OOMs of extra childhood duration for every 1 OOM of extra synapses… maybe. But even if this were true, it’s pretty non-obvious that that’s the trade-off regime evolution faces. There are all sorts of other pros and cons associated with more synapses and longer childhoods. For example, maybe evolution finds it easier to increase synapse count than to lengthen childhood, because a longer childhood reduces fitness significantly (more chances to die before reproducing, and a longer population doubling time).
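To make the tradeoff I’m pointing at concrete, here’s a minimal sketch of the quoted 0.75-OOM rule; the example numbers are arbitrary illustrations, not claims about any particular system:

```python
def compute_optimal_training_increase(extra_param_ooms: float,
                                      data_per_param_ooms: float = 0.75) -> float:
    """Extra OOMs of training (data / subjective time) implied by the 0.75 rule."""
    return data_per_param_ooms * extra_param_ooms

# Illustration: scale a model up by 2 OOMs (100x more parameters).
extra_params = 2.0
extra_training = compute_optimal_training_increase(extra_params)
extra_compute = extra_params + extra_training  # training compute ~ params * data

print(f"+{extra_params} OOM params -> +{extra_training} OOM training "
      f"-> +{extra_compute} OOM training compute")
# An AI designer can move freely along this line; it's far from obvious that
# evolution faced, or was optimizing within, the same tradeoff.
```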
Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of “transformative AI”. The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.
Yeah, sorry, by useful I meant useful for transformative tasks.
Yes, obviously the tasks in the graph are not transformative. But it seems to me to be… like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it’s because they’ve developed general intelligence in the relevant sense. Or maybe they haven’t, but it’s a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture. Like, yeah, those tasks are “particularly easy” compared to taking over the world, but they are also incredibly hard in some sense; IIRC GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so, whereas every other AI system got 25% (random chance); I bet most English-speaking, literate humans in the world today would have done worse than 50%.
I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has some other parameters changed, though not hugely, IIRC) and the median I get is 2041 (about 10 years earlier than it was previously). So I don’t think this is making a huge difference (though certainly 10 years is substantial).
Huh. When I put 100% mass on short horizon in my version of Ajeya’s model, it says median 2031. Admittedly, I had made some changes to some other parameters too, though also not hugely, IIRC. I wonder if this means those other-parameter changes matter more than I’d thought.
I see horizon length (as used in the report) as a function of a task, so “horizon length of GPT-3” feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw roughly 150 million “effective examples” of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.
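(A quick check of the arithmetic in that quoted paragraph, treating one full 2048-token context window as one “effective example”:)

```python
# GPT-3's training tokens divided into full context windows.
tokens_seen = 300e9          # 300 billion training tokens
context_window = 2048        # tokens per window
effective_examples = tokens_seen / context_window
print(f"{effective_examples:.2e}")   # ~1.46e8, i.e. roughly 150 million
```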
Huh, that’s totally not how I saw it. From Ajeya’s report:
I’ll define the “effective horizon length” of an ML problem as the amount of data it takes (on average) to tell whether a perturbation to the model improves performance or worsens performance. If we believe that the number of “samples” required to train a model of size P is given by KP, then the number of subjective seconds that would be required should be given by HKP, where H is the effective horizon length expressed in units of “subjective seconds per sample.”
To me this really sounds like it’s saying the horizon length = the number of subjective seconds per sample during training. So, maybe it makes sense to talk about “horizon length of task X” (i.e. number of subjective seconds per sample during training of a typical ML model on that task) but it seems to make even more sense to talk about “horizon length of model X” since model X actually had a training run and actually had an average number of subjective seconds per sample.
But I’m happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.
At any rate, deferring to you on this doesn’t undermine the point I was making at all, as far as I can tell.
you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason why I put as much weight on short horizons as I did; I think this was also true for Ajeya.
Exactly. I think this is what humans do too, to a large extent. I’d be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.
I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
I actually don’t remember what I meant to convey with that :/
Would you be willing to take a 10:1 bet that there won’t be something as good as GPT-3 trained on 2 OOMs less compute by 2030?
No, I’d also take the other side of the bet. A few reasons:
The report’s estimate of algorithmic efficiency improvement is low partly because researchers are not currently optimizing for “efficiency on a transformative task”, whereas researchers probably are optimizing for “efficiency of GPT-3-style systems”; this suggests faster improvements in algorithmic efficiency for GPT-3-level performance than the report estimates.
90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report.
(Note that 2 OOMs in 10 years seems significantly different from “we can get several OOMs more data-efficient training than the GPTs had using various already-developed tricks and techniques”. I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)
I don’t think evolution was going for compute-optimal performance in the relevant sense.
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
I think you’d need to identify a specific other property that evolution was optimizing for, one that clearly trades off against compute-efficiency, in order to argue that we should expect evolution to have done worse in this case than in the other cases.
But it seems to me to be… like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it’s because they’ve developed general intelligence in the relevant sense.
This seems to presuppose realism about rationality, which I mostly don’t buy. Still, 25% doesn’t seem crazy; I’d probably put 10 or 20% on it myself. But even at 25% that seems pretty consistent with my timelines; a 25%-probability scenario doesn’t determine the median.
Or maybe they haven’t but it’s a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture.
Why aren’t we already using the most sophisticated training regime and architecture? I agree it will continue to improve, but that’s already what the model does.
GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so, whereas every other AI system got 25% (random chance); I bet most English-speaking, literate humans in the world today would have done worse than 50%.
I don’t particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren’t optimized for that.
I expect that Google search beats GPT-3 on that dataset.
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
But I’m happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.
Er, note that I’ve talked to Ajeya for like an hour or two on the entire report. I’m not that confident that Ajeya also believes the things I’m saying (maybe I’m 80% confident).
To me this really sounds like it’s saying the horizon length = the number of subjective seconds per sample during training. [...]
I agree that the definition used in the report does seem consistent with that. I think that’s mostly because the report assumes that you are training a model to perform a single (transformative) task, and so a definition in terms of the model is equivalent to a definition in terms of the task. The report doesn’t really talk about the unsupervised pretraining approach, so its definitions didn’t have to handle that case.
But like, irrespective of what Ajeya meant, I think the important concept would be task-based. You would want to have different timelines for “when a neural net can do human-level summarization” and “when a neural net can be a human-level personal assistant”, even if you expect to use unsupervised pretraining for both. The only parameter in the model that can plausibly do that is the horizon length. If you don’t use the horizon length for that purpose, I think you should have some other way of incorporating “difficulty of the task” into your timelines.
Exactly. I think this is what humans do too, to a large extent. I’d be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.
I mean, I’m at 30 / 40 / 10, so that isn’t that much of a difference. Half of the difference could be explained by your 25% on general reasoning, vs my (let’s say) 15% on it.
Thanks again. My general impression is that we disagree less than it first appeared, and that our disagreements are currently bottoming out in different intuitions rather than obvious cruxes we can drill down on. Plus I’m getting tired. ;) So, I say we call it a day. To be continued later, perhaps in person, perhaps in future comment chains on future posts!
For the sake of completeness, to answer your questions though:
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
By “hard” I mean something like “Difficult to get AIs to do well.” If we imagine all the tasks we can get AIs to do lined up by difficulty, there is some transformative task A which is least difficult. As the tasks we succeed at getting AIs to do get harder and harder, we must be getting closer to A. I think that getting an AI to do well on all the benchmarks we throw at it despite not being trained for any of them (but rather just trained to predict random internet text) seems like a sign that we are getting close to A. You say this is because I believe in realism about rationality; I hope not, since I don’t believe in realism about rationality. Maybe there’s a contradiction in my views then which you have pointed to, but I don’t see it yet.
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
At this point I feel the need to break things down into premise-conclusion form because I am feeling confused about how the various bits of your argument are connecting to each other. I realize this is a big ask, so don’t feel any particular pressure to do it.
I totally agree that evolution wasn’t optimizing just for power-to-weight ratio. But I never claimed that it was. I don’t think my comparison relied on the assumption that evolution was optimizing for power-to-weight ratio. By contrast, you explicitly said “Evolution was presumably also going for compute-optimal performance.” Once we reject that claim, my original point stands: it’s not clear how we should apply the scaling laws to the human brain, since the scaling laws are about compute-optimal performance, i.e. how you should trade off size and training time if all you care about is minimizing compute. Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all; it cares about size and training time separately, with no particular ratio between them except the one set by the fitness landscape), the laws aren’t directly relevant. In other words, for all we know, if the human brain were 3 OOMs smaller and had 1 OOM more training time, it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot fewer if you buy the idea that evolutionary history is part of its training data.
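To illustrate the kind of mismatch I mean in that last sentence, here’s a toy back-of-the-envelope sketch. Every number in it (the synapse count, the childhood length, and especially the synapses-as-parameters and one-token-per-subjective-second mappings) is a loose placeholder for the shape of the argument, not an estimate from the report:

```python
# Toy illustration: extrapolate the "0.75 OOMs of training per OOM of
# parameters" rule from a GPT-3-like reference point up to a brain-scale
# parameter count, and compare against a rough guess at how much "training"
# a human childhood provides.
import math

# Reference point (GPT-3, from the discussion above).
ref_params = 1.75e11          # parameters
ref_data   = 3.0e11           # training tokens

# Compute-optimal rule quoted above: +0.75 OOM training data per +1 OOM parameters.
data_per_param_ooms = 0.75

# Placeholder brain-side numbers: treat synapses as "parameters" and one
# subjective second as roughly one "token". Both are loose anchoring
# assumptions, not facts.
brain_params = 1e14           # ~synapse count, order of magnitude
childhood_seconds = 3e8       # ~10 years of experience, order of magnitude

extra_param_ooms = math.log10(brain_params / ref_params)
implied_data = ref_data * 10 ** (data_per_param_ooms * extra_param_ooms)

print(f"extra parameter OOMs vs. GPT-3: {extra_param_ooms:.2f}")
print(f"compute-optimal 'tokens' implied: {implied_data:.2e}")
print(f"childhood 'tokens' (1 per subjective second): {childhood_seconds:.2e}")
# With these placeholders, the implied data requirement (~3.5e13) is about five
# OOMs more than a childhood's worth of subjective seconds (~3e8). That is the
# sense in which the brain looks like it has far more synapses than its
# training time would suggest is compute-optimal, if you take the scaling laws
# at face value and accept these mappings.
```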