I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
I actually don’t remember what I meant to convey with that :/
Would you be willing to take a 10:1 bet that there won’t be something as good as GPT-3 trained on 2 OOMs less compute by 2030?
No, I’d also take the other side of the bet. A few reasons:
Estimated algorithmic efficiency in the report is low because researchers are not currently optimizing for “efficiency on a transformative task”, whereas researchers probably are optimizing for “efficiency of GPT-3 style systems”, suggesting faster improvements in algorithmic efficiency for GPT-3 than estimated in the report.
90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report.
(Note that 2 OOMs in 10 years seems significantly different from “we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques”. I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)
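For reference, here is a minimal sketch of how the 10:1 odds in the proposed bet relate to the "90% confidence" figure above. The function name `implied_probability` is just an illustrative choice, and the sketch assumes the bettor is risk-neutral (it only computes the break-even probability).

```python
from fractions import Fraction


def implied_probability(stake: int, payout: int = 1) -> Fraction:
    """Break-even probability for someone risking `stake` to win `payout`.

    Assumes a risk-neutral bettor: the bet is fair when
    p * payout == (1 - p) * stake, i.e. p = stake / (stake + payout).
    """
    return Fraction(stake, stake + payout)


# Offering 10:1 odds means risking 10 units to win 1.
p = implied_probability(10, 1)
print(p, round(float(p), 3))
```

So offering 10:1 only makes sense if you are at roughly 91% or higher, which is where the "90% confidence is quite a lot" remark comes from.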
I don’t think evolution was going for compute-optimal performance in the relevant sense.
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
I think you’d need to argue that there is a specific other property that evolution was optimizing for, that clearly trades off against compute-efficiency, to argue that we should expect that in this case evolution was worse than in other cases.
But it seems to me to be… like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it’s because they’ve developed general intelligence in the relevant sense.
This seems like it is realist about rationality, which I mostly don’t buy. Still, 25% doesn’t seem crazy, I’d probably put 10 or 20% on it myself. But even at 25% that seems pretty consistent with my timelines; 25% does not make the median.
Or maybe they haven’t but it’s a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture.
Why aren’t we already using the most sophisticated training regime and architecture? I agree it will continue to improve, but that’s already what the model does.
GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25%, random chance, and I bet most English-speaking literate humans in the world today would have done worse than 50%.
I don’t particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren’t optimized for that.
I expect that Google search beats GPT-3 on that dataset.
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
But I’m happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.
Er, note that I’ve talked to Ajeya for like an hour or two on the entire report. I’m not that confident that Ajeya also believes the things I’m saying (maybe I’m 80% confident).
To me this really sounds like it’s saying the horizon length = the number of subjective seconds per sample during training. [...]
I agree that the definition used in the report does seem consistent with that. I think that’s mostly because the report assumes that you are training a model to perform a single (transformative) task, and so a definition in terms of the model is equivalent to a definition in terms of the task. The report doesn’t really talk about the unsupervised pretraining approach, so its definitions didn’t have to handle that case.
But like, irrespective of what Ajeya meant, I think the important concept would be task-based. You would want to have different timelines for “when a neural net can do human-level summarization” and “when a neural net can be a human-level personal assistant”, even if you expect to use unsupervised pretraining for both. The only parameter in the model that can plausibly do that is the horizon length. If you don’t use the horizon length for that purpose, I think you should have some other way of incorporating “difficulty of the task” into your timelines.
Exactly. I think this is what humans do too, to a large extent. I’d be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.
I mean, I’m at 30 / 40 / 10, so that isn’t that much of a difference. Half of the difference could be explained by your 25% on general reasoning, vs my (let’s say) 15% on it.
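A quick sketch of the arithmetic behind "half of the difference", using only the numbers quoted above. It assumes (my reading, not something stated explicitly) that the "difference" is the gap in weight on short horizons, and that extra credence in general reasoning translates point-for-point into extra weight on short horizons. The 15% is the "let's say" figure, not a carefully considered number.

```python
# Credences quoted in the thread, in percentage points.
short_horizon = {"other": 50, "mine": 30}
general_reasoning = {"other": 25, "mine": 15}  # 15 is a "let's say" figure

short_gap = short_horizon["other"] - short_horizon["mine"]
reasoning_gap = general_reasoning["other"] - general_reasoning["mine"]

# Assumption: extra credence in general reasoning lands entirely on short horizons.
explained_fraction = reasoning_gap / short_gap
print(short_gap, reasoning_gap, explained_fraction)
```

Under that assumption, 10 of the 20 points of disagreement on short horizons are accounted for, i.e. half.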
Thanks again. My general impression is that we disagree less than it first appeared, and that our disagreements are currently bottoming out in different intuitions rather than obvious cruxes we can drill down on. Plus I’m getting tired. ;) So, I say we call it a day. To be continued later, perhaps in person, perhaps in future comment chains on future posts!
For the sake of completeness, to answer your questions though:
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
By “hard” I mean something like “Difficult to get AIs to do well.” If we imagine all the tasks we can get AIs to do lined up by difficulty, there is some transformative task A which is least difficult. As the tasks we succeed at getting AIs to do get harder and harder, we must be getting closer to A. I think that getting an AI to do well on all the benchmarks we throw at it despite not being trained for any of them (but rather just trained to predict random internet text) seems like a sign that we are getting close to A. You say this is because I believe in realism about rationality; I hope not, since I don’t believe in realism about rationality. Maybe there’s a contradiction in my views then which you have pointed to, but I don’t see it yet.
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
At this point I feel the need to break things down into premise-conclusion form because I am feeling confused about how the various bits of your argument are connecting to each other. I realize this is a big ask, so don’t feel any particular pressure to do it.
I totally agree that evolution wasn’t optimizing just for power-to-weight ratio. But I never claimed that it was. I don’t think that my comparison relied on the assumption that evolution was optimizing for power-to-weight ratio. By contrast, you explicitly said “presumably evolution was also going for compute-optimal performance.” Once we reject that claim, my original point stands that it’s not clear how we should apply the scaling laws to the human brain, since the scaling laws are about compute-optimal performance, i.e. how you should trade off size and training time if all you care about is minimizing compute. Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape), the laws aren’t directly relevant. In other words, for all we know, if the human brain were 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot fewer if you buy the idea that evolutionary history is part of its training data.