First of all, yeah, as far as I can tell you and I agree on everything in the OP. Like I said, this disagreement is an aside.
Now that you mention it / I think about it more, there’s another strong point to add to the argument I sketched in part 3: Insofar as our NN’s aren’t data-efficient, it’ll take more compute to train them, and so even if TAI need not be data-efficient, short-timelines-TAI must be. (Because in the short term, we don’t have much more compute. I’m embarrassed I didn’t notice this earlier and include it in the argument.) That helps the argument a lot; it means that all the argument has to do is establish that we aren’t going to get more data-efficient NN’s anytime soon.
And yeah, I agree the scaling laws are a great source of evidence about this. I had them in mind when I wrote the argument in part 3. I guess I’m just not as convinced as you (?) that (a) when we are routinely training NN’s with 10^15 params, it’ll take roughly 10^15 data points to get to a useful level of performance, and (b) average horizon length for the data points will need to be more than short.
Some reasons I currently doubt (a):
--A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques.
--The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally. It could be that at 10^15 params and 10^15 data points, performance is actually much higher than merely useful; maybe only 10^13 params and 10^13 data points would be the first to cross the usefulness threshold. (Counterpoint: Extrapolating GPT performance trends on text prediction suggests it wouldn’t be human-level at text prediction until about 10^15 params and 10^15 data points, according to data I got from Lanrian. Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10^15/10^15 would be the far-right edge of the graph).
Some reasons I currently doubt (b):
--I’ve been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.
--I think that humans have a tiny horizon length—our brains are constantly updating, right? I guess it’s hard to make the comparison, given how it’s an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that’s all you need.
--Having a small average horizon length doesn’t preclude also training lots on long-horizon tasks. It just means that on average your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.
I’m very uncertain about all of this and would love to hear your thoughts, which is why I asked. :)
Now that you mention it / I think about it more, there’s another strong point to add to the argument I sketched in part 3: Insofar as our NN’s aren’t data-efficient, it’ll take more compute to train them, and so even if TAI need not be data-efficient, short-timelines-TAI must be.
Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.
--A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques.
Note that Ajeya’s report does have a term for “algorithmic efficiency”, that has a doubling time of 2-3 years.
Certainly “several OOMs using tricks and techniques we could implement in a year” would be way faster than that trend, but you’ve really got to wonder why these people haven’t done it yet—if I interpret “several OOMs” as “at least 3 OOMs”, that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I’ll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.
Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years—if so, this seems plausibly consistent with the 2-3 year doubling time.
--The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally.
Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make. I agree there’s uncertainty here, but I don’t see why the uncertainty should bias us towards shorter timelines rather than longer timelines.
I could see it if we thought we were better than evolution, since then we could say “we’d figure something out that evolution missed and that would bias towards short timelines”; but this is also something that Ajeya considered and iirc she then estimated that evolution tended to be ~10x better than us (lots of caveats here though).
Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10^15/10^15 would be the far-right edge of the graph
Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of “transformative AI”. The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.
(b) average horizon length for the data points will need to be more than short.
I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has changed some other parameters, though not hugely iirc) and the median I get is 2041 (about 10 years lower than what it was previously). So I don’t think this is making a huge difference (though certainly 10 years is substantial).
--I’ve been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.
I see horizon length (as used in the report) as a function of a task, so “horizon length of GPT-3” feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw around 100 million “effective examples” of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.
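As a quick sanity check on the “effective examples” figure above, here is a back-of-the-envelope sketch using only the two numbers given in the comment (300 billion tokens, a 2048-token context window). It comes out a bit under 150 million, which is the same order of magnitude as the “around 100 million” quoted.

```python
# Back-of-the-envelope check of the "effective examples" figure.
TOKENS_SEEN = 300e9     # training tokens, as stated in the comment
CONTEXT_WINDOW = 2048   # tokens per context window, as stated in the comment

# Number of non-overlapping context-window-sized chunks in the training data.
effective_examples = TOKENS_SEEN / CONTEXT_WINDOW

print(f"{effective_examples:.3g} context-window-sized examples")
```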
--I think that humans have a tiny horizon length—our brains are constantly updating, right? I guess it’s hard to make the comparison, given how it’s an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that’s all you need.
Again, this feels like a type error to me. Horizon length isn’t about the optimization algorithm, it’s about the task.
(You can of course define your own version of “horizon length” that’s about the optimization algorithm, but then I think you need to have some way of incorporating the “difficulty” of a transformative task into your timelines estimate, given that the scaling laws are all calculated on “easy” tasks.)
--Having a small average horizon length doesn’t preclude also training lots on long-horizon tasks. It just means that on average your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.
Agree with this. I remember mentioning this to Ajeya but I don’t actually remember what the conclusion was.
EDIT: Oh, I remember now. The argument I was making is that you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason in me putting as much weight on short horizons as I did; I think this was also true for Ajeya.
Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.
I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
Certainly “several OOMs using tricks and techniques we could implement in a year” would be way faster than that trend, but you’ve really got to wonder why these people haven’t done it yet—if I interpret “several OOMs” as “at least 3 OOMs”, that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I’ll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.
Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years—if so, this seems plausibly consistent with the 2-3 year doubling time.
Some of the people I talked to said about 2 OOMs, others expressed it differently, saying that the faster scaling law can be continued past the kink point predicted by Kaplan et al. Still others simply said that GPT-3 was done in a deliberately simple, non-cutting-edge way to prove a point and that it could have used its compute much more compute-efficiently if they threw the latest bags of tricks at it. I am skeptical of all this, of course, but perhaps less skeptical than you? 2 OOMs is 7 doublings, which will happen around 2037 according to Ajeya. Would you be willing to take a 10:1 bet that there won’t be something as good as GPT-3 trained on 2 OOMs less compute by 2030? I think I’d take the other side of that bet.
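For what it’s worth, the “2 OOMs is 7 doublings” arithmetic can be sketched as below. The 2-3 year doubling time is the one mentioned earlier in the thread; the base year of 2020 is my assumption (it is not stated anywhere above), so the calendar years are only illustrative.

```python
import math

def years_for_ooms(ooms, doubling_time_years):
    """Years needed to gain `ooms` orders of magnitude at a fixed doubling time."""
    doublings = ooms * math.log2(10)  # one OOM is log2(10), about 3.32 doublings
    return doublings * doubling_time_years

BASE_YEAR = 2020  # assumption: not stated in the thread

print(f"2 OOMs is about {2 * math.log2(10):.2f} doublings")
for doubling_time in (2.0, 2.5, 3.0):
    years = years_for_ooms(2, doubling_time)
    print(f"doubling time {doubling_time} yr: {years:.1f} years, "
          f"i.e. around {BASE_YEAR + years:.0f}")
```

With a doubling time in the middle of the 2-3 year range, this lands in the mid-to-late 2030s, which seems consistent with the “around 2037” figure.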
Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make.
I don’t think evolution was going for compute-optimal performance in the relevant sense. With AI, we can easily trade off between training models longer and making models bigger, and according to the scaling laws it seems like we should increase training time by 0.75 OOMs for every OOM of parameter count increase. With biological systems, sure maybe it is true that if you faced a trade-off where you were trying to minimize total number of neuron firings over the course of the organism’s childhood, the right ratio would be 0.75 OOMs of extra childhood duration for every 1 OOM of extra synapses… maybe. But even if this were true, it’s pretty non-obvious that that’s the trade-off regime evolution faces. There are all sorts of other pros and cons associated with more synapses and longer childhoods. For example, maybe evolution finds it easier to increase synapse count than to increase childhood, because increased childhood reduces fitness significantly (more chances to die before you reproduce, longer doubling time of population).
Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of “transformative AI”. The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.
Yeah, sorry, by useful I meant useful for transformative tasks.
Yes, obviously the tasks in the graph are not transformative. But it seems to me to be… like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it’s because they’ve developed general intelligence in the relevant sense. Or maybe they haven’t but it’s a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture. Like, yeah those tasks are “particularly easy” compared to taking over the world, but they are also incredibly hard in some sense; IIRC GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25%, random chance, and I bet most English-speaking literate humans in the world today would have done worse than 50%.
I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has changed some other parameters, though not hugely iirc) and the median I get is 2041 (about 10 years lower than what it was previously). So I don’t think this is making a huge difference (though certainly 10 years is substantial).
Huh. When I put 100% mass on short horizon in my version of Ajeya’s model, it says median 2031. Admittedly, I had made some changes to some other parameters too, also not hugely iirc. I wonder if this means those other-parameter changes matter more than I’d thought.
I see horizon length (as used in the report) as a function of a task, so “horizon length of GPT-3” feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw around 100 million “effective examples” of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.
Huh, that’s totally not how I saw it. From Ajeya’s report:
I’ll define the “effective horizon length” of an ML problem as the amount of data it takes (on average) to tell whether a perturbation to the model improves performance or worsens performance. If we believe that the number of “samples” required to train a model of size P is given by KP, then the number of subjective seconds that would be required should be given by HKP, where H is the effective horizon length expressed in units of “subjective seconds per sample.”
To me this really sounds like it’s saying the horizon length = the number of subjective seconds per sample during training. So, maybe it makes sense to talk about “horizon length of task X” (i.e. number of subjective seconds per sample during training of a typical ML model on that task) but it seems to make even more sense to talk about “horizon length of model X” since model X actually had a training run and actually had an average number of subjective seconds per sample.
But I’m happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.
At any rate, deferring to you on this doesn’t undermine the point I was making at all, as far as I can tell.
you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason in me putting as much weight on short horizons as I did; I think this was also true for Ajeya.
Exactly. I think this is what humans do too, to a large extent. I’d be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.
I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
I actually don’t remember what I meant to convey with that :/
Would you be willing to take a 10:1 bet that there won’t be something as good as GPT-3 trained on 2 OOMs less compute by 2030?
No, I’d also take the other side of the bet. A few reasons:
Estimated algorithmic efficiency in the report is low because researchers are not currently optimizing for “efficiency on a transformative task”, whereas researchers probably are optimizing for “efficiency of GPT-3 style systems”, suggesting faster improvements in algorithmic efficiency for GPT-3 than estimated in the report.
90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report.
(Note that 2 OOMs in 10 years seems significantly different from “we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques”. I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)
I don’t think evolution was going for compute-optimal performance in the relevant sense.
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
I think you’d need to argue that there is a specific other property that evolution was optimizing for, that clearly trades off against compute-efficiency, to argue that we should expect that in this case evolution was worse than in other cases.
But it seems to me to be… like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it’s because they’ve developed general intelligence in the relevant sense.
This seems like it is realist about rationality, which I mostly don’t buy. Still, 25% doesn’t seem crazy, I’d probably put 10 or 20% on it myself. But even at 25% that seems pretty consistent with my timelines; 25% does not make the median.
Or maybe they haven’t but it’s a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture.
Why aren’t we already using the most sophisticated training regime and architecture? I agree it will continue to improve, but that’s already what the model does.
GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25%, random chance, and I bet most English-speaking literate humans in the world today would have done worse than 50%.
I don’t particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren’t optimized for that.
I expect that Google search beats GPT-3 on that dataset.
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
But I’m happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.
Er, note that I’ve talked to Ajeya for like an hour or two on the entire report. I’m not that confident that Ajeya also believes the things I’m saying (maybe I’m 80% confident).
To me this really sounds like it’s saying the horizon length = the number of subjective seconds per sample during training. [...]
I agree that the definition used in the report does seem consistent with that. I think that’s mostly because the report assumes that you are training a model to perform a single (transformative) task, and so a definition in terms of the model is equivalent to a definition in terms of the task. The report doesn’t really talk about the unsupervised pretraining approach so its definitions didn’t have to handle that case.
But like, irrespective of what Ajeya meant, I think the important concept would be task-based. You would want to have different timelines for “when a neural net can do human-level summarization” and “when a neural net can be a human-level personal assistant”, even if you expect to use unsupervised pretraining for both. The only parameter in the model that can plausibly do that is the horizon length. If you don’t use the horizon length for that purpose, I think you should have some other way of incorporating “difficulty of the task” into your timelines.
Exactly. I think this is what humans do too, to a large extent. I’d be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.
I mean, I’m at 30 / 40 / 10, so that isn’t that much of a difference. Half of the difference could be explained by your 25% on general reasoning, vs my (let’s say) 15% on it.
Thanks again. My general impression is that we disagree less than it first appeared, and that our disagreements are currently bottoming out in different intuitions rather than obvious cruxes we can drill down on. Plus I’m getting tired. ;) So, I say we call it a day. To be continued later, perhaps in person, perhaps in future comment chains on future posts!
For the sake of completeness, to answer your questions though:
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
By “hard” I mean something like “Difficult to get AIs to do well.” If we imagine all the tasks we can get AIs to do lined up by difficulty, there is some transformative task A which is least difficult. As the tasks we succeed at getting AIs to do get harder and harder, we must be getting closer to A. I think that getting an AI to do well on all the benchmarks we throw at it despite not being trained for any of them (but rather just trained to predict random internet text) seems like a sign that we are getting close to A. You say this is because I believe in realism about rationality; I hope not, since I don’t believe in realism about rationality. Maybe there’s a contradiction in my views then which you have pointed to, but I don’t see it yet.
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
At this point I feel the need to break things down into premise-conclusion form because I am feeling confused about how the various bits of your argument are connecting to each other. I realize this is a big ask, so don’t feel any particular pressure to do it.
I totally agree that evolution wasn’t optimizing just for power-to-weight ratio. But I never claimed that it was. I don’t think that my comparison relied on the assumption that evolution was optimizing for power-to-weight ratio. By contrast, you explicitly said “presumably evolution was also going for compute-optimal performance.” Once we reject that claim, my original point stands that it’s not clear how we should apply the scaling laws to the human brain, since the scaling laws are about compute-optimal performance, i.e. how you should trade off size and training time if all you care about is minimizing compute. Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren’t directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data.
The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance.
The scaling laws from the Kaplan et al papers do tell you this.
The relevant law is L(N,D), for the early-stopped test loss given parameter count N and data size D. It has the functional form
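$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$
with $\alpha_N \sim 0.076$, $\alpha_D \sim 0.095$.
The result that you should scale $D \propto N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.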
This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source). It’s more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.
You always can train models that are “too large” on datasets that are “too small” according to the heuristic, and they won’t diverge or do poorly or anything. They just won’t improve much upon the results of smaller models.
In terms of the above, you are setting N∼10^15 and then asking what D ought to be. If the heuristic gives you an answer that seems very high, that doesn’t mean the model is “not as data efficient as you expected.” Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.
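To make the “too large a model on too small a dataset” point concrete, here is a minimal sketch of the L(N,D) form above. The two exponents are the ones quoted in this comment; the constants `N_C` and `D_C` are illustrative placeholders that I am assuming for the sake of the example, not values given here. Note also that with the rounded exponents quoted above, the balance exponent α_N/α_D comes out at 0.8 rather than 0.74; I’m assuming the 0.74 reflects slightly different fitted values in the paper.

```python
ALPHA_N = 0.076  # exponent quoted in the comment
ALPHA_D = 0.095  # exponent quoted in the comment
N_C = 8.8e13     # assumption: illustrative placeholder constant
D_C = 5.4e13     # assumption: illustrative placeholder constant

def loss(n, d):
    """Early-stopped test loss L(N, D) in the functional form given above."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

def balanced_data(n):
    """Dataset size at which the two terms inside the brackets are equal."""
    return D_C * (n / N_C) ** (ALPHA_N / ALPHA_D)

# Fix the dataset at the size that balances a 1e12-parameter model,
# then see how much a 1000x larger model helps on that same dataset.
d_fixed = balanced_data(1e12)
small = loss(1e12, d_fixed)
large = loss(1e15, d_fixed)
print(f"balance exponent alpha_N / alpha_D = {ALPHA_N / ALPHA_D:.2f}")
print(f"loss at N=1e12: {small:.3f}; loss at N=1e15: {large:.3f}")
print(f"relative improvement from 1000x more params: {1 - large / small:.1%}")
```

Under this functional form the improvement is capped at a factor of 2^(-α_D) however large N gets, because at best the N-dependent term vanishes and the data term is left on its own.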
I find it more intuitive to think about the following, both discussed in the papers:
L(D), the N→∞ limit of L(N,D)
meaning: the peak data efficiency possible with this model class
L(N), the D→∞ limit of L(N,D)
meaning: the scaling of loss with parameters when not data-constrained but still using early stopping
If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.
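The two limits can be sketched numerically as follows. This reuses the functional form and exponents given above; `N_C`, `D_C`, and the target loss of 1.5 are hypothetical values assumed purely for illustration, so the printed magnitudes should not be read as predictions.

```python
import math

ALPHA_N = 0.076  # exponent quoted in the comment
ALPHA_D = 0.095  # exponent quoted in the comment
N_C = 8.8e13     # assumption: illustrative placeholder constant
D_C = 5.4e13     # assumption: illustrative placeholder constant

def loss(n, d):
    """L(N, D) in the functional form given above."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

def loss_params_limited(n):
    """L(N): the D -> infinity limit, where only the N term survives."""
    return (N_C / n) ** ALPHA_N

def loss_data_limited(d):
    """L(D): the N -> infinity limit, where only the D term survives."""
    return (D_C / d) ** ALPHA_D

def params_needed(target_loss):
    """Invert L(N): parameters needed to reach target_loss with unlimited data."""
    return N_C / target_loss ** (1 / ALPHA_N)

def data_needed(target_loss):
    """Invert L(D): data needed to reach target_loss with an unlimited model."""
    return D_C / target_loss ** (1 / ALPHA_D)

# The full formula approaches each limit when the other resource is huge.
assert math.isclose(loss(1e12, 1e30), loss_params_limited(1e12), rel_tol=1e-9)
assert math.isclose(loss(1e40, 1e12), loss_data_limited(1e12), rel_tol=1e-9)

TARGET = 1.5  # hypothetical loss threshold, for illustration only
print(f"params needed on the L(N) curve: {params_needed(TARGET):.2e}")
print(f"data needed on the L(D) curve:   {data_needed(TARGET):.2e}")
```

The inverse functions give the minimum N or D needed for a given loss when the other resource is not the constraint, which is the sense in which one of the two curves would provide the binding limit.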
Ultimately, we expect AGI to require some specific-if-unknown level of performance (i.e. crossing some loss threshold L_AGI). Ajeya’s approach essentially assumes that we’ll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.
I’m not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits L_AGI.
Huh, thanks, now I’m more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:
1. In my discussion with Rohin I said:
Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren’t directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data.
Do you agree or disagree? My guess is that you’d disagree, since you say:
If the heuristic gives you an answer that seems very high, that doesn’t mean the model is “not as data efficient as you expected.” Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.
which I take to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don’t think that… OK, yeah, I’m just very confused here, please help!)
2. You say “This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source).” Well, isn’t it both? You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.
3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when we hit AGI we won’t be on the 45-degree line but rather will be constrained by model size or by data and so will be hugging one of the other two lines)
You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.
This is a subtle and confusing thing about the Kaplan et al papers. (It’s also the subject of my post that I linked earlier, so I recommend you check that out.)
There are two things in the papers that could be called “optimal compute budgeting” laws:
A law that assumes a sufficiently large dataset (i.e. an effectively infinite dataset), and tells you how to manage the tradeoff between steps S and params N.
The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size D vs params N.
I said the D vs N law was “not a heuristic for managing compute” because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.
However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train. Comparing the two laws yields the “breakdown” or “kink point.”
Do you agree or disagree? … I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?
Sorry, why do you expect I disagree? I think I agree. But also, I’m not really claiming the scaling laws say or don’t say anything about the brain, I’m just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems). We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.
Perhaps it would help me if I could visualize it in two dimensions
This part is 100% qualitatively accurate, I think. The one exception is that there are two “optimal compute” lines on the plot with different slopes, for the two laws referred to above. But yeah, I’m saying we won’t be on either of those lines, but on the L(N) or the L(D) line.
I’ve read your linked post thrice now, it’s excellent, any remaining confusions are my fault.
I didn’t confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with: “The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance.” was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you’d disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin.
I’m glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Actually, I think I spoke too soon about the visualization… I don’t think your image of L(D) and L(N) is quite right.
Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it’s easy to see indifference curves of the loss.
In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis. That is, in this regime, N doesn’t matter and loss is effectively a function of D alone.
This is L(D).
It looks like the color changes you see if you move horizontally through the upper left region.
Likewise, in the lower right region, D doesn’t matter and loss depends on N alone.
This is L(N).
It looks like the color changes you see if you move vertically through the lower right region.
To restate my earlier claims…
If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.
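To put rough numbers on the "waste of N" point, here is a minimal sketch that evaluates the L(N,D) form from the Kaplan et al papers at the three corners being compared. The exponents are the ones quoted elsewhere in this thread; `N_C` and `D_C` are approximate constants as I recall them from the paper and should be treated as assumptions, so only the qualitative comparison matters.

```python
# Exponents as quoted in this thread; N_C and D_C are assumed, approximate values.
ALPHA_N = 0.076
ALPHA_D = 0.095
N_C = 8.8e13  # assumed critical parameter count
D_C = 5.4e13  # assumed critical dataset size, in tokens


def loss(n_params: float, n_tokens: float) -> float:
    """Early-stopped test loss L(N, D) in the Kaplan et al functional form."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


balanced_small = loss(1e12, 1e12)   # N and D at the lower order of magnitude
oversized_model = loss(1e15, 1e12)  # N scaled up alone, D held fixed
balanced_large = loss(1e15, 1e15)   # N and D scaled up together

print(f"(N, D) = (1e12, 1e12): loss {balanced_small:.3f}")
print(f"(N, D) = (1e15, 1e12): loss {oversized_model:.3f}")
print(f"(N, D) = (1e15, 1e15): loss {balanced_large:.3f}")
```

Under these assumed constants, scaling N alone buys a much smaller improvement than scaling N and D together, which is the sense in which the extra three OOMs of parameters are mostly wasted.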
On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.
When I said that it’s intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach. And that’s going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
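One way to make "whichever is more restrictive" precise: in the L(N,D) form, the loss at any (N, D) is at least the larger of L(N) and L(D), and at most 2^αD times that (roughly 7% higher for αD ≈ 0.095), because a sum of two positive terms lies between the larger term and twice the larger term. The sketch below checks this numerically. The constants `N_C` and `D_C` are assumed, approximate values, not something established in this thread.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095  # exponents as quoted in this thread
N_C, D_C = 8.8e13, 5.4e13        # assumed, approximate constants


def loss_n(n_params: float) -> float:
    """L(N): the D -> infinity limit of L(N, D)."""
    return (N_C / n_params) ** ALPHA_N


def loss_d(n_tokens: float) -> float:
    """L(D): the N -> infinity limit of L(N, D)."""
    return (D_C / n_tokens) ** ALPHA_D


def loss(n_params: float, n_tokens: float) -> float:
    """Full L(N, D) in the Kaplan et al functional form."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D


def loss_bounds(n_params: float, n_tokens: float) -> tuple[float, float]:
    """Lower and upper bounds on L(N, D) from the more restrictive limit."""
    floor = max(loss_n(n_params), loss_d(n_tokens))
    return floor, (2 ** ALPHA_D) * floor


for n in (1e9, 1e12, 1e15):
    for d in (1e9, 1e12, 1e15):
        lo, hi = loss_bounds(n, d)
        print(f"N={n:.0e} D={d:.0e}: {lo:.3f} <= {loss(n, d):.3f} <= {hi:.3f}")
```

So if D can never exceed some cap, the reachable loss is pinned near L(D) at that cap no matter how large N gets, and vice versa.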
Asking “what could we do with a N=1e15 model?” (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region … or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.
In Ajeya’s work, this question means “let’s assume we’re using an N=1e15 model, and then let’s assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let’s figure out how big D has to be to get there.”
So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as “the performance which you could only reach with N=1e15 params”.
What feels weird to me, which you touched on above, is the way this lets the scaling relations “backseat drive” the definition of sufficient quality for AGI. Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it… we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!
Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10e15 params or so, it must be doing some fairly important tasks which can’t be done (at least not as well) for much less than 10e15 params. Like, maybe a 10e13 param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t that big a deal, such that we can be fairly confident that these tasks require a NN of 10e13 or more params.
The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to “within a few OOMs of 10e15.”
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10e15 params is a task which requires 10e15 data points also. Because otherwise we could reduce parameter count below 10e15 and keep the same performance.
So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.
But I am now intensely curious as to how many “data points” the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10e15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.
Is the second possibility plausible? I guess so. There are roughly 10^9 seconds in a human lifetime, so if you are processing a million data points a second… Huh, that seems a bit much.
What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that’s why you need more data to do better—but that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing—“only” a billion data points in a lifetime, but really high-quality ones and good mechanisms for focusing on the right stuff to update on of all your sensory data coming in?
And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10e15 params is a task which requires 10e15 data points also.
In the picture, it looks like there’s something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors’ choice of units.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems “have the same scaling law.” Scaling is about relationships between differences, not relationships between absolute magnitudes.
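The units point can be checked directly. In the sketch below, measuring data in contexts of ~1e3 tokens instead of single tokens rescales both D and the fitted constant D_c by the same factor, so every loss value is unchanged, while the D/N ratio at which the two terms balance shifts by that factor. `N_C` and `D_C_TOKENS` are assumed, approximate constants; the 1e3 tokens-per-pass figure is the one used above.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095  # exponents as quoted in this thread
N_C = 8.8e13                     # assumed, approximate constant
D_C_TOKENS = 5.4e13              # assumed, approximate constant (data in tokens)
TOKENS_PER_CONTEXT = 1e3         # "~1e3 tokens" per forward pass


def loss(n_params: float, data: float, d_c: float) -> float:
    """L(N, D) with the data unit left free; d_c must use the same unit as data."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + d_c / data) ** ALPHA_D


def balanced_data(n_params: float, d_c: float) -> float:
    """Data size at which the two terms of L(N, D) are equal."""
    return d_c * (n_params / N_C) ** (ALPHA_N / ALPHA_D)


n = 1e15
d_tokens = 1e15
d_contexts = d_tokens / TOKENS_PER_CONTEXT
d_c_contexts = D_C_TOKENS / TOKENS_PER_CONTEXT

loss_in_tokens = loss(n, d_tokens, D_C_TOKENS)
loss_in_contexts = loss(n, d_contexts, d_c_contexts)

ratio_tokens = balanced_data(n, D_C_TOKENS) / n
ratio_contexts = balanced_data(n, d_c_contexts) / n

print(f"loss (tokens):   {loss_in_tokens:.6f}")
print(f"loss (contexts): {loss_in_contexts:.6f}")
print(f"balanced D/N in tokens:   {ratio_tokens:.3g}")
print(f"balanced D/N in contexts: {ratio_contexts:.3g}")
```

The heat map would look identical in either unit; only the tick labels on the data axis move.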
On the larger topic, I’m pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for “a data point” is. This is mostly for “Could a Neuroscientist Understand a Microprocessor?”-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
Holy shit, mind blown! Then… how are the scaling laws useful at all then? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training?”) Because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * “horizon length.”
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this the human brain actually is getting ~10^7 bits of data every second, although the highest level conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count -- 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250 to 1000 times per second, which means 10^11.5 or so… actually this more or less checks out I’d say. Assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
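The back-of-envelope arithmetic here is easy to reproduce. This sketch only multiplies out the figures quoted in the thread (10^9 seconds per lifetime, ~10^7 bits per second, 250 to 1000 firings per second); none of the inputs are independently checked.

```python
import math

LIFETIME_SECONDS = 1e9           # order of magnitude used earlier in the thread
BITS_PER_SECOND = 1e7            # sensory input rate quoted above
MAX_FIRING_RATES_HZ = (250, 1000)  # range of maximum firing rates quoted above

# "Tokens" definition: one data point per bit of input.
lifetime_bits = BITS_PER_SECOND * LIFETIME_SECONDS

# "Single pass" definition: one data point per firing at the maximum rate.
passes_low = MAX_FIRING_RATES_HZ[0] * LIFETIME_SECONDS
passes_high = MAX_FIRING_RATES_HZ[1] * LIFETIME_SECONDS

print(f"lifetime bits:   10^{math.log10(lifetime_bits):.1f}")
print(f"lifetime passes: 10^{math.log10(passes_low):.1f} to 10^{math.log10(passes_high):.1f}")
```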
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience, just as GPT-3 can “few-shot learn” totally new tasks, and also “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!) but really what’s going on is just transfer learning from its vast pre-training experience.
I intended to mean something similar to what Ajeya meant in her report:
I’ll define the “effective horizon length” of an ML problem as the amount of data it takes (on average) to tell whether a perturbation to the model improves performance or worsens performance. If we believe that the number of “samples” required to train a model of size P is given by KP, then the number of subjective seconds that would be required should be given by HKP, where H is the effective horizon length expressed in units of “subjective seconds per sample.”
To be clear, I’m still a bit confused about the concept of horizon length. I’m not sure it’s a good idea to think about things this way. But it seems reasonable enough for now.
OK, so here is a fuller response:
First of all, yeah, as far as I can tell you and I agree on everything in the OP. Like I said, this disagreement is an aside.
Now that you mention it / I think about it more, there’s another strong point to add to the argument I sketched in part 3: Insofar as our NN’s aren’t data-efficient, it’ll take more compute to train them, and so even if TAI need not be data-efficient, short-timelines-TAI must be. (Because in the short term, we don’t have much more compute. I’m embarrassed I didn’t notice this earlier and include it in the argument.) That helps the argument a lot; it means that all the argument has to do is establish that we aren’t going to get more data-efficient NN’s anytime soon.
And yeah, I agree the scaling laws are a great source of evidence about this. I had them in mind when I wrote the argument in part 3. I guess I’m just not as convinced as you (?) that (a) when we are routinely training NN’s with 10e15 params, it’ll take roughly 10e15 data points to get to a useful level of performance, and (b) average horizon length for the data points will need to be more than short.
Some reasons I currently doubt (a):
--A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques.
--The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally. It could be that at 10e15 params and 10e15 data points, performance is actually much higher than merely useful; maybe only 10e13 params and 10e13 data points would be the first to cross the usefulness threshold. (Counterpoint: Extrapolating GPT performance trends on text prediction suggests it wouldn’t be human-level at text prediction until about 10e15 params and 10e15 data points, according to data I got from Lanrian. Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10e15/10e15 would be the far-right edge of the graph).
Some reasons I currently doubt (b):
--I’ve been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.
--I think that humans have a tiny horizon length—our brains are constantly updating, right? I guess it’s hard to make the comparison, given how it’s an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that’s all you need.
--Having a small average horizon length doesn’t preclude also training lots on long-horizon tasks. It just means that on average your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.
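To illustrate the last point with made-up numbers: the sketch below mixes a very large number of one-second samples with a million multi-day samples and computes the sample-weighted average horizon. All four quantities are hypothetical, chosen only to show that the average can stay close to the short horizon.

```python
SECONDS_PER_DAY = 86_400

# Hypothetical training mix; none of these numbers come from the report.
short_samples = 1e12                      # predict-the-next-input style samples
short_horizon_s = 1.0                     # about one subjective second each
long_samples = 1e6                        # plan-and-execute style samples
long_horizon_s = 3.0 * SECONDS_PER_DAY    # about three subjective days each

total_samples = short_samples + long_samples
total_seconds = short_samples * short_horizon_s + long_samples * long_horizon_s
mean_horizon_s = total_seconds / total_samples

print(f"long-horizon samples: {long_samples:.0e}")
print(f"average horizon: {mean_horizon_s:.2f} subjective seconds")
```

Note that the long-horizon samples still account for a noticeable share of the total subjective seconds here, so "short on average" does not mean they are free.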
I’m very uncertain about all of this and would love to hear your thoughts, which is why I asked. :)
Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.
Note that Ajeya’s report does have a term for “algorithmic efficiency”, that has a doubling time of 2-3 years.
Certainly “several OOMs using tricks and techniques we could implement in a year” would be way faster than that trend, but you’ve really got to wonder why these people haven’t done it yet—if I interpret “several OOMs” as “at least 3 OOMs”, that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I’ll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.
Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years—if so, this seems plausibly consistent with the 2-3 year doubling time.
Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make. I agree there’s uncertainty here, but I don’t see why the uncertainty should bias us towards shorter timelines rather than longer timelines.
I could see it if we thought we were better than evolution, since then we could say “we’d figure something out that evolution missed and that would bias towards short timelines”; but this is also something that Ajeya considered and iirc she then estimated that evolution tended to be ~10x better than us (lots of caveats here though).
Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of “transformative AI”. The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.
I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has changed some other parameters, though not hugely iirc) and the median I get is 2041 (about 10 years lower than what it was previously). So I don’t think this is making a huge difference (though certainly 10 years is substantial).
I see horizon length (as used in the report) as a function of a task, so “horizon length of GPT-3” feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw around 100 million “effective examples” of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.
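For reference, the "effective examples" figure is just tokens seen divided by context length. Using the two numbers stated above, it comes out at about 1.5e8, which is the same order of magnitude as the "around 100 million" quoted.

```python
TOKENS_SEEN = 300_000_000_000  # 300 billion training tokens, as stated above
CONTEXT_WINDOW = 2048          # tokens per context, as stated above

effective_examples = TOKENS_SEEN / CONTEXT_WINDOW

print(f"effective examples of size {CONTEXT_WINDOW}: {effective_examples:.3g}")
```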
Again, this feels like a type error to me. Horizon length isn’t about the optimization algorithm, it’s about the task.
(You can of course define your own version of “horizon length” that’s about the optimization algorithm, but then I think you need to have some way of incorporating the “difficulty” of a transformative task into your timelines estimate, given that the scaling laws are all calculated on “easy” tasks.)
Agree with this. I remember mentioning this to Ajeya but I don’t actually remember what the conclusion was.
EDIT: Oh, I remember now. The argument I was making is that you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason in me putting as much weight on short horizons as I did; I think this was also true for Ajeya.
Thanks for the detailed reply!
I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
Some of the people I talked to said about 2 OOMs, others expressed it differently, saying that the faster scaling law can be continued past the kink point predicted by Kaplan et al. Still others simply said that GPT-3 was done in a deliberately simple, non-cutting-edge way to prove a point and that it could have used its compute much more efficiently if they threw the latest bags of tricks at it. I am skeptical of all this, of course, but perhaps less skeptical than you? 2 OOMs is 7 doublings, which will happen around 2037 according to Ajeya. Would you be willing to take a 10:1 bet that there won’t be something as good as GPT-3 trained on 2 OOMs less compute by 2030? I think I’d take the other side of that bet.
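The conversion from OOMs to doublings and years is worth spelling out. This sketch uses the 2 to 3 year doubling time for algorithmic efficiency mentioned earlier in the thread; the helper name is mine and the start year is deliberately left out, since it depends on when you start counting.

```python
import math


def years_needed(ooms: float, doubling_time_years: float) -> float:
    """Years for an efficiency gain of `ooms` orders of magnitude at a fixed doubling time."""
    doublings = ooms * math.log2(10)
    return doublings * doubling_time_years


doublings_for_2_ooms = 2 * math.log2(10)
print(f"2 OOMs = {doublings_for_2_ooms:.2f} doublings")

for doubling_time in (2.0, 2.5, 3.0):
    print(f"doubling every {doubling_time} years: {years_needed(2, doubling_time):.1f} years")
```

So "7 doublings" is a slight round-up, and the implied date moves by several years depending on where in the 2 to 3 year range the doubling time falls.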
I don’t think evolution was going for compute-optimal performance in the relevant sense. With AI, we can easily trade off between training models longer and making models bigger, and according to the scaling laws it seems like we should increase training time by 0.75 OOMs for every OOM of parameter count increase. With biological systems, sure maybe it is true that if you faced a trade-off where you were trying to minimize total number of neuron firings over the course of the organism’s childhood, the right ratio would be 0.75 OOMs of extra childhood duration for every 1 OOM of extra synapses… maybe. But even if this were true, it’s pretty non-obvious that that’s the trade-off regime evolution faces. There are all sorts of other pros and cons associated with more synapses and longer childhoods. For example, maybe evolution finds it easier to increase synapse count than to increase childhood, because increased childhood reduces fitness significantly (more chances to die before you reproduce, longer doubling time of population).
Yeah, sorry, by useful I meant useful for transformative tasks.
Yes, obviously the tasks in the graph are not transformative. But it seems to me to be… like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it’s because they’ve developed general intelligence in the relevant sense. Or maybe they haven’t but it’s a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture. Like, yeah those tasks are “particularly easy” compared to taking over the world, but they are also incredibly hard in some sense; IIRC GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25%, random chance, and I bet most English-speaking literate humans in the world today would have done worse than 50%.
Huh. When I put 100% mass on short horizon in my version of Ajeya’s model, it says median 2031. Admittedly, I had made some changes to some other parameters too, also not hugely iirc. I wonder if this means those other-parameter changes matter more than I’d thought.
Huh, that’s totally not how I saw it. From Ajeya’s report:
To me this really sounds like it’s saying the horizon length = the number of subjective seconds per sample during training. So, maybe it makes sense to talk about “horizon length of task X” (i.e. number of subjective seconds per sample during training of a typical ML model on that task) but it seems to make even more sense to talk about “horizon length of model X” since model X actually had a training run and actually had an average number of subjective seconds per sample.
But I’m happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.
At any rate, deferring to you on this doesn’t undermine the point I was making at all, as far as I can tell.
Exactly. I think this is what humans do too, to a large extent. I’d be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.
I actually don’t remember what I meant to convey with that :/
No, I’d also take the other side of the bet. A few reasons:
Estimated algorithmic efficiency in the report is low because researchers are not currently optimizing for “efficiency on a transformative task”, whereas researchers probably are optimizing for “efficiency of GPT-3 style systems”, suggesting faster improvements in algorithmic efficiency for GPT-3 than estimated in the report.
90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report.
(Note that 2 OOMs in 10 years seems significantly different from “we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques”. I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)
I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?
I think you’d need to argue that there is a specific other property that evolution was optimizing for, that clearly trades off against compute-efficiency, to argue that we should expect that in this case evolution was worse than in other cases.
This seems like it is realist about rationality, which I mostly don’t buy. Still, 25% doesn’t seem crazy, I’d probably put 10 or 20% on it myself. But even at 25% that seems pretty consistent with my timelines; 25% does not make the median.
Why aren’t we already using the most sophisticated training regime and architecture? I agree it will continue to improve, but that’s already what the model does.
I don’t particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren’t optimized for that.
I expect that Google search beats GPT-3 on that dataset.
I don’t really know what you mean when you say that this task is “hard”. Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
Er, note that I’ve talked to Ajeya for like an hour or two on the entire report. I’m not that confident that Ajeya also believes the things I’m saying (maybe I’m 80% confident).
I agree that the definition used in the report does seem consistent with that. I think that’s mostly because the report assumes that you are training a model to perform a single (transformative) task, and so a definition in terms of the model is equivalent to a definition in terms of the task. The report doesn’t really talk about the unsupervised pretraining approach so its definitions didn’t have to handle that case.
But like, irrespective of what Ajeya meant, I think the important concept would be task-based. You would want to have different timelines for “when a neural net can do human-level summarization” and “when a neural net can be a human-level personal assistant”, even if you expect to use unsupervised pretraining for both. The only parameter in the model that can plausibly do that is the horizon length. If you don’t use the horizon length for that purpose, I think you should have some other way of incorporating “difficulty of the task” into your timelines.
I mean, I’m at 30 / 40 / 10, so that isn’t that much of a difference. Half of the difference could be explained by your 25% on general reasoning, vs my (let’s say) 15% on it.
Thanks again. My general impression is that we disagree less than it first appeared, and that our disagreements are currently bottoming out in different intuitions rather than obvious cruxes we can drill down on. Plus I’m getting tired. ;) So, I say we call it a day. To be continued later, perhaps in person, perhaps in future comment chains on future posts!
For the sake of completeness, to answer your questions though:
By “hard” I mean something like “Difficult to get AIs to do well.” If we imagine all the tasks we can get AIs to do lined up by difficulty, there is some transformative task A which is least difficult. As the tasks we succeed at getting AIs to do get harder and harder, we must be getting closer to A. I think that getting an AI to do well on all the benchmarks we throw at it despite not being trained for any of them (but rather just trained to predict random internet text) seems like a sign that we are getting close to A. You say this is because I believe in realism about rationality; I hope not, since I don’t believe in realism about rationality. Maybe there’s a contradiction in my views then which you have pointed to, but I don’t see it yet.
At this point I feel the need to break things down into premise-conclusion form because I am feeling confused about how the various bits of your argument are connecting to each other. I realize this is a big ask, so don’t feel any particular pressure to do it.
I totally agree that evolution wasn’t optimizing just for power-to-weight ratio. But I never claimed that it was. I don’t think that my comparison relied on the assumption that evolution was optimizing for power-to-weight ratio. By contrast, you explicitly said “presumably evolution was also going for compute-optimal performance.” Once we reject that claim, my original point stands that it’s not clear how we should apply the scaling laws to the human brain, since the scaling laws are about compute-optimal performance, i.e. how you should trade off size and training time if all you care about is minimizing compute. Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren’t directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data.
The scaling laws from the Kaplan et al papers do tell you this.
The relevant law is $L(N, D)$, the early-stopped test loss given parameter count $N$ and data size $D$. It has the functional form
$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$
with $\alpha_N \sim 0.076$, $\alpha_D \sim 0.095$.
The result that you should scale $D \propto N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.
This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source). It’s more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.
You always can train models that are “too large” on datasets that are “too small” according to the heuristic, and they won’t diverge or do poorly or anything. They just won’t improve much upon the results of smaller models.
In terms of the above, you are setting N ~ 10^15 and then asking what D ought to be. If the heuristic gives you an answer that seems very high, that doesn’t mean the model is “not as data efficient as you expected.” Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N ~ 10^15 rather than using a smaller model to get almost identical performance.
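To make the "too large a model on too small a dataset" point concrete, here is a minimal Python sketch of the functional form quoted above. The exponents are the ones given in this comment; the scale constants `N_C` and `D_C` are not stated anywhere in this thread and are assumed values, and the function names are purely illustrative. Note that with the rounded exponents quoted here the balance exponent comes out as 0.076 / 0.095 = 0.8 rather than 0.74; I believe the paper's joint fit uses a slightly larger value of the data exponent, which would account for the gap, but treat that as an assumption too.

```python
import math

# Exponents as quoted in the comment above.
ALPHA_N = 0.076
ALPHA_D = 0.095
# ASSUMED scale constants (not given in this thread); only their rough size matters here.
N_C = 6.4e13
D_C = 1.8e13


def loss(n_params: float, n_tokens: float) -> float:
    """Early-stopped test loss L(N, D) in the functional form quoted above."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def balanced_data(n_params: float) -> float:
    """Dataset size at which the two terms inside the brackets are equal.

    With the rounded exponents used here this scales as N**0.8, not N**0.74.
    """
    return D_C * (n_params / N_C) ** (ALPHA_N / ALPHA_D)


if __name__ == "__main__":
    fixed_d = 1e12  # hold the dataset fixed and keep growing the model
    for n in (1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15):
        print(f"N={n:.0e}  D={fixed_d:.0e}  L={loss(n, fixed_d):.3f}  "
              f"balanced D={balanced_data(n):.2e}")
```

Under these assumed constants, the loss keeps falling as N grows at fixed D, but the improvement flattens out once N is well past the balanced point, which is the sense in which the extra parameters are "wasted" rather than harmful.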
I find it more intuitive to think about the following, both discussed in the papers:
L(D), the N→∞ limit of L(N,D)
meaning: the peak data efficiency possible with this model class
L(N), the D→∞ limit of L(N,D)
meaning: the scaling of loss with parameters when not data-constrained but still using early stopping
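As a quick check, both limits follow directly from the functional form of L(N,D) quoted earlier. This is only a sketch using the same symbols; nothing is assumed beyond sending one term inside the brackets to zero.

```latex
\begin{align*}
L(D) &= \lim_{N \to \infty} L(N, D)
      = \left[\, 0 + \frac{D_c}{D} \,\right]^{\alpha_D}
      = \left( \frac{D_c}{D} \right)^{\alpha_D}, \\
L(N) &= \lim_{D \to \infty} L(N, D)
      = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + 0 \right]^{\alpha_D}
      = \left( \frac{N_c}{N} \right)^{\alpha_N}.
\end{align*}
```

So each limit is a pure power law in the one resource that is still finite, with its own exponent.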
If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.
Ultimately, we expect AGI to require some specific-if-unknown level of performance (i.e. crossing some loss threshold L_AGI). Ajeya’s approach essentially assumes that we’ll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.
I’m not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits L_AGI.
See also my post here.
Huh, thanks, now I’m more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:
1. In my discussion with Rohin I said:
Do you agree or disagree? My guess is that you’d disagree, since you say:
which I take to mean that you think the human brain could have had almost identical performance with far fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don’t think that… OK, yeah, I’m just very confused here, please help!)
2. You say “This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source).” Well, isn’t it both? You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.
3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when we hit AGI we won’t be on the 45-degree line but rather will be constrained by model size or by data and so will be hugging one of the other two lines)
This is a subtle and confusing thing about the Kaplan et al papers. (It’s also the subject of my post that I linked earlier, so I recommend you check that out.)
There are two things in the papers that could be called “optimal compute budgeting” laws:
A law that assumes a sufficiently large (i.e. effectively infinite) dataset, and tells you how to manage the tradeoff between steps S and params N.
The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size D vs params N.
I said the D vs N law was “not a heuristic for managing compute” because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.
However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train. Comparing the two laws yields the “breakdown” or “kink point.”
Sorry, why do you expect I disagree? I think I agree. But also, I’m not really claiming the scaling laws say or don’t say anything about the brain, I’m just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems). We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.
This part is 100% qualitatively accurate, I think. The one exception is that there are two “optimal compute” lines on the plot with different slopes, for the two laws referred to above. But yeah, I’m saying we won’t be on either of those lines, but on the L(N) or the L(D) line.
I’ve read your linked post thrice now, it’s excellent, any remaining confusions are my fault.
I didn’t confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with, “The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance,” was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you’d disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin.
I’m glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Actually, I think I spoke too soon about the visualization… I don’t think your image of L(D) and L(N) is quite right.
Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it’s easy to see indifference curves of the loss.
https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png
In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis. That is, in this regime, N doesn’t matter and loss is effectively a function of D alone.
This is L(D).
It looks like the color changes you see if you move horizontally through the upper left region.
Likewise, in the lower right region, D doesn’t matter and loss depends on N alone.
This is L(N).
It looks like the color changes you see if you move vertically through the lower right region.
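One way to check the "N doesn't matter here, D doesn't matter there" description without the picture is to estimate the local log-log slopes of the loss in each region. The sketch below does that with finite differences. The exponents are the ones quoted earlier in the thread; the constants `N_C` and `D_C` and the three sample points are assumptions chosen only for illustration.

```python
import math

ALPHA_N, ALPHA_D = 0.076, 0.095   # exponents quoted earlier in the thread
N_C, D_C = 6.4e13, 1.8e13         # ASSUMED constants, not given in the thread


def loss(n, d):
    """L(N, D) in the functional form quoted earlier."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D


def log_slopes(n, d, step=1.01):
    """Finite-difference estimates of d ln L / d ln N and d ln L / d ln D."""
    base = math.log(loss(n, d))
    slope_n = (math.log(loss(n * step, d)) - base) / math.log(step)
    slope_d = (math.log(loss(n, d * step)) - base) / math.log(step)
    return slope_n, slope_d


if __name__ == "__main__":
    # Sample points are illustrative: far above, on, and far below the diagonal.
    for label, n, d in [("upper left (N >> D)", 1e16, 1e10),
                        ("near the diagonal", 1e13, 1e13),
                        ("lower right (D >> N)", 1e10, 1e16)]:
        s_n, s_d = log_slopes(n, d)
        print(f"{label:22s} dlnL/dlnN={s_n:+.4f}  dlnL/dlnD={s_d:+.4f}")
```

Far from the diagonal, one slope should be close to zero and the other close to minus the corresponding exponent, which is what "loss is effectively a function of D alone" (or N alone) means in these coordinates.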
To restate my earlier claims…
If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.
On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.
When I said that it’s intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach. And that’s going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
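The "which target losses can we reach" framing can be sketched by inverting the loss formula: given a target loss, what is the smallest N that could ever reach it, the smallest D that could ever reach it, and how much D is needed at a particular N? The code below is a rough illustration only. The exponents come from earlier in the thread, while `N_C`, `D_C`, the target values, and all function names are assumptions.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095   # exponents quoted earlier in the thread
N_C, D_C = 6.4e13, 1.8e13         # ASSUMED constants, not given in the thread


def loss(n_params, n_tokens):
    """L(N, D) in the functional form quoted earlier."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D


def min_params(target_loss):
    """Smallest N that can reach target_loss given unlimited data (the L(N) limit)."""
    return N_C / target_loss ** (1 / ALPHA_N)


def min_data(target_loss):
    """Smallest D that can reach target_loss given unlimited params (the L(D) limit)."""
    return D_C / target_loss ** (1 / ALPHA_D)


def data_needed(target_loss, n_params):
    """D needed to hit target_loss at a given N, or None if N alone rules it out."""
    gap = target_loss ** (1 / ALPHA_D) - (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    return D_C / gap if gap > 0 else None


if __name__ == "__main__":
    for target in (1.6, 1.3, 1.0):  # illustrative targets, not from the thread
        print(f"target L={target}: need N > {min_params(target):.2e} "
              f"and D > {min_data(target):.2e}")
    print("D needed at N=1e15 for L=1.0:", data_needed(1.0, 1e15))
    print("D needed at N=1e13 for L=1.0:", data_needed(1.0, 1e13))
```

If either resource is capped below its minimum for a given target, that target is out of reach no matter how far the other resource is scaled, which is the sense in which the more restrictive constraint sets the achievable loss.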
Asking “what could we do with a N=1e15 model?” (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region … or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.
In Ajeya’s work, this question means “let’s assume we’re using an N=1e15 model, and then let’s assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let’s figure out how big D has to be to get there.”
So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as “the performance which you could only reach with N=1e15 params”.
What feels weird to me, which you touched on above, is the way this lets the scaling relations “backseat drive” the definition of sufficient quality for AGI. Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it… we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!
Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10e15 params or so, it must be doing some fairly important tasks which can’t be done (at least not as well) for much less than 10e15 params. Like, maybe a 10e13 param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t that big a deal, such that we can be fairly confident that these tasks require a NN of 10e13 or more params.
The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to “within a few OOMs of 10e15.”
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10e15 params is a task which requires 10e15 data points also. Because otherwise we could reduce parameter count below 10e15 and keep the same performance.
So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.
But I am now intensely curious as to how many “data points” the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10e15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.
Is the second possibility plausible? I guess so. There are about 10^9 seconds in a human lifetime, so if you are processing a million data points a second… Huh, that seems a bit much.
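For what it's worth, the back-of-the-envelope numbers here are easy to check. The sketch below assumes an 80-year lifetime (my assumption, not a figure from the thread) and otherwise just divides the hypothesized 10^15 data points by the round 10^9 seconds used above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def lifetime_seconds(years: float = 80.0) -> float:
    """Seconds in a lifetime; the 80-year default is an ASSUMPTION."""
    return years * SECONDS_PER_YEAR


def required_rate(total_data_points: float, seconds: float) -> float:
    """Data points per second needed to see total_data_points in the given time."""
    return total_data_points / seconds


if __name__ == "__main__":
    print(f"lifetime: {lifetime_seconds():.2e} seconds")
    print(f"rate for 1e15 points in 1e9 s: {required_rate(1e15, 1e9):.0e} per second")
```

So "10^9 seconds" is the right order of magnitude, and 10^15 data points over that span does work out to about a million per second.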
What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that’s why you need more data to do better—but that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing—“only” a billion data points in a lifetime, but really high-quality ones and good mechanisms for focusing on the right stuff to update on of all your sensory data coming in?
And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)
What do you think of these three possibilities?
I don’t think this step makes sense:
In the picture, it looks like there’s something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors’ choice of units.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems “have the same scaling law.” Scaling is about relationships between differences, not relationships between absolute magnitudes.
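The units point can be illustrated numerically. In the sketch below, the same fit is written once with data measured in tokens and once with data measured in forward passes of ~1e3 tokens; the only thing that changes is the constant D_c. The exponents are from earlier in the thread, while `N_C`, `D_C_TOKENS`, and the helper names are assumptions made up for this example.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095   # exponents quoted earlier in the thread
N_C = 6.4e13                      # ASSUMED constant, not given in the thread
D_C_TOKENS = 1.8e13               # ASSUMED constant, measured in tokens
TOKENS_PER_PASS = 1e3             # "~1e3 tokens" per forward pass, as above


def loss(n_params, data, d_c):
    """L(N, D) with the data axis measured in whatever unit d_c uses."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + d_c / data) ** ALPHA_D


def cusp_ratio(n_params, d_c):
    """D/N where the two terms are equal, i.e. the corner of the L-shape."""
    d_at_cusp = d_c * (n_params / N_C) ** (ALPHA_N / ALPHA_D)
    return d_at_cusp / n_params


# Re-expressing the same fit in units of forward passes only rescales D_c.
D_C_PASSES = D_C_TOKENS / TOKENS_PER_PASS

if __name__ == "__main__":
    n, tokens = 1e15, 1e15
    print("loss, token units:", loss(n, tokens, D_C_TOKENS))
    print("loss, pass units: ", loss(n, tokens / TOKENS_PER_PASS, D_C_PASSES))
    for n in (1e9, 1e12, 1e15):
        print(f"N={n:.0e}  cusp D/N: tokens={cusp_ratio(n, D_C_TOKENS):.3g}  "
              f"passes={cusp_ratio(n, D_C_PASSES):.3g}")
```

The loss is identical in both unit systems, while the D/N ratio at the cusp differs by exactly the unit conversion factor. Under this functional form the cusp ratio also drifts with N, so there is nothing privileged about a 1:1 ratio.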
On the larger topic, I’m pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for “a data point” is. This is mostly for “Could a Neuroscientist Understand a Microprocessor?”-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.
Holy shit, mind blown! Then… how are the scaling laws useful at all? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training?”), because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * “horizon length.”
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this, the human brain actually is getting ~10^7 bits of data every second, although the highest-level conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count: 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250 to 1000 times per second, which means 10^11.5 or so… actually this more or less checks out I’d say. Assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
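The orders of magnitude in this update can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated above (a 10^9-second lifetime, ~10^7 bits per second, firing rates of 250 to 1000 per second at most and about 1 per second on average); the helper name is made up for the example.

```python
import math

LIFETIME_SECONDS = 1e9          # round figure used in the thread
BITS_PER_SECOND = 1e7           # sensory input rate cited above
MAX_FIRING_HZ = (250, 1000)     # maximum neuron firing rates cited above
AVG_FIRING_HZ = 1               # average firing rate cited above


def log10_total(rate_per_second, seconds=LIFETIME_SECONDS):
    """Order of magnitude (log10) of rate * lifetime."""
    return math.log10(rate_per_second * seconds)


if __name__ == "__main__":
    print(f"'tokens' definition:  10^{log10_total(BITS_PER_SECOND):.1f}")
    low, high = (log10_total(hz) for hz in MAX_FIRING_HZ)
    print(f"'passes', max rate:   10^{low:.1f} to 10^{high:.1f}")
    print(f"'passes', avg rate:   10^{log10_total(AVG_FIRING_HZ):.1f}")
```

The choice between the maximum and the average firing rate moves the "passes" estimate by roughly three orders of magnitude, so the "more or less checks out" conclusion does depend on the assumption that the maximum rate is the relevant one.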
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience, just as GPT-3 can “few-shot learn” totally new tasks, and also “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!) but really what’s going on is just transfer learning from its vast pre-training experience.
What do you mean by horizon length here?
I intended to mean something similar to what Ajeya meant in her report:
To be clear, I’m still a bit confused about the concept of horizon length. I’m not sure it’s a good idea to think about things this way. But it seems reasonable enough for now.
I’ve been working on a draft blog post kinda related to that. If you’re interested, I can DM you a link; it could use a second pair of eyes.
Sure!