I can see why you might think an argument that requires as much computational power as a Matrioshka brain might not be super relevant to AGI timelines if you think we’re likely to get AGI before Matrioshka brains.
Thinking this would be an error though!
Compare:
Suppose all the nations of the world agreed to ban any AI experiment that requires more than 10x as much compute as our biggest current AI. Yay! Your timelines become super long!
Then suppose someone publishes a convincing proof + demonstration that 100x as much compute as our biggest current AI is sufficient for human-level AGI.
Your timelines should now be substantially shorter than they were before you learned this, even though you are confident that we’ll never get to 100x. Why? Because you should reasonably assume that if 100x is enough, plausibly 10x is enough too or will be after a decade or so of further research. Algorithmic progress is a thing, etc. etc.
The point is: The main source of timelines uncertainty is what the distribution over compute requirements looks like. Learning that e.g. +12 OOMs would probably be enough or that 10^43 ops would probably be enough are both ways of putting a soft upper bound on the distribution, which therefore translates to more probability mass in the low region that we expect to actually cross.
Your factor 10 is pulling a lot of weight there. It is not particularly uncommon to find linear factors of 10 in efficiency lying around in implementations of algorithms, or the algorithms themselves. In that sense, the one 100x algorithm directly relates to finding the 10x algorithm.
It is however practically miraculous to find linear factors of 10^10. What you tend to find instead, when speedups of that magnitude are available, is fundamental improvements to the class of algorithm you are doing. These sorts of algorithmic improvements completely detach you from your original anchor. That you can do brute-force evolution in 10^43 operations is not meaningfully more informative to whether you can do some other non-brute-force algorithm in 10^30 operations than knowing you can do brute-force evolution in 10^70 operations.
The important claim being made is just that evolution of intelligence is possible, and it’s plausible that its mechanism can be replicated in an algorithmically cheaper way. The way to use this is to think hard about how you expect that attempt to pan out in practice. The post is arguing however that there’s value specifically in 10^43 as a soft upper bound, which I don’t see.
The example I gave was more extreme, involving a mere one OOM instead of ten or twenty. But I don’t think that matters for the point I was making.
If you don’t think 10^43 is a soft upper bound, then you must think that substantial probability mass is beyond 10^43, i.e. you must think that even if we had 10^43 ops to play around with (and billions of dollars, a giant tech company, several years, etc.) we wouldn’t be able to build TAI/AI-PONR/etc. without big new ideas, the sort of ideas that come along every decade or three rather than every year or three.
And that seems pretty unreasonable to me; it seems like at least 90% of our probability mass should be below 10^43.
Hypothetically, if the Bio Anchors paper had claimed that strict brute-force evolution would take 10^60 ops instead of 10^43, what about your argument would actually change? It seems to me that none of it would, to any meaningful degree.
I’m not arguing 10^43 is the wrong bound, I’m arguing that if you don’t actually go ahead and use that number to compute something meaningful, it’s not doing any work. It’s not actually grounding a relevant prediction you should care about.
I brought up the 10^43 number because I thought it would be much easier to make this point here than it has turned out to be, but this is emblematic of the problem with this article as a whole. The article is defending that actually their math was done correctly. But Eliezer was not disputing that the math was done correctly, he was disputing that it was the correct math to be doing.
Hypothetically, if the Bio Anchors paper had claimed that strict brute-force evolution would take 10^60 ops instead of 10^43, what about your argument would actually change? It seems to me that none of it would, to any meaningful degree.
If there are 20 vs 40 orders of magnitude between “here” and “upper limit,” then you end up with ~5% vs ~2.5% on the typical order of magnitude. A factor of 2 in probability seems like a large change, though I’m not sure what you mean by “to any meaningful degree.”
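The ~5% vs ~2.5% figures follow from spreading probability evenly over orders of magnitude. A minimal sketch of that arithmetic, assuming a log-uniform prior over the whole span (the function name `mass_per_oom` is made up for illustration):

```python
def mass_per_oom(ooms_to_upper_bound: float) -> float:
    """Probability on each single order of magnitude when the mass is
    spread evenly (log-uniformly) across the whole span."""
    if ooms_to_upper_bound <= 0:
        raise ValueError("span must be a positive number of orders of magnitude")
    return 1.0 / ooms_to_upper_bound


# The two spans compared in the comment: 20 vs 40 orders of magnitude.
for span in (20, 40):
    print(f"{span} OOMs to the upper limit -> {mass_per_oom(span):.1%} per OOM")
```

Doubling the span halves the mass on every individual order of magnitude, which is the factor of 2 referred to above.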
It looks like the plausible ML extrapolations span much of the range from here to 10^40. If we were in a different world where the upper bound was much larger, it would be more plausible for someone to think that the ML-based estimates are too low.
If you’re dividing your probability from algorithms derived from evolutionary brute force uniformly between now and then, then I would consider that a meaningful degree. But that doesn’t seem like an argument being made by Bio Anchors or defended here, nor do I believe that’s a legitimate thing you can do. Would you apply this argument to other arbitrary algorithms? Like if there’s a known method of calculating X with Y times more computation, then in general there’s a 50% chance that there’s a method derived from that method for calculating X with √Y times more computation?
I think I endorse the general schema. Namely: if I believe that we can achieve X with 10^43 flops but not 10^23 flops (holding everything else constant), then I think that gives a prima facie reason to guess a 50% chance that we could achieve it with 10^33 flops.
(This isn’t fully general, like if you told me that we could achieve something with 2^(2^100) flops but not 2^(2^10) I’d be more inclined to guess a median of 2^(2^50) than 2^(2^99).)
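To make the two cases concrete, here is a rough sketch of where the median lands under different priors. It assumes the first case uses a prior uniform in the exponent, and that the double-exponential case is better served by a prior uniform in the logarithm of the exponent. That second choice is an assumption made for illustration, not something the comment specifies, and the helper name is hypothetical.

```python
import math


def median_exponent(lo_exp: float, hi_exp: float) -> float:
    """Median exponent when the exponent is uniform on [lo_exp, hi_exp]."""
    return (lo_exp + hi_exp) / 2


# Case 1: possible with 10^43 flops but not with 10^23 flops.
# Uniform in log10 puts the median at 10^33.
print(median_exponent(23, 43))  # 33.0

# Case 2: possible with 2^(2^100) flops but not with 2^(2^10).
# The log2 values of the bounds are 2**10 and 2**100 (exact Python ints).
lo_log2, hi_log2 = 2**10, 2**100

# A plain log-uniform prior has median log2 equal to the midpoint of those,
# which is dominated by the upper bound: roughly 2^99.
plain_median_log2 = (lo_log2 + hi_log2) // 2
print(math.log2(plain_median_log2))  # very close to 99

# Being uniform in log2(log2(x)) instead puts the median at 2^(2^55).
print(median_exponent(10, 100))  # 55.0
```

Under the plain log-uniform prior the median is essentially 2^(2^99), while the log-of-log version gives 2^(2^55), which is in the neighborhood of the 2^(2^50) guess.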
I’m surprised to see this bullet being bitten. I can easily think of trivial examples against the claim, where we know the minimum complexity of simple things versus their naïve implementations, but I’m not sure what arguments there are for it. It sounds pretty wild to me honestly, I have no intuition algorithmic complexity works anything like that.
If you think the probability derived from the upper limit set by evolutionary brute force should be spread out uniformly over the next 20 orders of magnitude, then I assume you think that if we bought 4 orders of magnitude today, there is a 20% chance that a method derived from evolutionary brute force will give us AGI? Whereas I would put that probability much lower, since brute force evolution is not nearly powerful enough at those scales.
I would say there is a meaningful probability that a method not derived from evolutionary brute force, particularly scaled up neural networks, will give us AGI at that point with only minimal fiddling. However that general class of techniques does not observe the evolutionary brute force upper bound. It is entirely coherent to say that as you push neural network size, their improvements flatten out and never reach AGI. The chance that a 10^16 parameter model unlocks AGI given a 10^11 parameter model doesn’t is much larger than the chance that a 10^31 parameter model unlocks AGI given a 10^26 parameter model doesn’t. So it’s not clear how you could expect neural networks to observe the upper bound for an unrelated algorithm, and so it’s unclear why their probability distribution would be affected by where exactly that upper bound lands.
I’m also unclear whether you consider this a general rule of thumb for probabilities in general, or something specific to algorithms. Would you for instance say that if there was a weak proof that we could travel interstellar with Y times better fuel energy density, then there’s a 50% chance that there’s a method derived from that method for interstellar travel with just √Y times better energy density?
I’m surprised to see this bullet being bitten. I can easily think of trivial examples against the claim, where we know the minimum complexity of simple things versus their naïve implementations, but I’m not sure what arguments there are for it. It sounds pretty wild to me honestly, I have no intuition algorithmic complexity works anything like that.
I don’t know what you mean by an “example against the claim.” I certainly agree that there is often other evidence that will improve your bet. Perhaps this is a disagreement about the term “prima facie”?
Learning that there is a very slow algorithm for a problem is often a very important indicator that a problem is solvable, and savings like 10^40 to 10^30 seem routine (and often have very strong similarities between the algorithms). And very often the running time of one algorithm is indeed a useful indicator for the running time of a very different approach. It’s possible we are thinking about different domains here, I’m mostly thinking of traditional algorithms (like graph problems, approximation algorithms, CSPs, etc.) scaled to input sizes where the computation cost is in this regime. Though it seems like the same is also true for ML (though I have much less data there and moreover all the examples are radically less crisp).
The chance that a 10^16 parameter model unlocks AGI given a 10^11 parameter model doesn’t is much larger than the chance that a 10^31 parameter model unlocks AGI given a 10^26 parameter model doesn’t.
This seems wrong but maybe for reasons unrelated to the matter at hand. (In general an unknown number is much more likely to lie between 10^11 and 10^16 than between 10^26 and 10^31, just as an unknown number is more likely to lie between 11 and 16 than between 26 and 31.)
I’m also unclear whether you consider this a general rule of thumb for probabilities in general, or something specific to algorithms. Would you for instance say that if there was a weak proof that we could travel interstellar with Y times better fuel energy density, then there’s a 50% chance that there’s a method derived from that method for interstellar travel with just √Y times better energy density?
I think it’s a good rule of thumb for estimating numbers in general. If you know a number is between A and B (and nothing else), where A and B are on the order of 10^20, then a log-uniform distribution between A and B is a reasonable prima facie guess.
This holds whether the number is “The best you can do on the task using method X” or “The best you can do on the task using any method we can discover in 100 years” or “The best we could do on this task with a week and some duct tape” or “The mass of a random object in the universe.”
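For readers who want to see what a log-uniform guess looks like in practice, here is a small sampling sketch. The endpoints 10^23 and 10^43 are borrowed from the flops example earlier in the thread purely as illustrative values, and `sample_log_uniform` is a made-up helper, not a library function.

```python
import math
import random


def sample_log_uniform(a: float, b: float, rng: random.Random) -> float:
    """Draw a value whose log10 is uniform between log10(a) and log10(b)."""
    return 10 ** rng.uniform(math.log10(a), math.log10(b))


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_log_uniform(1e23, 1e43, rng) for _ in range(100_000)]

# Under this prior, about half the draws fall below the geometric midpoint 10^33.
fraction_below_midpoint = sum(s <= 1e33 for s in samples) / len(samples)
print(f"fraction below 1e33: {fraction_below_midpoint:.3f}")
```

The point of the sketch is only that "log-uniform" means equal mass per order of magnitude, so the median sits at the geometric mean of the bounds rather than anywhere near their arithmetic mean.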
If you think the probability derived from the upper limit set by evolutionary brute force should be spread out uniformly over the next 20 orders of magnitude, then I assume you think that if we bought 4 orders of magnitude today, there is a 20% chance that a method derived from evolutionary brute force will give us AGI? Whereas I would put that probability much lower, since brute force evolution is not nearly powerful enough at those scales.
I don’t know what “derived from evolutionary brute force” means (I don’t think anyone has said those words anywhere in this thread other than you?)
But in terms of P(AGI), I think that “20% for next 4 orders of magnitude” is a fine prima facie estimate if you bring in this single consideration and nothing else. Of course I don’t think anyone would ever do that, but frankly I still think “20% for the next 4 orders of magnitude” is still better than most communities’ estimates.
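The "20% for the next 4 orders of magnitude" figure is the closed-form mass of an interval under a log-uniform prior spanning 20 orders of magnitude. A minimal sketch, assuming "here" is 10^23 and the upper bound is 10^43 (both numbers are placeholders chosen to give a 20-OOM span, and `log_uniform_mass` is a hypothetical helper):

```python
import math


def log_uniform_mass(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo <= X <= hi) for X log-uniform on [a, b], clipped to that support."""
    lo_clipped, hi_clipped = max(lo, a), min(hi, b)
    if hi_clipped <= lo_clipped:
        return 0.0
    return (math.log10(hi_clipped) - math.log10(lo_clipped)) / (
        math.log10(b) - math.log10(a)
    )


# Placeholder endpoints: "here" = 1e23, soft upper bound = 1e43 (20 OOMs apart).
next_four_ooms = log_uniform_mass(1e23, 1e27, 1e23, 1e43)
print(f"mass on the next 4 OOMs: {next_four_ooms:.0%}")
```

This brings in only the single consideration of the bounds, as the comment says. Any further evidence about where in the range the answer lies would reshape the distribution.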