Thanks again for the detailed reply; I feel like I’m coming to understand you (and fusion!) much better.
You may indeed be hoping the OP is something it’s not.
That said, I think I have more to say in agreement with your strong position:
> There is an issue here with trading off scale and manufacturing to substitute for complexity and the things we don’t understand.
> ‘Part 1: Extra brute force can make the problem a lot easier’ says that you can do this sort of trade for AI, and it justifies this in part by drawing analogy to flight. But it’s hard to see what intrinsically motivates this comparison specifically, because trading off a motor’s power-to-weight ratio for physical upness is very different to trading off a computer’s FLOP rate for abstract thinkingness. I assumed you did this because you believed (as I do) that this sort of argument is general. Hence, a general argument should apply generally, so unless there’s something special about fusion, it should apply there too. If you don’t believe it’s a general sort of argument, then why the comparison to flight, rather than to useful, self-replicating nanoscale robots?
> If instead you’re just drawing comparison to flight to say it’s potentially possible that compute is fungible with complexity, rather than it being likely, then it just seems like not a very impactful argument.
1. I don’t know enough about nanotech to say whether it’s a counterexample to Shorty’s position. Currently I suspect it isn’t. This is a separate issue from the one you raise, which is whether it’s a counterexample to the position “In general, you can substitute brute force in some variables for special sauce.” Call this position the strong view.
2. I’m not sure whether I hold the strong view. I certainly didn’t try to argue for it in the OP (though I suppose I did present a small amount of evidence for it).
3. I do hold the strong-view-applied-to-AI. That is, I do think we can make the problem of building TAI easier by using more compute. (As you say, compute is fungible with complexity.) I gave two reasons for this in the OP: we can scale up the key variables, and we can use compute to automate the search for special sauce. I think both of these reasons are solid on their own; I don’t need to appeal to historical case studies to justify them.
4. I am happy to expand on both arguments if you like. I think “can use compute to automate search for special sauce” is pretty self-explanatory. The “can scale up the key variables” claim is based on deep learning theory as I understand it, which is that bigger neural nets work by containing more and better lottery tickets (and you need to train longer to isolate and promote those tickets from the sludge of competitor subnetworks?). And neural networks are universal function approximators. So whatever skill it is that humans do and that you are trying to get an AI to do, with a big enough neural net trained on enough data, you’ll succeed. And “big enough” means probably about the size of the human brain. This is just the sketch of a skeleton of an argument of course, but I could go on...
Thanks, I think I pretty much understand your framing now.
I think the only thing I really disagree with is that “‘can use compute to automate search for special sauce’ is pretty self-explanatory.” I think this depends heavily on what sort of variable you expect the special sauce to be. E.g., for useful, self-replicating nanoscale robots, my hypothetical atomic manufacturing technology would enable rapid automated iteration, but it’s unclear how you could use that to automatically search for a solution in practice. It’s an enabler for research, more so than a substitute. Personally, I’m not sure how I’d justify that claim for AI without importing a whole bunch of background knowledge about the generality of optimization procedures!
IIUC this is mostly outside the scope of what your article was about, and we don’t disagree on the meat of the matter, so I’m happy to leave this here.
I think I agree that it’s not clear compute can be used to search for special sauce in general, but in the case of AI it seems pretty clear to me: AIs themselves run on computers, and the capabilities we are interested in (some of them, at least) can be detected on AIs in simulations (no need for e.g. robotic bodies), so we can do trial-and-error on our AI designs in proportion to how much compute we have. More compute, more trial-and-error. (Except it’s more efficient than mere trial-and-error: we have access to all sorts of learning and meta-learning and architecture search algorithms, not to mention human insight.) If you had enough compute, you could just simulate the entire history of life evolving on an earth-sized planet for a billion years, in a very detailed and realistic physics environment!
Eventually the conclusion holds trivially, sure, but that takes us very far from the HBHL anchor. Most evolutionary algorithms we run today are very constrained in what programs they can generate, and are run over small models for a small number of iteration steps. A more general search would be exponentially slower, and even more disconnected from current ML. If you expect that sort of research to be pulling a lot of weight, you probably shouldn’t expect the result to look like large connectionist models trained on lots of data, and you lose most of the argument for anchoring to HBHL.
A more standard framing is that ‘we can do trial-and-error on our AI designs’, but there we’re again in a regime where scale is an enabler for research, more so than a substitute for it. Architecture search will still fine-tune and validate these ideas, but is less likely to drive them directly in a significant way.
> Eventually the conclusion holds trivially, sure, but that takes us very far from the HBHL anchor.
It takes us about 17 orders of magnitude away from the HBHL anchor, in fact. Which is not very far, when you think about it. Divide 100 percentage points of probability mass evenly across those 17 orders of magnitude, and you get almost 6% per OOM, which means something like 4x as much probability mass on the HBHL anchor as Ajeya puts on it in her report!
> If you expect that sort of research to be pulling a lot of weight, you probably shouldn’t expect the result to look like large connectionist models trained on lots of data, and you lose most of the argument for anchoring to HBHL.
I don’t follow this argument. It sounds like double-counting to me, like: “If you put some of your probability mass away from HBHL, that means you are less confident that AI will be made in the HBHL-like way, which means you should have even less of your probability mass on HBHL.”
> A more standard framing is that ‘we can do trial-and-error on our AI designs’, but there we’re again in a regime where scale is an enabler for research, more so than a substitute for it. Architecture search will still fine-tune and validate these ideas, but is less likely to drive them directly in a significant way.
I’m not sure I get the distinction between enabler and substitute, or why it is relevant here. The point is that we can use compute to search for the missing special sauce. Maybe humans are still in the loop; sure.
> It takes us about 17 orders of magnitude away from the HBHL anchor, in fact. Which is not very far, when you think about it. Divide 100 percentage points of probability mass evenly across those 17 orders of magnitude, and you get almost 6% per OOM, which means something like 4x as much probability mass on the HBHL anchor as Ajeya puts on it in her report!
I don’t understand what you’re doing here. Why 17 orders of magnitude, and why would I split 100% across each order?
> I don’t follow this argument. It sounds like double-counting to me
Read ‘and therefore’, not ‘and in addition’. The point is that the more you spend your compute on search, the less directly your search can exploit computationally expensive models.
Put another way, if you have HBHL compute but spend nine orders of magnitude on search, then the per-model compute is much less than HBHL, so the reasons to argue for HBHL don’t apply to it. Equivalently, if your per-model compute estimate is HBHL, then the HBHL metric is only relevant for timelines if search is fairly limited.
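As a rough sketch of the bookkeeping behind this point, the snippet below assumes the search splits a fixed budget evenly across its candidates, which is a simplification rather than something either commenter claims. The function names and the 10^24 FLOP stand-in for HBHL compute are illustrative assumptions.

```python
def per_model_log10_flop(total_log10_flop: float, search_ooms: float) -> float:
    """log10 of the compute each candidate gets when 10**search_ooms
    candidates share the total budget evenly (an assumed simplification)."""
    return total_log10_flop - search_ooms


def total_log10_flop_needed(per_model_log10: float, search_ooms: float) -> float:
    """The same relation read the other way: the total budget needed if
    every candidate must receive a fixed per-model budget."""
    return per_model_log10 + search_ooms


# Hypothetical placeholder for HBHL compute, as a base-10 exponent of FLOP.
HBHL_LOG10_FLOP = 24.0

# Spending nine OOMs of an HBHL-sized budget on search leaves each
# candidate nine OOMs short of HBHL.
print(per_model_log10_flop(HBHL_LOG10_FLOP, 9))

# Keeping every candidate at HBHL size pushes the total budget up by the
# number of OOMs spent on search.
print(total_log10_flop_needed(HBHL_LOG10_FLOP, 9))
```

Both functions express one identity in log space; which one is relevant depends on whether the total budget or the per-model budget is the thing being anchored to HBHL.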
> I’m not sure I get the distinction between enabler and substitute, or why it is relevant here. The point is that we can use compute to search for the missing special sauce. Maybe humans are still in the loop; sure.
Motors are an enabler in the context of flight research because they let you build and test designs, learn what issues to solve, build better physical models, and verify good ideas.
Motors are a substitute in the context of flight research because a better motor means more, easier, and less optimal solutions become viable.
Ajeya estimates (and I agree with her) how much compute it would take to recapitulate evolution, i.e. simulate the entire history of life on earth evolving for a billion years etc. The number she gets is 10^41 FLOP give or take a few OOMs. That’s 17 OOMs away from where we are now. So if you take 10^41 as an upper bound, and divide up the probability evenly across the OOMs… Of course it probably shouldn’t be a hard upper bound, so instead of dividing up 100 percentage points you should divide up 95 or 90 or whatever your credence is that TAI could be achieved for 10^41 or less compute. But that wouldn’t change the result much, which is that a naive, flat-across-orders-of-magnitude-up-until-the-upper-bound-is-reached distribution would assign substantially higher probability to Shorty’s position than Ajeya does.
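The arithmetic in this paragraph can be checked with a few lines. This is only a sketch of the naive flat-across-OOMs calculation described above; the function name is an illustrative assumption, and the 17-OOM gap and the 100/95/90 credences are the figures quoted in the thread.

```python
def flat_mass_per_oom(total_mass_pct: float, ooms: int) -> float:
    """Percentage points of probability per order of magnitude when
    total_mass_pct is spread evenly over `ooms` orders of magnitude."""
    return total_mass_pct / ooms


OOMS_TO_UPPER_BOUND = 17  # gap between "now" and the 10^41 FLOP bound, per the thread

for credence_pct in (100, 95, 90):
    per_oom = flat_mass_per_oom(credence_pct, OOMS_TO_UPPER_BOUND)
    print(f"{credence_pct}% total -> {per_oom:.2f}% per OOM")
```

Dropping the total from 100 to 90 percentage points moves the per-OOM figure by well under one percentage point, which is the sense in which the softer upper bound "wouldn’t change the result much."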
I’m still not following the argument. I agree that you won’t be able to use your HBHL compute to do search over HBHL-sized brains+childhoods, because if you only have HBHL compute, you can only do one HBHL-sized brain+childhood. But that doesn’t undermine my point, which is that as you get more compute, you can use it to do search. So e.g. when you have 3 OOMs more compute than the HBHL milestone, you can do automated search over 1000 HBHL-sized brains+childhoods. (Also, I suppose that even when you only have HBHL compute, you could do search over architectures and childhoods that are a little bit smaller and hope that the lessons generalize.)
I think part of what might be going on here is that since Shorty’s position isn’t “TAI will happen as soon as we hit HBHL” but rather “TAI will happen shortly after we hit HBHL,” there’s room for an OOM or three of extra compute beyond the HBHL to be used. (Compute costs decrease fairly quickly, and investment can increase much faster, and probably will when TAI is nigh.) I agree that we can’t use compute to search for special sauce if we only have exactly HBHL compute (setting aside the parenthetical in the previous paragraph, which suggests that we can).
Well, I understand now where you get the 17, but I don’t understand why you want to spread it uniformly across the orders of magnitude. Shouldn’t you put all the probability mass for the brute-force evolution approach on some gaussian around where we’d expect that to land, and only have probability elsewhere to account for competing hypotheses? Like I think it’s fair to say the probability of a ground-up evolutionary approach only using 10-100 agents is way closer to zero than to 4%.
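To make the contrast concrete, here is a minimal sketch comparing how much mass a single order-of-magnitude bin receives under a flat spread versus under a gaussian centred on the evolution estimate. The centre of 41 (log10 FLOP) comes from the thread; the standard deviation of 3 OOMs is an assumed reading of "give or take a few OOMs", and the bin edges of 24 and 25 are assumed from the 17-OOM gap. All function names are illustrative.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Standard normal CDF evaluated via erfc, which stays accurate far in the tail."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2.0)))


def gaussian_bin_mass(lo: float, hi: float, mean: float, sd: float) -> float:
    """Probability that a gaussian over log10(FLOP) lands in [lo, hi]."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)


def flat_bin_mass(lo: float, hi: float, start: float, end: float) -> float:
    """Probability of [lo, hi] under a uniform distribution on [start, end]."""
    overlap = max(0.0, min(hi, end) - max(lo, start))
    return overlap / (end - start)


EVOLUTION_MEAN = 41.0  # log10 FLOP, figure quoted in the thread
EVOLUTION_SD = 3.0     # assumption: "a few OOMs" read as one standard deviation
NOW = 24.0             # assumption: 41 minus the 17-OOM gap

# Mass on the single OOM nearest "now" under each distribution.
print("flat:    ", flat_bin_mass(NOW, NOW + 1, NOW, EVOLUTION_MEAN))
print("gaussian:", gaussian_bin_mass(NOW, NOW + 1, EVOLUTION_MEAN, EVOLUTION_SD))
```

Under these assumed parameters the gaussian puts orders of magnitude less mass on the lowest bin than the flat spread does, which is the sense of "way closer to zero than to 4%." A wider assumed standard deviation narrows the gap, so the result depends on that choice.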
> I’m still not following the argument. [...] So e.g. when you have 3 OOMs more compute than the HBHL milestone
I think you’re mixing up my paragraphs. I was referring here to cases where you’re trying to substitute searching over programs for the AI special sauce.
If you’re in the position where searching 1000 HBHL hypotheses finds TAI, then the implicit assumption is that model scaling has already substituted for the majority of AI special sauce, and the remaining search is just an enabler for figuring out the few remaining details. That or that there wasn’t much special sauce in the first place.
To maybe make my framing a bit more transparent, consider the example of a company trying to build useful, self-replicating nanoscale robots using an atomically precise 3D printer under the conditions where 1) nobody there has a good idea of how to go about doing this, and 2) you have 1000 tries.
Sorry I didn’t see this until now!
I agree that for the brute-force evolution approach, we should have a gaussian around where we’d expect that to land. My “Let’s just do it evenly across all the OOMs between now and evolution” is only a reasonable first-pass approach to what our all-things-considered distribution should be like, including evolution but also various other strategies. (Even better would be having a taxonomy of the various strategies and a gaussian for each; this is sorta what Ajeya does. The problem is that insofar as you don’t trust your taxonomy to be exhaustive, the resulting distribution is untrustworthy as well.) I think it’s reasonable to extend the probability mass down to where we are now, because we are currently at the HBHL milestone pretty much, which seems like a pretty relevant milestone to say the least.
> If you’re in the position where searching 1000 HBHL hypotheses finds TAI, then the implicit assumption is that model scaling has already substituted for the majority of AI special sauce, and the remaining search is just an enabler for figuring out the few remaining details. That or that there wasn’t much special sauce in the first place.
This seems right to me.
> To maybe make my framing a bit more transparent, consider the example of a company trying to build useful, self-replicating nanoscale robots using an atomically precise 3D printer under the conditions where 1) nobody there has a good idea of how to go about doing this, and 2) you have 1000 tries.
I like this analogy. I think our intuitions about how hard it would be might differ though. Also, our intuitions about the extent to which nobody has a good idea of how to make TAI might differ too.
> Also, our intuitions about the extent to which nobody has a good idea of how to make TAI might differ too.
To be clear I’m not saying nobody has a good idea of how to make TAI. I expect pretty short timelines, because I expect the remaining fundamental challenges aren’t very big.
What I don’t expect is that the remaining fundamental challenges go away through small-N search over large architectures, if the special sauce does turn out to be significant.