What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”;
I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.
Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don’t we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?
If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):
The most exhaustive retrospective analysis of historical technology forecasts we have yet found, Mullins (2012), categorized thousands of published technology forecasts by methodology, using eight categories including “multiple methods” as one category. [...] However, when comparing success rates for methodologies solely within the computer technology area tag, quantitative trend analysis performs slightly below average [...]
(The link in the quote appears to be broken; here is one that works.)
NAS seems to me like a good example for an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model (without human involvement). But my argument here applies to any such computation. I think it may even apply to a ‘$1B SGD’ (on a single huge network), if we consider a gradient update (or a sequence thereof) to be an “exploration step in idea-space”.
Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did?
I first need to understand what “human-level AGI” means. Can models in this category pass strong versions of the Turing test? Does this category exclude systems that outperform humans on one or more important dimensions? (It seems to me that the first SGD-trained model that passes strong versions of the Turing test may be a superintelligence.)
In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?
Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel—like the neural architecture space.]
If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:
The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)
The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)
In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don’t understand the question “why wasn’t it automated earlier?”
In the second point, I need to first understand how you define that moment in which “humans are replaced”. (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)
Meta: I feel like I am arguing for “there will not be a discontinuity”, and you are interpreting me as arguing for “we will not get AGI soon / AGI will not be transformative”, neither of which I believe. (I have wide uncertainty on timelines, and I certainly think AGI will be transformative.) I’d like you to state what position you think I’m arguing for, tabooing “discontinuity” (not the arguments for it, just the position).
I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.
I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me. I’m more uncertain about the fire alarm question.
quantitative trend analysis performs slightly below average [...] NAS seems to me like a good example for an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model [...] it may even apply to a ‘$1B SGD’ (on a single huge network) [...] the $1B NAS may indeed just get lucky
This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”. I am not compelled by arguments that tell me to worry about scenario X without giving me a reason to believe that scenario X is likely. (Compare: “we can’t rule out the possibility that the simulators want us to build a tower to the moon or else they’ll shut off the simulation, so we better get started on that moon tower.”)
This is not to say that such scenario X’s must be false—reality could be that way—but that given my limited amount of time, I must prioritize which scenarios to pay attention to, and one really good heuristic for that is to focus on scenarios that have some inside-view reason that makes me think they are likely. If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).
Some other more tangential things:
If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution).
The trend that changed in 2012 was that of the amount of compute applied to deep learning. I suspect trend extrapolation with compute as the x-axis would do okay; trend extrapolation with calendar year as the x-axis would do poorly. But as I mentioned above, this is not a crux for me, since it doesn’t give me an inside-view reason to expect FOOM; I wouldn’t even consider it weak evidence for FOOM if I changed my mind on this. (If the data showed a big discontinuity, that would be evidence, but I’m fairly confident that while there was a discontinuity it was relatively small.)
I’d like you to state what position you think I’m arguing for
I think you’re arguing for something like: Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.
(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)
I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me.
Using the “fire alarm” concept here was a mistake, sorry for that. Instead of writing:
I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.
I should have written:
I’m pretty agnostic about whether the result of that $100M NAS would be “almost AGI”.
This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”.
I generally have a vague impression that many AIS/x-risk people tend to place too much weight on trend extrapolation arguments in AI (or tend to not give enough attention to important details of such arguments), which may have triggered me to write the related stuff (in response to you seemingly applying a trend extrapolation argument with respect to NAS). I was not listing the reasons for my beliefs specifically about NAS.
If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).
(I’m mindful of your time and so I don’t want to branch out this discussion into unrelated topics, but since this seems to me like a potentially important point...) Even if we did have infinite time and the ability to somehow determine the correctness of any given hypothesis with super-high-confidence, we may not want to evaluate all hypotheses—that involve other agents—in arbitrary order. Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time). For example, after considering some game-theoretical meta considerations we might decide to make certain binding commitments before evaluating such and such hypotheses; or we might decide about what additional things we should consider or do before evaluating some other hypotheses, etcetera.
Conditioned on the first AGI being aligned, it may be important to figure out how to make sure that that AGI “behaves wisely” with respect to this topic (because the AGI might be able to evaluate a lot of weird hypotheses that we can’t).
Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time).
Can you give me an example? I don’t see how this would work.
(Tbc, I’m imagining that the universe stops, and only I continue thinking; there are no other agents thinking while I’m thinking, and so afaict I should just implement UDT.)
Creating some sort of commitment device that would bind us to follow UDT—before we evaluate some set of hypotheses—is an example of one potentially consequential intervention.
As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn’t necessarily work well (or is not even well-defined?).
Also, if we use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model “follows” a good decision theory. (Or does this happen by default? Does it depend on whether “following a good decision theory” is helpful for minimizing expected loss on the training set?)
Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.
It wasn’t exactly that (in particular, I didn’t have the researchers’ beliefs in mind), but I also believe that statement for basically the same reasons, so that should be fine. There’s a lot of ambiguity in that statement (specifically, what is AGI), but I probably believe it for most operationalizations of AGI.
(For reference, I was considering “will there be a 1 year doubling of economic output that started before the first 4 year doubling of economic output ended”; for that it’s not sufficient to just argue that we will get AGI suddenly, you also have to argue that the AGI will very quickly become superintelligent enough to double economic output in a very short amount of time.)
I’m pretty agnostic about whether the result of that $100M NAS would be “almost AGI”.
I mean, the difference between a $100M NAS and a $1B NAS is:
Up to 10x the number of models evaluated
Up to 10x the size of models evaluated
If you increase the number of models by 10x and leave the size the same, that somewhat increases your optimization power. If you model the NAS as picking architectures randomly, the $1B NAS can have at most 10x the chance of finding AGI, regardless of fragility, and so can only have at most 10x the expected “value” (whatever your notion of “value”).
If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference, e.g. the max of n draws from Uniform([0, 1]) has expected value n/(n+1) = 1 − 1/(n+1), so once n is already large (e.g. 100), increasing it makes ~no difference. Of course, our actual distributions will probably be more bottom-heavy, but as distributions get more bottom-heavy we use gradient descent / evolutionary search to deal with that.
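(The n/(n+1) claim is easy to sanity-check numerically; here is a minimal Monte Carlo sketch, with arbitrary trial counts chosen only for speed:)

```python
import random

def expected_max(n, trials=5000, seed=0):
    """Monte Carlo estimate of E[max of n draws from Uniform(0, 1)]."""
    rng = random.Random(seed)
    return sum(max(rng.random() for _ in range(n)) for _ in range(trials)) / trials

# Analytic value is n / (n + 1): going from n=100 to n=1000 (10x the
# optimization power under this toy model) barely moves the expected max.
for n in (10, 100, 1000):
    print(n, round(expected_max(n), 3), round(n / (n + 1), 3))
```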
For the size, it’s possible that increases in size lead to huge increases in intelligence, but that doesn’t seem to agree with ML practice so far. Even if you ignore trend extrapolation, I don’t see a reason to expect that increasing model sizes should mean the difference between not-even-close-to-AGI and AGI.
If you model the NAS as picking architectures randomly
I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)
If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference,
Earlier in this discussion you defined fragility as the property “if you make even a slight change to the thing, then it breaks and doesn’t work”. While finding fragile solutions is hard, finding a non-fragile solution is not necessarily easy, so I don’t follow the logic of that paragraph.
Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, as long as [number of runs] ≪ 10^10).
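(Under the strong simplifying assumption that each run is an independent draw from the reachable set, the “10x runs ≈ 10x probability” step is just the linear regime of 1 − (1 − p)^k. A numerically stable sketch, using the illustrative 10^-10 ratio from above, which is not a real estimate:)

```python
import math

def hit_probability(k, p):
    """P(at least one of k independent runs lands in a subset of ratio p).

    Equals 1 - (1 - p)^k, computed via expm1/log1p to stay accurate for tiny p.
    """
    return -math.expm1(k * math.log1p(-p))

p = 1e-10  # illustrative size ratio of AGI architectures (hypothetical)

# While k * p << 1, probability scales ~linearly with the number of runs:
print(hit_probability(10, p) / hit_probability(1, p))  # ~10

# ...and the linear approximation breaks down once k * p is of order 1:
print(hit_probability(10**10, p))  # ~0.63 (i.e. 1 - 1/e), not 1.0
```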
I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)
I do think that similar conclusions apply there as well, though I’m not going to make a mathematical model for it.
finding a non-fragile solution is not necessarily easy
I’m not saying it is; I’m saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say
adding more optimization power doesn’t make much of a difference
I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.
Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, as long as [number of runs] ≪ 10^10).
(Aside: it would be way smaller than 10^-10.) In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. 10^-9), and so you’re more likely to find one of those first.
In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is 10^-1000. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.
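(One way to make the ordering point concrete: under uniform random search, the number of draws until the first hit on a subset of size ratio p is geometric with mean 1/p, so every order of magnitude in size ratio is an order of magnitude in expected search time. A toy computation with the illustrative ratios above, which are hypothetical numbers, not estimates:)

```python
# Expected number of uniform random draws until the first hit on a
# subset of size ratio p is 1/p (geometric distribution).
ratio_almost_agi = 1e-9   # hypothetical size ratio of almost-AGI architectures
ratio_agi = 1e-10         # hypothetical size ratio of AGI architectures

draws_almost = 1 / ratio_almost_agi  # ~1e9 draws on average
draws_agi = 1 / ratio_agi            # ~1e10 draws on average

# The almost-AGI set is expected to be hit ~10x sooner; with several
# orders of magnitude between the ratios, finding almost-AGI first
# becomes essentially guaranteed.
print(draws_agi / draws_almost)  # ~10
```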
In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. 10^-9), and so you’re more likely to find one of those first.
For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
Agreed. My point is that the $100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say “and that’s why the $1B NAS finds AGI while the $100M NAS doesn’t”, my response would be that “well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, that the $100M NAS would have achieved”.)