I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)
I do think that similar conclusions apply there as well, though I’m not going to make a mathematical model for it.
finding a non-fragile solution is not necessarily easy
I’m not saying it is; I’m saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say
adding more optimization power doesn’t make much of a difference
I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.
Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be $10^{-10}$ (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, if $[\text{number of runs}] \ll 10^{10}$).
(Aside: it would be way smaller than $10^{-10}$.) In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. $10^{-9}$), and so you’re more likely to find one of those first.
In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is $10^{-1000}$. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.
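To make the scaling intuition behind those size ratios concrete, here is a minimal back-of-the-envelope sketch (the ratios $10^{-10}$ and $10^{-9}$ are just the illustrative values from above, not estimates about any real architecture space). Treating each search run as an independent draw, the chance of having hit a set with size ratio p after N runs is $1-(1-p)^N \approx Np$ when $Np \ll 1$, so the 10x-larger almost-AGI set gets hit roughly 10x as often at every budget:

```python
# Back-of-the-envelope sketch: probability of having sampled an architecture
# from a set of relative size `ratio` after `runs` independent draws.
# The ratios are the illustrative values from the discussion above, not
# estimates about any real architecture search space.

def hit_probability(ratio: float, runs: int) -> float:
    """P(at least one draw lands in the set) = 1 - (1 - ratio)^runs."""
    return 1.0 - (1.0 - ratio) ** runs

AGI_RATIO = 1e-10         # hypothetical size ratio of "AGI architectures"
ALMOST_AGI_RATIO = 1e-9   # hypothetical size ratio of "almost-AGI architectures"

for runs in (10**6, 10**7, 10**8):
    p_agi = hit_probability(AGI_RATIO, runs)
    p_almost = hit_probability(ALMOST_AGI_RATIO, runs)
    print(f"runs={runs:,}  P(AGI) ~ {p_agi:.1e}  P(almost-AGI) ~ {p_almost:.1e}  "
          f"(~{p_almost / p_agi:.0f}x)")
```

In the regime where both probabilities are small, the almost-AGI set is hit about an order of magnitude more often at any given budget, which is the sense in which you’d expect to stumble on an almost-AGI architecture first.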
In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. $10^{-9}$), and so you’re more likely to find one of those first.
For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
Agreed. My point is that the $100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say “and that’s why the $1B NAS finds AGI while the $100M NAS doesn’t”, my response would be that “well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, which the $100M NAS would have achieved”.)
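As a toy illustration of that last point (a sketch only; the landscape, the hill-climbing procedure, and the budgets below are all made up and are not meant to model real NAS): run the same random-restart local search twice, with the second run given 10x the evaluation budget as a stand-in for the $1B vs. $100M spend. On rugged landscapes like this, the larger budget typically buys a small improvement in the best architecture found rather than a jump, i.e. the smaller run has usually already found something almost as good.

```python
import random

# Toy model only: an "architecture" is a tuple of discrete parameter settings
# and its "fitness" is a synthetic, rugged function with many local optima.
# Nothing here is meant to model real NAS or real architecture spaces.

N_PARAMS, N_SETTINGS = 20, 10

def fitness(arch: tuple) -> float:
    # Deterministic pseudo-random score per architecture, so the landscape
    # is rugged but reproducible.
    return random.Random(hash(arch)).random()

def neighbors(arch: tuple):
    # All architectures that differ in exactly one parameter setting.
    for i in range(N_PARAMS):
        for s in range(N_SETTINGS):
            if s != arch[i]:
                yield arch[:i] + (s,) + arch[i + 1:]

def local_search_nas(budget: int, seed: int = 0) -> float:
    """Random-restart hill climbing; `budget` = total number of evaluations."""
    rng = random.Random(seed)
    evals, best = 0, 0.0
    while evals < budget:
        arch = tuple(rng.randrange(N_SETTINGS) for _ in range(N_PARAMS))
        score = fitness(arch)
        evals += 1
        improved = True
        while improved and evals < budget:
            improved = False
            for nb in neighbors(arch):
                nb_score = fitness(nb)
                evals += 1
                if nb_score > score:
                    arch, score, improved = nb, nb_score, True
                    break
                if evals >= budget:
                    break
        best = max(best, score)
    return best

print("baseline budget:", local_search_nas(budget=50_000))
print("10x budget     :", local_search_nas(budget=500_000))
```

The point of the toy is only that, for a fixed search procedure, the best-found quality tends to improve smoothly with budget, so whatever the 10x-larger run finds, the smaller run has typically already found something only slightly worse.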