Conclusion: Most of what you want to measure comes down to neural network training. The training framework is not directly comparable or backwards-compatible with old techniques, so the experiment formulation has to address this.
This seems right if “the dynamics of ML R&D are unrelated to other software R&D—you can’t learn about neural net efficiency improvements by looking at efficiency improvements in other domains.” But I’m not so sure about that (and haven’t seen any evidence for it).
ETA: to clarify, I’m mostly interested in how much future AI will get improved as we massively scale up R&D investment (by applying AI to AI development). This includes e.g. “Tweaking neural net architectures” or “Better optimization algorithms for neural networks” or “better ways to integrate neural networks with search” or whatever. Those improvements are indeed different from “better forms of tree search” or “better position evaluations” and so on. But I still think they are related—if I learn that for a few different domains “doubling R&D doubles performance,” then that gives me evidence that neural net performance will be similar, and if I learn that this kind of return is very rare then I’ll be more skeptical about that kind of extrapolation holding up even if I observe it for the first few orders of magnitude for neural networks.
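To make the kind of extrapolation I have in mind concrete, here is a minimal sketch of what "doubling R&D doubles performance" looks like as a power law. The exponent and the scale-up factors are purely illustrative assumptions, not estimates from any of the domains discussed here.

```python
# Toy illustration of the "doubling R&D doubles performance" extrapolation.
# The exponent and baseline values are hypothetical placeholders.

def performance(rnd_investment, baseline_perf=1.0, baseline_rnd=1.0, exponent=1.0):
    """Power-law returns: doubling R&D multiplies performance by 2**exponent."""
    return baseline_perf * (rnd_investment / baseline_rnd) ** exponent

# If the observed return holds over several orders of magnitude of extra R&D:
for scale_up in [1, 10, 100, 1000]:
    print(f"{scale_up:>5}x R&D -> {performance(scale_up):>6.0f}x performance")
```

The point of the extrapolation argument is then whether an exponent like this, measured in a few other domains, is evidence about the exponent we should expect for neural network R&D.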
As you can see in my figure in this post (https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang), Leela (a neural-network-based chess engine) shows log-linear Elo-FLOPs scaling very similar to that of traditional algorithms. At least in this case, neural networks scale slightly better with more compute, and worse with less compute. It would be interesting to determine whether the poor scaling on old machines is a universal feature of NNs. Perhaps it is: NNs require a certain minimum amount of memory and similar resources, which imposes harder constraints. The conclusion would be that the hardware overhang is reduced: older hardware is less suitable for NNs.
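To illustrate the comparison, here is a minimal sketch of fitting Elo as a log-linear function of compute for the two engine families and finding the crossover point. The numbers are made up for illustration and are not the data behind the linked figure; only the functional form (Elo roughly linear in log FLOPs, with the NN engine having the steeper slope) reflects the claim above.

```python
# Sketch of the log-linear Elo-vs-compute comparison; values are illustrative only.
import numpy as np

# Hypothetical (log10 FLOPs, Elo) samples for a traditional engine and a NN engine.
log_flops = np.array([9.0, 10.0, 11.0, 12.0, 13.0])
elo_traditional = np.array([2400, 2550, 2700, 2850, 3000])   # shallower slope
elo_nn          = np.array([2200, 2450, 2700, 2950, 3200])   # steeper slope

# Fit Elo ~ a + b * log10(FLOPs) for each family.
b_trad, a_trad = np.polyfit(log_flops, elo_traditional, 1)
b_nn, a_nn = np.polyfit(log_flops, elo_nn, 1)

print(f"traditional: Elo ~ {a_trad:.0f} + {b_trad:.0f} * log10(FLOPs)")
print(f"NN engine:   Elo ~ {a_nn:.0f} + {b_nn:.0f} * log10(FLOPs)")

# Crossover: below this compute the traditional engine is ahead, above it the
# NN engine is ahead -- the sense in which the hardware overhang is reduced.
crossover = (a_trad - a_nn) / (b_nn - b_trad)
print(f"crossover at ~10^{crossover:.1f} FLOPs")
```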