Well, I’ve got five tentative answers to Question One in this post. Roughly, they are: Souped-up AlphaStar, Souped-up GPT, Evolution Lite, Engineering Simulation, and Emulation Lite. Five different research programs, basically. It sounds like what you are talking about is sufficiently different from these five, and also sufficiently promising/powerful/‘fun’, that it would be a worthy addition to the list. So, to flesh it out, maybe you could say something like: “Here are some examples of meta-learning/NAS/AI-GAs in practice today. Here’s a sketch of what you could do if you scaled all this up by +12 OOMs. Here’s an argument for why this would be really powerful.”
There’s a ton of work in meta-learning, including Neural Architecture Search (NAS). AI-GAs (Clune) is a paper that argues a similar POV to the one I’d describe here, so I’d check that out.
I’ll just say “why it would be powerful”: the promise of meta-learning is that—just like learned features outperform engineered features—learned learning algorithms will eventually outperform engineered learning algorithms. Taking the analogy seriously would suggest that the performance gap will be large—a quantitative step-change.
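To make that concrete, here is a minimal toy sketch (my own illustration, not from the AI-GAs paper, and the task and numbers are made up): the “learned learning algorithm” is nothing more than an SGD learning rate and momentum found by an outer random-search loop, but NAS and learned optimizers scale up exactly this inner/outer nesting to architectures and update rules.

```python
# Toy caricature of meta-learning, using only NumPy.
# Inner loop: train on a task with a given learning algorithm.
# Outer loop: search over learning algorithms (here, just two SGD hyperparameters).
import numpy as np

rng = np.random.default_rng(0)

# A toy inner task: fit w to noisy linear data.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)

def inner_train(lr, momentum, steps=100):
    """Run minibatch SGD with momentum and return the final training loss."""
    w = np.zeros(5)
    v = np.zeros(5)
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=32)                  # minibatch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
        v = momentum * v - lr * grad
        w = w + v
    return float(np.mean((X @ w - y) ** 2))

# Outer loop: "learn the learning algorithm" by random search over its parameters.
best = None
for _ in range(50):
    lr = 10 ** rng.uniform(-4, -1)
    momentum = rng.uniform(0.0, 0.99)
    loss = inner_train(lr, momentum)
    if np.isfinite(loss) and (best is None or loss < best[0]):
        best = (loss, lr, momentum)

print(f"hand-picked  lr=1e-3, momentum=0.90 -> loss {inner_train(1e-3, 0.9):.4f}")
print(f"meta-learned lr={best[1]:.4g}, momentum={best[2]:.2f} -> loss {best[0]:.4f}")
```

The only point is the division of labour: the inner loop learns the task, the outer loop learns how to learn it, and scaling up the outer loop is where the meta-learning/NAS/AI-GA bet lies.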
The upper limit we should anchor on is fully automated research. This helps illustrate how powerful this could be, since automating research could easily give a speed-up of many orders of magnitude (e.g. just consider the physical limitation of humans manually inputting information about what experiment to run).
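To put a rough number on “many orders of magnitude”, here is a back-of-envelope sketch; the per-day and per-decision figures are purely illustrative assumptions on my part, not measurements:

```python
# Back-of-envelope only: every number below is an illustrative assumption.
SECONDS_PER_DAY = 24 * 60 * 60

human_decisions_per_day = 10      # assumed: a researcher specs and launches ~10 experiments per day
automated_latency_seconds = 1.0   # assumed: an automated loop picks the next experiment in ~1 s

automated_decisions_per_day = SECONDS_PER_DAY / automated_latency_seconds
speedup = automated_decisions_per_day / human_decisions_per_day

print(f"automated loop: {automated_decisions_per_day:,.0f} experiment decisions per day")
print(f"speedup over the human-in-the-loop rate: ~{speedup:,.0f}x (close to four orders of magnitude)")
```

This ignores how long the experiments themselves take; the point is just that the human-in-the-loop step alone is a large bottleneck, before we even count running experiments in parallel.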
An important underlying question is how much room there is for improvement over current techniques. The idea that current DL techniques are pretty close to perfect (i.e. that we’ve uncovered the fundamental principles of efficient learning, with the associated view that maybe DNNs are a good model of the brain) seems to be implicit, too often, in some of the discussions around forecasting and scaling. I think it’s a real possibility, but fairly unlikely (~15%, off the top of my head). The main evidence for it is that 99% of published improvements don’t seem to make much difference in practice / at scale.
Assuming that current methods are roughly optimal has two important implications:
- no new fundamental breakthroughs needed for AGI (faster timelines)
- no possible acceleration from fundamental algorithmic breakthroughs (slower timelines)
Sure, but in what way?
Also I’d be happy to do a quick video chat if that would help (PM me).