Yeah, that’s the problem with ChatGPT: it’s so easy to use, and so good within its niche, that we’re right back to 2020 where everyone is trying the first thing that comes to mind and declaring that GPT is busted if it doesn’t (won’t) do it. Heck, ChatGPT doesn’t even let you set the temperature! (This is on top of the hidden prompt, the RLHF, the unknown history mechanism, and what seems to be at least one additional layer of filtering, like the string-matching that the DALL-E 2 service uses to censor and to decide when to apply its ‘diversity prompt’.) ‘Deep learning is hitting a wall’ etc...
Just remember anytime anyone uses ChatGPT to declare that “DL can’t X”: “sampling can show the presence of knowledge but not the absence.”
Python is special in that there’s a ton of it as data, and so it’s probably the single best language in any code-generating model. Using the special-purpose languages will cost you, making it a classic bias-variance tradeoff. That data richness won’t go away until you have full bootstrapping systems—not the fairly limited one-step self-distillation or exploration that we have now (eg they can generate Python puzzles to train themselves but the bootstrap seems to only work once or twice).
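To make that ‘generate puzzles to train themselves’ loop concrete, here is a minimal sketch of one bootstrap round; the model.sample_puzzles / model.sample_solutions / model.finetune calls are hypothetical stand-ins rather than any particular paper’s API, and the only real verifier in the loop is the Python interpreter itself:

```python
# Toy sketch of one round of self-distillation on self-generated Python puzzles.
# `model` is assumed to expose sample_puzzles/sample_solutions/finetune methods;
# these are hypothetical stand-ins, not a real library's API. The only ground
# truth is the interpreter: a puzzle is a snippet defining f(answer) -> bool,
# and a solution is a snippet defining `answer`.

def check(puzzle_src: str, solution_src: str) -> bool:
    """Execute puzzle + candidate solution; any exception or falsy check = fail."""
    env: dict = {}
    try:
        exec(puzzle_src, env)            # defines the verifier f
        exec(solution_src, env)          # defines the candidate `answer`
        return bool(env["f"](env["answer"]))
    except Exception:
        return False

def bootstrap_round(model, n_puzzles: int = 10_000, k_solutions: int = 8):
    verified = []
    for puzzle in model.sample_puzzles(n_puzzles):              # model invents its own puzzles
        for solution in model.sample_solutions(puzzle, k_solutions):
            if check(puzzle, solution):                          # keep only interpreter-verified pairs
                verified.append((puzzle, solution))
                break
    return model.finetune(verified)                              # distill the verified pairs back in

# Empirically, iterating bootstrap_round seems to pay off only once or twice
# before the gains flatten out -- hence 'fairly limited one-step self-distillation'.
```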
It’s not easy, but I think we have lots of ideas and I already covered them in my final bullet point.
There’s a rich vein of ideas in reinforcement learning & decision theory & evolutionary computation about how to do novelty search, or how to meta-learn exploration/intrinsic-curiosity drives which balance explore-exploit. If you train models based on reward and do things like evolutionary selection, you can develop models which have ‘taste’: they take short-term actions which obtain information or useful states for eventually maximizing final reward across a very wide set of environments. There are a lot of links about that in my list there, like Clune’s manifesto.
We have plenty of ideas about how you would do this; for example, here’s an evolution-strategies approach that would meta-learn exploration: ‘For each GPU, initialize a random set of axioms in a formal system(s), specify a random distant target theorem(s); apply random mutations to a large language model pretrained extensively on math corpora to do tree search (or whatever ATP approach you prefer); do that tree search for a fixed time budget; keep the X% which successfully find their target theorem; repeat until toasted golden-brown and delicious’. This would gradually select for models which embody optimal time-limited search in a very large space of possible formal systems.
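In (toy) code, one generation of that loop might look like the sketch below; sample_task, prove, and mutate are hypothetical stand-ins for whatever formal-system/theorem generator, LLM-guided tree-search prover, and weight-mutation operator you actually plug in:

```python
import random

def es_generation(population, sample_task, prove, mutate,
                  n_tasks: int = 64, time_budget_s: float = 600.0,
                  keep_frac: float = 0.1):
    """One evolution-strategies generation over mutated prover models.

    sample_task() -> (axioms, target_theorem): a random formal system plus a distant target.
    prove(model, axioms, target, budget) -> bool: LLM-guided tree search under a hard time cap.
    mutate(model) -> model: random perturbation of the pretrained math LLM.
    All three are hypothetical stand-ins, not a real ATP or ES library.
    """
    scored = []
    for model in population:
        wins = sum(prove(model, *sample_task(), time_budget_s) for _ in range(n_tasks))
        scored.append((wins, model))
    # Keep the top keep_frac of models by theorems proven within the time budget...
    scored.sort(key=lambda pair: pair[0], reverse=True)
    survivors = [m for _, m in scored[: max(1, int(keep_frac * len(scored)))]]
    # ...and refill the population by mutating survivors; repeat until golden-brown.
    # What survives over many generations is whatever search 'taste' generalizes
    # across a very large space of unseen formal systems.
    return [mutate(random.choice(survivors)) for _ in range(len(population))]
```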
To show that it’s not merely theoretical Schmidhuber-esque musing, consider the recent very impressive success in meta-reinforcement-learning which I don’t think has been discussed on LW yet, and which many people probably still believe impossible in principle*: “VeLO: Training Versatile Learned Optimizers by Scaling Up”, Metz et al 2022 (Twitter: https://twitter.com/ada_rob/status/1593702507422912516)
While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers. In this work, we leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
Meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks, our optimizer not only exhibits compelling performance, but optimizes in interesting and unexpected ways. It requires no hyperparameter tuning, instead automatically adapting to the specifics of the problem being optimized.
We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive optimizer benchmark suite with baselines at this URL.
Here is a real-world 🚲 example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as >5x steps of Adafactor and takes only ~1.5x as long per step in wall time 🕓...a >3x speedup⏫!! Pretty amazing for such an OOD task!
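To unpack what ‘an optimizer which is itself a small neural network that ingests gradients and outputs parameter updates’ means mechanically, here is a toy numpy illustration of just the interface; it is emphatically not VeLO’s actual architecture, input features, or meta-trained weights (this one’s weights are random):

```python
import numpy as np

class ToyLearnedOptimizer:
    """Toy stand-in for a learned optimizer: a tiny per-parameter MLP that maps
    (gradient, momentum) features to an update. Illustrative only -- in VeLO the
    network, its features, and its weights (meta-trained over thousands of
    TPU-months of tasks) are far more elaborate; here the weights are random."""

    def __init__(self, hidden: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (2, hidden))   # would come from meta-training
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.momentum = {}

    def step(self, params: dict, grads: dict, beta: float = 0.9) -> dict:
        new_params = {}
        for name, p in params.items():
            g = grads[name]
            m = beta * self.momentum.get(name, np.zeros_like(g)) + (1 - beta) * g
            self.momentum[name] = m
            feats = np.stack([g.ravel(), m.ravel()], axis=-1)   # per-parameter features
            update = (np.tanh(feats @ self.w1) @ self.w2).reshape(p.shape)
            new_params[name] = p + 1e-3 * update                # arbitrary toy output scale
        return new_params

# Usage inside an ordinary training loop: params = opt.step(params, grads).
# The point is that no hand-designed update rule or hyperparameter schedule is
# left to tune -- the update rule itself is the (meta-)learned object.
```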
(I’ll note in passing that “Here is an amazing DL PoC that neither you nor anyone else has heard of yet, but already does the thing you are speculating that AI might some day do in 2040 after another dozen paradigm shifts” describes a lot of DL research in 2022; there’s so much going on right now, and the people who should be pulling 16-hour days reading Arxiv are often distracted by things like FTX or Elon Musk, or just the image generation stuff, so there are many things falling through the cracks. But 5 years from now, you won’t care how much time you spent following Twitter drama, and you will care about missing out on things like Dramatron or CICERO or U-PaLM or RT-1.)
* Given some of the comments on my Clippy story, anyway...
That paper is insane…you’re finding this stuff just by trawling through Arxiv, or through some other method?
Twitter + Reddit + Google Scholar Alerts on key researchers. Honestly, it’s too much, I am many months behind in my reading.