Status: rough thoughts inspired by skimming this post (want to read in more detail soon!)
Do you think that hand-crafted mathematical functions (potentially slightly more complex than the ones used in this research) could be a promising testbed for various alignment techniques? Doing prosaic alignment research with LLMs or huge RL agents is very compute- and data-hungry, which makes the process slower and more expensive. I wonder whether there is a way to investigate similar questions with carefully crafted exact functions, which can generate enough data quickly, scale down to smaller models, and be tweaked in different ways to adjust the experiments.
One rough idea I have is to train a single DNN to implement different simple functions on different sets of numbers, and then see how the model generalises OOD under different training methods / alignment techniques.
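To make this a bit more concrete, here is a minimal sketch of the kind of setup I have in mind. Everything here is an illustrative assumption on my part (the piecewise target function, the input ranges, the MLP size, the hyperparameters), not a worked-out proposal: train on one input region, then measure how badly the model extrapolates outside it, and compare that gap across training methods.

```python
# Hypothetical sketch (all function choices, ranges, and hyperparameters are
# illustrative assumptions): train a small MLP to compute a simple piecewise
# function on an in-distribution input range, then probe OOD generalisation.
import torch
import torch.nn as nn

def target(x):
    # Piecewise "ground truth": one simple rule per input region.
    return torch.where(x < 0, x ** 2, 2 * x + 1)

def sample_batch(low, high, n=256):
    # Uniformly sample inputs from [low, high) and label them with the target.
    x = torch.rand(n, 1) * (high - low) + low
    return x, target(x)

model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train only on x in [-2, 2]; this is the in-distribution region.
for step in range(5000):
    x, y = sample_batch(-2.0, 2.0)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Evaluate OOD on inputs well outside the training range. Different training
# methods / alignment-inspired interventions would be compared on this gap.
with torch.no_grad():
    x_ood, y_ood = sample_batch(4.0, 8.0)
    print("OOD MSE:", loss_fn(model(x_ood), y_ood).item())
```

The appeal is that the target function is exact, so data is free, the model can be tiny, and the "intended behaviour" vs. "learned behaviour" distinction is unambiguous and easy to tweak.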
I’d personally be somewhat surprised if that were particularly useful. I think there are a bunch of features of the alignment problem that you just don’t get with smaller models (let alone algorithmic tasks), e.g. the model’s ability to understand what alignment even is. Maybe you could get some juice out of it? But knowing that a technique works to “align” an algorithmic problem would feel like very weak evidence that it works on a real problem.
Makes sense. I agree that something working on algorithmic tasks is very weak evidence, although I am somewhat interested in how much insight we can get if we put more effort into hand-crafting algorithmic tasks with interesting properties.
I found your idea fascinating. You are in good company, too: Percy Liang’s group just published a paper along this line of thought, showing that transformers can effectively learn “ML trainers” in-context:
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
https://arxiv.org/abs/2208.01066