Tom Lieberum comments on Hypothesis: gradient descent prefers general circuits

Tom Lieberum 11 Feb 2022 12:21 UTC
LW: 3 AF: 3
AF
So I ran some experiments for the permutation group S_5 with the task x o y = ?

Interestingly, increasing the learning rate never works here. I’m very confused.
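
For readers who want to reproduce the setup, here is a minimal sketch of how the S_5 task x o y = ? could be turned into a dataset. It is not the code used for the experiments above. The composition convention (apply y first, then x), the 50% train fraction, and the names `ELEMENTS`, `INDEX`, `compose`, and `make_dataset` are all assumptions made for illustration.

```python
from itertools import permutations
import random

# All 120 elements of S_5, each stored as a tuple p where p[i] is the image of i.
ELEMENTS = list(permutations(range(5)))
INDEX = {p: i for i, p in enumerate(ELEMENTS)}


def compose(x, y):
    """Return x o y, read as 'apply y first, then x' (assumed convention)."""
    return tuple(x[y[i]] for i in range(len(y)))


def make_dataset(train_fraction=0.5, seed=0):
    """Build every (x, y, x o y) triple of element indices and split at random.

    train_fraction is a hypothetical default, not a value from the thread.
    """
    triples = [
        (INDEX[x], INDEX[y], INDEX[compose(x, y)])
        for x in ELEMENTS
        for y in ELEMENTS
    ]
    random.Random(seed).shuffle(triples)
    cut = int(train_fraction * len(triples))
    return triples[:cut], triples[cut:]


train, val = make_dataset()
```

Each triple gives the two input tokens and the target token, so the full table has 120 × 120 = 14,400 entries, and the model only ever sees the training part of it.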
- Rohin Shah 11 Feb 2022 13:16 UTC
  LW: 3 AF: 3
  AF
  Also interestingly, in the default setting for these new experiments, grokking happens in ~1000 steps while memorization happens in ~1500 steps, so the grokking is already faster than the memorization, in stark contrast to the graphs in the original post.
  (This does depend on when you start the counter for grokking, as there’s a long period of slowly increasing validation accuracy. You could reasonably say grokking took ~2500 steps.)
  - Tom Lieberum 11 Feb 2022 13:45 UTC
    LW: 3 AF: 2
    AF
    Oh, I thought Figure 1 was S_5, but it is actually modular division. I’ll give that a go.
    
    Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing lr to 5x from the beginning works super well, but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work, and neither does switching to 10x later.
    
    So maybe the initial observation is more a general/global property of the loss landscape for the task and not of the particular region during grokking?
    - Rohin Shah 12 Feb 2022 9:40 UTC
      LW: 3 AF: 1
      AF
      Yeah, that seems right. I think I’m basically at “no, you can’t just 10x the learning rate once grokking starts”.
    - gwern 12 Feb 2022 1:09 UTC
      3 points
      Increasing regularization (weight decay in this instance) might rescue the runs that don’t work.
      - Tom Lieberum 12 Feb 2022 10:57 UTC
        1 point
        I tried increasing weight decay and increasing the batch size, but so far with no real success compared to 5x lr. Not going to investigate this further atm.
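
The two learning-rate interventions compared in this thread (a multiplier applied from the very first step, versus a switch once grokking arguably starts) can be written as one small schedule function. This is only a sketch of the comparison, not anyone's actual training code. The function name `lr_at_step`, the base learning rate of 1e-3, and the example switch step are hypothetical.

```python
def lr_at_step(step, base_lr=1e-3, multiplier=5.0, switch_step=None):
    """Learning rate to use at a given optimizer step.

    switch_step=None applies the multiplier from the very first step.
    Otherwise base_lr is used before switch_step and base_lr * multiplier
    from switch_step onward. All default values are illustrative assumptions.
    """
    if switch_step is None or step >= switch_step:
        return base_lr * multiplier
    return base_lr


# 5x from the start: the condition reported above to work well.
from_start = [lr_at_step(s) for s in (0, 500, 2000)]

# 5x only after a (hypothetical) step 1000: the kind of late switch
# reported above to destroy progress.
late_switch = [lr_at_step(s, switch_step=1000) for s in (0, 500, 2000)]
```

In a PyTorch training loop, one way to apply such a schedule would be to assign the returned value to each entry of `optimizer.param_groups` under the `"lr"` key before every step.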