Adam Jermyn comments on Multi-Component Learning and S-Curves

Adam Jermyn 1 Dec 2022 18:02 UTC
LW: 1 AF: 1
I don’t, but here’s my best guess: there’s a sense in which the learned vectors compete over which of them captures which parts of the target span.
As a toy example, suppose there are two vectors, $a_{1}$ and $a_{2}$, such that the closest target vector to each of these at initialization is $c$. Then both vectors might grow towards $c$. At some point $c$ is represented enough in the span, and it’s not optimal for two vectors to both play the role of representing $c$, so it becomes optimal for at least one of them to shift to cover other target vectors more.
For example, from a rank-4 case with a bump, here’s the inner product with a single target vector of two learned vectors:
So both vectors grow towards a single target, and the blue one starts realigning towards a different target as the orange one catches up.
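To make the competition story concrete, here is a minimal sketch of how one might track these inner products. It is not the setup from the post: the symmetric loss $\|AA^T - CC^T\|_F^2$, the orthonormal targets, the function name `simulate`, and all hyperparameters are assumptions chosen for illustration. Whether a bump shows up will depend on the seed and initialization.

```python
import numpy as np

def simulate(dim=8, rank=2, steps=2000, lr=0.01, init_scale=0.01, seed=0):
    """Gradient descent on ||A A^T - C C^T||_F^2 (an assumed toy loss).

    Columns of C are orthonormal target vectors c_j; columns of A are the
    learned vectors a_i. Returns the inner products a_i . c_j at every step.
    """
    rng = np.random.default_rng(seed)
    # Orthonormal targets from a QR decomposition of a random matrix.
    C = np.linalg.qr(rng.normal(size=(dim, rank)))[0]
    target = C @ C.T
    # Small random initialization of the learned vectors.
    A = init_scale * rng.normal(size=(dim, rank))

    history = np.empty((steps, rank, rank))
    for t in range(steps):
        history[t] = A.T @ C            # history[t, i, j] = a_i . c_j
        residual = A @ A.T - target     # symmetric, so the gradient is 4 R A
        A -= lr * 4.0 * residual @ A
    return history, A, C

if __name__ == "__main__":
    history, A, C = simulate()
    # Alignment of both learned vectors with the single target c_0 over time;
    # a non-monotonic curve here would be a "bump" in the sense above.
    for step in (0, 50, 100, 200, 500, 1999):
        print(step, history[step, :, 0])
```

Plotting `history[:, 0, 0]` and `history[:, 1, 0]` against the step index gives the analogue of the two curves described above, under these assumptions.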
Two more weak pieces of evidence in favor of this story:
1. We only ever see this bump when the rank is greater than 1.
2. From visual inspection, bumps are more likely to peak at higher levels of alignment than at lower levels, and they don’t happen at all in the initial norm-decay phase, suggesting that the bump is associated with vectors growing (rather than decaying).
- LawrenceC 1 Dec 2022 20:22 UTC
  LW: 2 AF: 2
  Oh, huh, that makes a lot of sense! I’ll see if I can reproduce these results.
  > For example, from a rank-4 case with a bump, here’s the inner product with a single target vector of two learned vectors.
  I’m not sure this explains the grokking bumps from the mod add stuff, though; I’m not sure what the “competition” should be, given that we see the bumps on every key frequency.
  - Adam Jermyn 1 Dec 2022 20:48 UTC
    LW: 1 AF: 1
    I’d be very excited to see a reproduction :-)