I agree with both of your rephrasings and I think both add useful intuition!
Regarding rank 2, I don’t see any difference in behavior from rank 1 other than the “bump” in alignment that Lawrence mentioned. Here’s an example:
This doesn’t happen in all rank-2 cases but is relatively common. I think usually each vector grows primarily towards one or the other target. If two vectors grow towards the same target, then you get this bump where one of them has to back off and align more towards a different target [at least that’s my current understanding, see my reply to Lawrence for more detail!].
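If it helps to make that concrete, here's a minimal sketch of the kind of experiment I mean (made-up shapes and hyperparameters, not the exact code behind these plots): fit a rank-2 factorization to a fixed rank-2 target with MSE and watch how each learned column's alignment with the target's singular directions evolves.

```python
import torch

torch.manual_seed(0)
n, rank = 20, 2

# Fixed rank-2 target C; its left singular vectors are the "targets"
# that the learned columns compete to align with.
C = torch.randn(n, rank) @ torch.randn(rank, n)
targets = torch.linalg.svd(C).U[:, :rank]

# Small initialization so the vectors have to grow towards the targets.
U = torch.nn.Parameter(0.1 * torch.randn(n, rank))
V = torch.nn.Parameter(0.1 * torch.randn(n, rank))
opt = torch.optim.SGD([U, V], lr=0.05)

for step in range(5000):
    opt.zero_grad()
    loss = ((U @ V.T - C) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 250 == 0:
        # |cos| between each learned column of U and each singular direction of C;
        # the "bump" would show up as two columns chasing the same direction
        # before one of them backs off.
        align = torch.nn.functional.normalize(U, dim=0).T @ targets
        print(step, round(loss.item(), 4), align.abs().detach().numpy().round(2))
```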
What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I’m confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn’t there?
What does a cross-entropy setup look like here? I’m just not sure how to map this toy model onto that loss (or vice-versa).
How does this interact with weight decay? This seems to give an intrinsic exponential decay to everything
Agreed! I expect weight decay to (1) make the converged solution not actually minimize the original loss (because the weight decay keeps tugging it towards lower norms) and (2) accelerate the initial decay. I don’t think I expect any other changes.
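Concretely, in the hypothetical sketch above this is just (with a made-up decay coefficient):

```python
# Weight decay folded into the optimizer from the earlier sketch. At convergence the
# MSE gradient balances the decay term (grad = -wd * param) rather than vanishing,
# so U @ V.T ends up slightly shrunk relative to C instead of matching it exactly.
opt = torch.optim.SGD([U, V], lr=0.05, weight_decay=1e-3)
```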
How does this interact with softmax? Intuitively, softmax feels “S-curve-ey”
I’m not sure! Do you have a setup in mind?
How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things. Even worse, how does it interact with AdamW?
I agree this breaks my theoretical intuition. Experimentally most of the phenomenology is the same, except that the full-rank (rank 100) case regains a plateau.
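(In terms of the hypothetical sketch above, the change here is just swapping the optimizer, e.g.:)

```python
# The earlier sketch with Adam in place of SGD (learning rate is again made up).
opt = torch.optim.Adam([U, V], lr=1e-3)
```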
Here’s rank 2:
rank 10:
(maybe there’s more ‘bump’ formation here than with SGD?)
rank 100:
It kind of looks like the plateau has returned! And this replicates across every rank 100 example I tried, e.g.
The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?).
I’d be curious if you have any intuition about this!
The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?).
I caution against over-interpreting the results of single runs—I think there’s a good chance the number of bumps varies significantly by random seed.
What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I’m confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn’t there?
There’s lots of ways to do this, but the obvious way is to flatten C and Z and treat them as logits.
I caution against over-interpreting the results of single runs—I think there’s a good chance the number of bumps varies significantly by random seed.
It’s a good caution, but I do see more bumps with Adam than with SGD across a number of random initializations.
(with the caveat that this is still “I tried a few times” and not any quantitative study)
There’s lots of ways to do this, but the obvious way is to flatten C and Z and treat them as logits.
Something like this?
Well, I’d keep everything in log space and do the whole thing with log_sum_exp for numerical stability, but yeah.
EDIT: e.g. something like:
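(a rough sketch of the idea, assuming the flattened C gets softmaxed into a probability vector and the flattened Z is used as raw logits)

```python
import torch

def xent_loss(Z, C):
    # Flatten both matrices: Z becomes a vector of logits, C a target distribution.
    z = Z.flatten()
    p = torch.softmax(C.flatten(), dim=0)
    # Cross-entropy in log space: log_softmax written out with logsumexp.
    log_q = z - torch.logsumexp(z, dim=0)
    return -(p * log_q).sum()
```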
Erm, do C and Z have to be valid normalized probabilities for this to work?
C needs to be probabilities, yeah. Z can be any vector of numbers. (You can convert C into probabilities with softmax)
So indeed with cross-entropy loss I see two plateaus! Here’s rank 2:
(note that I've offset the loss so that equality of Z and C is zero loss)
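(Assuming the flattened cross-entropy sketched above, that offset is just the entropy of the target distribution, which turns the loss into a KL divergence:)

```python
# Cross-entropy minus the target entropy, i.e. KL(p || softmax(z)); this is zero
# exactly when the softmax of the flattened Z matches the target probabilities.
def offset_xent_loss(Z, C):
    p = torch.softmax(C.flatten(), dim=0)
    log_q = torch.log_softmax(Z.flatten(), dim=0)
    return (p * (p.log() - log_q)).sum()
```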
I have trouble getting rank 10 to find the zero-loss solution:
But the phenomenology at full rank is unchanged: