I agree with both of your rephrasings and I think both add useful intuition!
Regarding rank 2, I don’t see any difference in behavior from rank 1 other than the “bump” in alignment that Lawrence mentioned. Here’s an example:
This doesn’t happen in all rank-2 cases but is relatively common. I think usually each vector grows primarily towards one or the other target. If two vectors grow towards the same target, then you get this bump where one of them has to back off and align more towards a different target [at least that’s my current understanding, see my reply to Lawrence for more detail!].
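If it helps to make that concrete, here's a minimal sketch of the kind of experiment I mean (made-up shapes and hyperparameters, not the exact code behind these plots): fit a rank-2 factorization to a fixed rank-2 target with MSE and watch how each learned column's alignment with the target's singular directions evolves.

```python
import torch

torch.manual_seed(0)
n, rank = 20, 2

# Fixed rank-2 target C; its left singular vectors are the "targets"
# that the learned columns compete to align with.
C = torch.randn(n, rank) @ torch.randn(rank, n)
targets = torch.linalg.svd(C).U[:, :rank]

# Small initialization so the vectors have to grow towards the targets.
U = torch.nn.Parameter(0.1 * torch.randn(n, rank))
V = torch.nn.Parameter(0.1 * torch.randn(n, rank))
opt = torch.optim.SGD([U, V], lr=0.05)

for step in range(5000):
    opt.zero_grad()
    loss = ((U @ V.T - C) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 250 == 0:
        # |cos| between each learned column of U and each singular direction of C;
        # the "bump" would show up as two columns chasing the same direction
        # before one of them backs off.
        align = torch.nn.functional.normalize(U, dim=0).T @ targets
        print(step, round(loss.item(), 4), align.abs().detach().numpy().round(2))
```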
What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I’m confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn’t there?
What does a cross-entropy setup look like here? I’m just not sure how to map this toy model onto that loss (or vice-versa).
How does this interact with weight decay? This seems to give an intrinsic exponential decay to everything
Agreed! I expect weight decay to (1) make the converged solution not actually minimize the original loss (because the weight decay keeps tugging it towards lower norms) and (2) accelerate the initial decay. I don’t think I expect any other changes.
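Concretely, in the hypothetical sketch above this is just (with a made-up decay coefficient):

```python
# Weight decay folded into the optimizer from the earlier sketch. At convergence the
# MSE gradient balances the decay term (grad = -wd * param) rather than vanishing,
# so U @ V.T ends up slightly shrunk relative to C instead of matching it exactly.
opt = torch.optim.SGD([U, V], lr=0.05, weight_decay=1e-3)
```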
How does this interact with softmax? Intuitively, softmax feels “S-curve-ey”
I’m not sure! Do you have a setup in mind?
How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things. Even worse, how does it interact with AdamW?
I agree this breaks my theoretical intuition. Experimentally most of the phenomenology is the same, except that the full-rank (rank 100) case regains a plateau.
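(In terms of the hypothetical sketch above, the change here is just swapping the optimizer, e.g.:)

```python
# The earlier sketch with Adam in place of SGD (learning rate is again made up).
opt = torch.optim.Adam([U, V], lr=1e-3)
```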
Here’s rank 2:
rank 10:
(maybe there’s more ‘bump’ formation here than with SGD?)
rank 100:
It kind of looks like the plateau has returned! And this replicates across every rank 100 example I tried, e.g.
The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?).
I’d be curious if you have any intuition about this!
The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?).
I caution against over-interpreting the results of single runs—I think there’s a good chance the number of bumps varies significantly by random seed.
What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I’m confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn’t there?
There’s lots of ways to do this, but the obvious way is to flatten C and Z and treat them as logits.
I caution against over-interpreting the results of single runs—I think there’s a good chance the number of bumps varies significantly by random seed.
It’s a good caution, but I do see more bumps with Adam than with SGD across a number of random initializations.
(with the caveat that this is still “I tried a few times” and not any quantitative study)
There’s lots of ways to do this, but the obvious way is to flatten C and Z and treat them as logits.
Something like this?
Well, I’d keep everything in log space and do the whole thing with log_sum_exp for numerical stability, but yeah.
EDIT: e.g. something like:
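(a rough sketch of the idea, assuming the flattened C gets softmaxed into a probability vector and the flattened Z is used as raw logits)

```python
import torch

def xent_loss(Z, C):
    # Flatten both matrices: Z becomes a vector of logits, C a target distribution.
    z = Z.flatten()
    p = torch.softmax(C.flatten(), dim=0)
    # Cross-entropy in log space: log_softmax written out with logsumexp.
    log_q = z - torch.logsumexp(z, dim=0)
    return -(p * log_q).sum()
```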
Erm, do C and Z have to be valid normalized probabilities for this to work?
C needs to be probabilities, yeah. Z can be any vector of numbers. (You can convert C into probabilities with softmax)
So indeed with cross-entropy loss I see two plateaus! Here’s rank 2:
(note that I've offset the loss so that equality of Z and C is zero loss)
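(Assuming the flattened cross-entropy sketched above, that offset is just the entropy of the target distribution, which turns the loss into a KL divergence:)

```python
# Cross-entropy minus the target entropy, i.e. KL(p || softmax(z)); this is zero
# exactly when the softmax of the flattened Z matches the target probabilities.
def offset_xent_loss(Z, C):
    p = torch.softmax(C.flatten(), dim=0)
    log_q = torch.log_softmax(Z.flatten(), dim=0)
    return (p * (p.log() - log_q)).sum()
```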
I have trouble getting rank 10 to find the zero-loss solution:
But the phenomenology at full rank is unchanged: