Have you experimented with subtracting aᵀa⋅bᵀb from the loss? It seems to me that doing so would get rid of the second term and allow the model to learn the correct vectors from the beginning.
That’s not a scalar; do you mean the trace of that? If so, doesn’t that just eliminate the term that causes the incorrect initialization to decay?
Sorry, I meant ⟨a,a⟩⋅⟨b,b⟩. And yes, that should eliminate the term that causes the incorrect initialization to decay. Doesn’t that mean the learning is in the correct direction from the start?
I don’t think so? I think that just means you keep the incorrect initialization around while also learning the correct direction.
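A minimal numerical sketch of the point in this last reply, under an assumption the thread itself doesn’t state: that the loss in question is the rank-1 factorization loss L(a, b) = ‖T − abᵀ‖²_F, which expands to ‖T‖²_F − 2aᵀTb + ⟨a,a⟩⋅⟨b,b⟩. The target T, learning rate, step count, and variable names below are all illustrative. With the ⟨a,a⟩⋅⟨b,b⟩ term subtracted, the gradient has no component along the directions of a and b orthogonal to the target, so the random part of the initialization never decays, while the aligned (correct) direction still grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
u = np.zeros(d); u[0] = 1.0          # target left direction (illustrative)
v = np.zeros(d); v[0] = 1.0          # target right direction (illustrative)
T = np.outer(u, v)                   # rank-1 target

def grads(a, b, subtract_norm_term):
    # d/da ‖T − abᵀ‖² = −2Tb + 2a⟨b,b⟩ ; analogously for b.
    ga = -2 * T @ b
    gb = -2 * T.T @ a
    if not subtract_norm_term:
        ga = ga + 2 * a * (b @ b)    # gradient of ⟨a,a⟩⋅⟨b,b⟩ w.r.t. a
        gb = gb + 2 * b * (a @ a)
    return ga, gb

for subtract in (False, True):
    a = 0.1 * rng.standard_normal(d) # random ("incorrect") initialization
    b = 0.1 * rng.standard_normal(d)
    lr, steps = 0.01, 1500
    for _ in range(steps):
        ga, gb = grads(a, b, subtract)
        a -= lr * ga
        b -= lr * gb
    aligned = abs(a @ u)                          # correct direction
    orthogonal = np.linalg.norm(a - (a @ u) * u)  # leftover initialization
    print(f"subtract ⟨a,a⟩⋅⟨b,b⟩ = {subtract}: "
          f"|a·u| = {aligned:.3g}, ‖a_perp‖ = {orthogonal:.3g}")
```

With the norm term kept, ‖a_perp‖ decays toward zero; with it subtracted, ‖a_perp‖ stays at its initial size while |a·u| grows without bound, which is the "incorrect initialization kept around while also learning the correct direction" behavior described in the reply above.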