A thread into which I’ll occasionally post notes on some ML(?) papers I’m reading
I think the world would probably be much better if everyone made a bunch more of their notes public. I intend to occasionally copy some personal notes on ML(?) papers into this thread. While I hope that the notes which I’ll end up selecting for being posted here will be of interest to some people, and that people will sometimes comment with their thoughts on the same paper and on my thoughts (please do tell me how I’m wrong, etc.), I expect that the notes here will not be significantly more polished than typical notes I write for myself and my reasoning will be suboptimal; also, I expect most of these notes won’t really make sense unless you’re also familiar with the paper — the notes will typically be companions to the paper, not substitutes.
I expect I’ll sometimes be meaner than some norm somewhere in these notes (in fact, I expect I’ll sometimes be simultaneously mean and wrong/confused — exciting!), but I should just say to clarify that I think almost all ML papers/posts/notes are trash, so me being mean to a particular paper might not be evidence that I think it’s worse than some average. If anything, the papers I post notes about had something worth thinking/writing about at all, which seems like a good thing! In particular, they probably contained at least one interesting idea!
So, anyway: I’m warning you that the notes in this thread will be messy and not self-contained, and telling you that reading them might not be a good use of your time :)
The Deep Neural Feature Ansatz
@misc{radhakrishnan2023mechanism,
  title={Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features},
  author={Adityanarayanan Radhakrishnan and Daniel Beaglehole and Parthe Pandit and Mikhail Belkin},
  year={2023},
  url={https://arxiv.org/pdf/2212.13881.pdf}
}
The ansatz from the paper
Let $h_i(x) \in \mathbb{R}^k$ denote the activation vector in layer $i$ on input $x \in \mathbb{R}^d$, with the input layer being at index $i=1$, so $h_1(x)=x$. Let $W_i$ be the weight matrix after activation layer $i$. Let $f_i$ be the function that maps from the $i$th activation layer to the output. Then their Deep Neural Feature Ansatz says that
$$W_i^T W_i \mathrel{\underset{\sim}{\propto}} \frac{1}{|D|} \sum_{x \in D} \nabla f_i(h_i(x)) \, \nabla f_i(h_i(x))^T.$$
(I’m somewhat confused here about them not mentioning the loss function at all — are they claiming this is reasonable for any reasonable loss function? Maybe just MSE? MSE seems to be the only loss function mentioned in the paper; I think they leave the loss unspecified in a bunch of places though.)
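To make the two sides of the ansatz concrete, here is a minimal NumPy sketch that computes both for $i=1$ in a tiny ReLU network. Everything specific in it is my assumption, not the paper's setup: the sizes, the random untrained weights, the Gaussian toy data, the linear readout `a`, and the use of matrix cosine similarity as a scale-invariant way to compare the two sides. Since the network is untrained, the printed number says nothing about whether the ansatz holds; the point is only to pin down what is being compared.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 8, 200                          # input dim, hidden width, dataset size (all made up)
W1 = rng.normal(size=(k, d)) / np.sqrt(d)    # W_1: weights applied to h_1(x) = x
W2 = rng.normal(size=(k, k)) / np.sqrt(k)    # W_2: weights applied to h_2(x)
a = rng.normal(size=k) / np.sqrt(k)          # linear readout from h_3(x) (assumed)
X = rng.normal(size=(n, d))                  # toy dataset D

def f(x):
    """Scalar network output f_1(h_1(x)) with h_1(x) = x."""
    return a @ np.maximum(W2 @ np.maximum(W1 @ x, 0.0), 0.0)

def grad_f(x):
    """Gradient of f with respect to x, by manual backpropagation."""
    p2 = W1 @ x                              # preactivations of layer 2
    p3 = W2 @ np.maximum(p2, 0.0)            # preactivations of layer 3
    g_p3 = a * (p3 > 0)                      # gradient wrt p_3
    g_p2 = (W2.T @ g_p3) * (p2 > 0)          # gradient wrt p_2
    return W1.T @ g_p2                       # gradient wrt h_1 = x

def agop(grads):
    """(1/|D|) * sum over x of grad grad^T, for gradients stacked as rows."""
    return grads.T @ grads / grads.shape[0]

def cosine(A, B):
    """Cosine similarity of two matrices viewed as flat vectors (scale-invariant)."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

G = np.stack([grad_f(x) for x in X])         # shape (n, d)
M = agop(G)                                  # RHS of the ansatz for i = 1
lhs = W1.T @ W1                              # LHS of the ansatz for i = 1
print("cosine(W1^T W1, AGOP) =", cosine(lhs, M))
```

To test the ansatz itself, one would run the same comparison on trained weights and per layer; the cosine similarity is just one possible choice of closeness measure.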
A singular vector version of the ansatz
Letting $W_i = U \Sigma V^T$ be an SVD of $W_i$, we note that this is equivalent to
$$V \Sigma^2 V^T \mathrel{\underset{\sim}{\propto}} \frac{1}{|D|} \sum_{x \in D} \nabla f_i(h_i(x)) \, \nabla f_i(h_i(x))^T,$$
i.e., that the eigenvectors of the matrix $M$ on the RHS are the right singular vectors. By the variational characterization of eigenvectors and eigenvalues (Courant-Fischer or whatever), this is the same as saying that right singular vectors of $W_i$ are the highest orthonormal $v^T M v$ directions for the matrix $M$ on the RHS. Plugging in the definition of $M$, this is equivalent to saying that the right singular vectors are the sequence of highest-variance directions of the data set of gradients $\nabla f_i(h_i(x))$.
(I have assumed here that the linearity is precise, whereas really it is approximate. It’s probably true though that with some assumptions, the approximate initial statement implies an approximate conclusion too? Getting approx the same vecs out probably requires some assumption about gaps in singular values being big enough, because the vecs are unstable around equality. But if we’re happy getting a sequence of orthogonal vectors that gets variances which are nearly optimal, we should also be fine without this kind of assumption. (This is guessing atm.))
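A small numerical sanity check of the exact version of this claim, as a sketch. The "gradients" `G` are just random vectors with made-up scales, and `W` is constructed by hand so that $W^T W = cM$ holds exactly (with an arbitrary orthogonal factor `Q` on the left, which the identity cannot see). Note that "variance" here means the uncentered second moment $v^T M v$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
# Stand-in gradient vectors (rows) with clearly different scales per coordinate.
G = rng.normal(size=(n, d)) * np.array([3.0, 2.0, 1.0, 0.5])
M = G.T @ G / n                               # (1/|D|) sum of g g^T

evals, evecs = np.linalg.eigh(M)              # eigenvalues in ascending order
c = 2.5                                       # arbitrary proportionality constant
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # arbitrary orthogonal matrix
M_sqrt = (evecs * np.sqrt(evals)) @ evecs.T   # symmetric square root of M
W = np.sqrt(c) * Q @ M_sqrt                   # satisfies W^T W = c M exactly

U, S, Vt = np.linalg.svd(W)                   # singular values in descending order

# Right singular vectors vs eigenvectors of M (matched in descending order, up to sign).
align = np.abs(np.sum(Vt.T * evecs[:, ::-1], axis=0))

# Second moment of the gradients along each right singular vector.
moments = np.mean((G @ Vt.T) ** 2, axis=0)

print("alignment:", align)
print("second moments along right singular vectors:", moments)
print("S^2 / c:", S ** 2 / c)
```

The gaps between eigenvalues are large here on purpose; with near-equal eigenvalues the matching of individual vectors becomes unstable, which is the caveat in the parenthetical above.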
Getting rid of the $W_i$ dependence on the RHS?
Assuming there isn’t an off-by-one error in the paper, we can pull some $W_i$ term out of the RHS maybe? This is because applying the chain rule to the Jacobians of the transitions $i \to i+1 \to \text{end}$ gives $\nabla f_i(h_i(x))^T = \nabla f_{i+1}(h_{i+1}(x))^T W_i$, so
$$\frac{1}{|D|} \sum_{x \in D} \nabla f_i(h_i(x)) \, \nabla f_i(h_i(x))^T = \frac{1}{|D|} \sum_{x \in D} W_i^T \, \nabla f_{i+1}(h_{i+1}(x)) \, \nabla f_{i+1}(h_{i+1}(x))^T \, W_i.$$
Wait, so the claim is just
$$W_i^T W_i \mathrel{\underset{\sim}{\propto}} W_i^T \Big( \sum_{x \in D} \nabla f_{i+1}(h_{i+1}(x)) \, \nabla f_{i+1}(h_{i+1}(x))^T \Big) W_i$$
which, assuming $W_i$ is invertible, should be the same as $\sum_{x \in D} \nabla f_{i+1}(h_{i+1}(x)) \, \nabla f_{i+1}(h_{i+1}(x))^T \mathrel{\underset{\sim}{\propto}} I$. But also, they claim that it is $W_{i+1}^T W_{i+1}$? Are they secretly approximating everything with identity matrices?? This doesn’t seem to be the case from their Figure 2 though.
Oh oops I guess I forgot about activation functions here! There should be extra diagonal terms for Jacobians of preactivations $\to$ activations in $\nabla f_i(h_i(x))^T = \nabla f_{i+1}(h_{i+1}(x))^T W_i$, i.e., it should really say $\nabla f_i(h_i(x))^T = \nabla f_{i+1}(h_{i+1}(x))^T D_{i+1}(x) W_i$. We now instead get
$$W_i^T W_i \mathrel{\underset{\sim}{\propto}} W_i^T \Big( \sum_{x \in D} D_{i+1}(x) \, \nabla f_{i+1}(h_{i+1}(x)) \, \nabla f_{i+1}(h_{i+1}(x))^T \, D_{i+1}(x) \Big) W_i.$$
This should be the same as
$$\sum_{x \in D} D_{i+1}(x) \, \nabla f_{i+1}(h_{i+1}(x)) \, \nabla f_{i+1}(h_{i+1}(x))^T \, D_{i+1}(x) \mathrel{\underset{\sim}{\propto}} I$$
which, with $p_i$ denoting preactivations in layer $i$ and $f_{p,i}$ denoting the function from these preactivations to the output, is the same as
$$\sum_{x \in D} \nabla f_{p,i+1}(p_{i+1}(x)) \, \nabla f_{p,i+1}(p_{i+1}(x))^T \mathrel{\underset{\sim}{\propto}} I.$$
This last thing also totally works with activation functions other than ReLU; one can get this directly from the Jacobian calculation. I made the ReLU assumption earlier because I thought for a bit that one can get something further in that case; I no longer think this, but I won’t go back and clean up the presentation atm.
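The Jacobian bookkeeping above is easy to get wrong by an index, so here is a sketch that checks it numerically: the average gradient outer product with respect to $h_i$ should equal $W_i^T (\cdot) W_i$ applied to the one with respect to the preactivations $p_{i+1} = W_i h_i$. All concrete choices are assumptions for illustration: tanh activations (smooth, so finite differences behave, and consistent with the remark that nothing here needs ReLU), random untrained weights, random stand-in activations `H`, and a min/max eigenvalue ratio as a crude isotropy score.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 6, 300
W_i = rng.normal(size=(k, k)) / np.sqrt(k)      # square, so generically invertible
W_next = rng.normal(size=(k, k)) / np.sqrt(k)   # weights of the following layer
a = rng.normal(size=k)                          # linear readout (assumed)
H = rng.normal(size=(n, k))                     # stand-ins for activations h_i(x), x in D

def f_from_h(h):
    """f_i: from the activations of layer i to the output."""
    return a @ np.tanh(W_next @ np.tanh(W_i @ h))

def f_from_p(p):
    """f_{p,i+1}: from the preactivations of layer i+1 to the output."""
    return a @ np.tanh(W_next @ np.tanh(p))

def num_grad(fn, z, eps=1e-6):
    """Central finite-difference gradient of a scalar function fn at z."""
    g = np.zeros_like(z)
    for j in range(z.size):
        e = np.zeros_like(z)
        e[j] = eps
        g[j] = (fn(z + e) - fn(z - e)) / (2 * eps)
    return g

G_h = np.stack([num_grad(f_from_h, h) for h in H])        # gradients wrt h_i
G_p = np.stack([num_grad(f_from_p, W_i @ h) for h in H])  # gradients wrt p_{i+1}
M_h = G_h.T @ G_h / n
M_p = G_p.T @ G_p / n

# Chain rule: M_h = W_i^T M_p W_i, so the ansatz for W_i reduces to M_p being ~ c * I.
print("max |M_h - W_i^T M_p W_i| =", np.abs(M_h - W_i.T @ M_p @ W_i).max())

evals = np.linalg.eigvalsh(M_p)                 # ascending order
isotropy = evals[0] / evals[-1]                 # equals 1 iff M_p is exactly proportional to I
print("isotropy score of preactivation gradients:", isotropy)
```

For a random untrained network there is no reason for the isotropy score to be near 1; the interesting experiment would be to track it on trained models.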
Anyway, a takeaway is that the Deep Neural Feature Ansatz is equivalent to the (imo cleaner) ansatz that the set of gradients of the output wrt the pre-activations of any layer is close to being a tight frame (in other words, the gradients are in isotropic position; in other words still, the data matrix of the gradients is a constant times a semi-orthogonal matrix). (Note that the closeness one immediately gets isn’t in L2 to a tight frame, it’s just in the quantity defining the tightness of a frame, but I’d guess that if it matters, one can also conclude some kind of closeness in L2 from this (related).) This seems like a nicer fundamental condition because (1) we’ve intuitively canceled terms and (2) it now looks like a generic-ish condition, looks less mysterious, though idk how to argue for this beyond some handwaving about genericness, about other stuff being independent, sth like that.
proof of the tight frame claim from the previous condition: Note that $\sum_{x \in D} \nabla f_{p,i+1}(p_{i+1}(x)) \, \nabla f_{p,i+1}(p_{i+1}(x))^T \mathrel{\underset{\sim}{\propto}} I$ clearly implies that the L2 mass in any direction is the same, but also the L2 mass being the same in any direction implies the above (because then, letting the SVD of the matrix with these gradients in its columns be $U' \Sigma' V'^T$, the above is $U' \Sigma' \Sigma'^T U'^T = \sigma^2 I$, where we used the fact that all the singular values are equal to some $\sigma$, i.e., $\Sigma' \Sigma'^T = \sigma^2 I$).
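As a concrete instance of the equivalence in this proof, here is a sketch using the standard three-vector tight frame in the plane (unit vectors 120 degrees apart) as a stand-in for the set of gradients. The frame and the random test direction are my choices for illustration; nothing here comes from the paper.

```python
import numpy as np

# Columns of F: three unit vectors in R^2, 120 degrees apart (a tight frame).
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
F = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 3)

frame_op = F @ F.T                               # sum over frame vectors of v v^T
sing_vals = np.linalg.svd(F, compute_uv=False)   # singular values of the data matrix

# L2 mass of the frame along a random unit direction u: sum over v of (u . v)^2.
rng = np.random.default_rng(0)
u = rng.normal(size=2)
u /= np.linalg.norm(u)
mass_along_u = np.sum((u @ F) ** 2)

print(frame_op)       # proportional to the identity
print(sing_vals)      # all equal
print(mass_along_u)   # same value for every unit direction u
```

Here the frame operator is $\tfrac{3}{2} I$, both singular values are $\sqrt{3/2}$, and the mass along any unit direction is $\tfrac{3}{2}$, matching the three equivalent descriptions (tight frame, equal singular values, isotropic position).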
Some questions
Can one come up with some similar ansatz identity for the left singular vectors of $W_i$? One point of tension/interest here is that an ansatz identity for $W_i W_i^T$ would constrain the left singular vectors of $W_i$ together with its singular values, but the singular values are constrained already by the deep neural feature ansatz. So if there were another identity for $W_i W_i^T$ in terms of some gradients, we’d get a derived identity from equality between the singular values defined in terms of those gradients and the singular values defined in terms of the Deep Neural Feature Ansatz. Or actually, there probably won’t be an interesting identity here since given the cancellation above, it now feels like nothing about $W_i$ is really pinned down by ‘gradients independent of $W_i$’ by the DNFA? Of course, some $W_i$-dependence remains even in the $i+1$ gradients because the preactivations at which further gradients get evaluated are somewhat $W_i$-dependent, so I guess it’s not ruled out that the DNFA constrains something interesting about $W_i$? But anyway, all this seems to undermine the interestingness of the DNFA, as well as the chance of there being an interesting similar ansatz for the left singular vectors of $W_i$.
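The linear algebra behind this tension can be sketched in a few lines: $W^T W$ fixes the right singular vectors and singular values but is blind to the left singular vectors, while $W W^T$ shares the same nonzero spectrum. The matrices below are random and purely illustrative; `Q` is an arbitrary orthogonal matrix standing in for a change of left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 5
W = rng.normal(size=(k, k))
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal matrix
W_rot = Q @ W                                  # same V and Sigma as W, different U

# W^T W cannot tell W and Q W apart, so any identity for W^T W says nothing about U.
same_gram = np.allclose(W.T @ W, W_rot.T @ W_rot)

# The singular values agree too.
same_sv = np.allclose(np.linalg.svd(W, compute_uv=False),
                      np.linalg.svd(W_rot, compute_uv=False))

# W W^T does see the difference (generically).
same_outer = np.allclose(W @ W.T, W_rot @ W_rot.T)

# But W W^T and W^T W always share their spectrum, so an identity for W W^T
# would constrain the singular values a second time.
spec_inner = np.linalg.eigvalsh(W.T @ W)
spec_outer = np.linalg.eigvalsh(W @ W.T)

print(same_gram, same_sv, same_outer)
print(np.allclose(spec_inner, spec_outer))
```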
Can one heuristically motivate that the preactivation gradients above should indeed be close to being in isotropic position? Can one use this reduction to provide simpler proofs of some of the propositions in the paper which say that the DNFA is exactly true in certain very toy cases?
The authors claim that the DNFA is supposed to somehow elucidate feature learning (indeed, they claim it is a mechanism of feature learning?). I take ‘feature learning’ to mean something like which neuronal functions (from the input) are created or which functions are computed in a layer in some broader sense (maybe which things are made linearly readable?) or which directions in an activation space to amplify or maybe less precisely just the process of some internal functions (from the input to internal activations) being learned or something like that, which happens in finite networks apparently in contrast to infinitely wide networks or NTK models or something like that which I haven’t yet understood? I understand that their heuristic identity on the surface connects something about a weight matrix to something about gradients, but assuming I’ve not made some index-off-by-one error or something, it seems to probably not really be about that at all, since the weight matrix sorta cancels out: if it’s true for one $W_i$, it would maybe also be true with any other $W_i$ replacing it, so it doesn’t really pin down $W_i$? (This might turn out to be false if the isotropy of preactivation gradients is only true for a very particular choice of $W_i$.) But like, ignoring that counter, I guess their point is that the directions which get stretched most by the weight matrix in a layer are the directions along which it would be the best to move locally in that activation space to affect the output? (They don’t explain it this way though; maybe I’m ignorant of some other meaning having been attributed to $W_i^T W_i$ in previous literature or something.) But they say “Informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions.”
I guess maybe this is an appropriate description of the math if they are talking about reweighting in the purely linear sense, and they take features in the input layer to be scaleless objects or something? (Like, if we take features in the input activation space to each have some associated scale, then the right singular vector identity no longer says that most influential features get stretched the most.) I wish they were much more precise here, or if there isn’t a precise interesting philosophical thing to be deduced from their math, much more honest about that, much less PR-y.
So, in brief, instead of “informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions,” it seems to me that what the math warrants would be sth more like “The weight matrix reweights stuff; after reweighting, the activation space is roughly isotropic wrt affecting the prediction (ansatz); so, the stuff that got the highest weight has most effect on the prediction now.” I’m not that happy with this last statement either, but atm it seems much more appropriate than their claim.
I guess if I’m not confused about something major here (plausibly I am), one could probably add 1000 experiments (e.g. checking that the isotropic version of the ansatz indeed equally holds in a bunch of models) and write a paper responding to them. If you’re reading this and this seems interesting to you, feel free to do that — I’m also probably happy to talk to you about the paper.
typos in the paper
indexing error in the first displaymath in Sec 2: it probably should say WL″, not WL+1″