We can then optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the output activations.
Would love to see more in this line of work.
Could you explain how you are formulating/solving this optimization problem in more detail?
Suppose our model has the following format:

$$\mathrm{Model}(\mathrm{input}) = (M_3 \circ NL \circ M_2 \circ NL \circ M_1)(\mathrm{input})$$

where $M_3, M_2, M_1$ are matrix multiplies, and $NL$ is our nonlinear layer.
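To make the setup concrete, here is a minimal PyTorch sketch of a model in this format; the layer shapes (784 → 128 → 64 → 10, i.e. MNIST-sized) are illustrative assumptions rather than the actual architecture.

```python
import torch.nn as nn

# Hypothetical MNIST-sized layers; only the overall shape of the model matters here.
M1 = nn.Linear(784, 128, bias=False)   # first matrix multiply
M2 = nn.Linear(128, 64, bias=False)    # second matrix multiply
M3 = nn.Linear(64, 10, bias=False)     # third matrix multiply
NL = nn.ReLU()                         # the nonlinear layer

def model(x):
    # Model(input) = (M3 ∘ NL ∘ M2 ∘ NL ∘ M1)(input)
    return M3(NL(M2(NL(M1(x)))))
```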
We also define a sparsity measure to minimize, chosen for the fun property that it really really really likes zeros compared to almost all other numbers.
$$\mathrm{Sparsity}(A) = -\sum_{i,j} \frac{1}{|a_{i,j}| + 0.1}$$

Note that lower sparsity according to this measure means more zeros.
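As a minimal sketch, the measure is a one-liner in PyTorch (the function name `sparsity` is mine):

```python
import torch

def sparsity(A: torch.Tensor) -> torch.Tensor:
    # An exact zero contributes -1/0.1 = -10 to the sum, while a large entry
    # contributes roughly 0, so the objective rewards producing exact zeros
    # far more than merely shrinking entries that are already large.
    return -(1.0 / (A.abs() + 0.1)).sum()
```

Because each per-entry term saturates for large values, the gradient pressure is concentrated on entries that are already near zero, pushing them the rest of the way there.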
There are two reasonable ways of finding the right rotations. I will describe one way in depth, and the other not so in depth. Do note that the specifics of all this may change once I run a few experiments to determine whether there are any shortcuts I'm able to take[1].
We know the input is in a preferred basis. In our MNIST case, it is just the pixels on the screen. These likely interact locally because the relevant first-level features are local. If you want to find a line in the bottom right, you don’t care about the existence of white pixels in the top left.
We choose our first rotation $R_1$ so as to minimize

$$\mathrm{Sparsity}(R_1 J_1)$$

where

$$J_1 := J_{\mathrm{input}}(NL \circ M_1)$$

Then the second rotation $R_2$ so as to minimize

$$\mathrm{Sparsity}(R_2 J_2)$$

where

$$J_2 := J_{(R_1 \circ NL \circ M_1)(\mathrm{input})}(NL \circ M_2 \circ R_1^{-1})$$

And finally choosing $R_3$ so as to minimize

$$\mathrm{Sparsity}(R_3 J_3)$$

where

$$J_3 := J_{(R_2 \circ NL \circ M_2 \circ NL \circ M_1)(\mathrm{input})}(M_3 \circ R_2^{-1}).$$
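Reading $J_x(f)$ as the Jacobian of $f$ evaluated at the point $x$, the first two stages might look roughly like the following PyTorch sketch (the layer shapes are the same illustrative assumptions as above, re-declared so the snippet stands on its own):

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

# Illustrative MNIST-sized layers, as in the earlier sketch.
M1, M2 = nn.Linear(784, 128, bias=False), nn.Linear(128, 64, bias=False)
NL = nn.ReLU()

def sparsity(A):
    return -(1.0 / (A.abs() + 0.1)).sum()

x = torch.rand(784)  # a single flattened input

# Stage 1: J1 is the Jacobian of NL ∘ M1 at the input; shape (128, 784).
J1 = jacobian(lambda inp: NL(M1(inp)), x)
R1 = torch.eye(128)                # placeholder for the learned rotation
stage1_loss = sparsity(R1 @ J1)

# Stage 2: the Jacobian of NL ∘ M2 ∘ R1^{-1}, evaluated at the rotated
# first-layer activations (R1 ∘ NL ∘ M1)(input); shape (64, 128).
a1 = R1 @ NL(M1(x))
J2 = jacobian(lambda a: NL(M2(R1.T @ a)), a1)   # R1^{-1} = R1^T for a rotation
R2 = torch.eye(64)
stage2_loss = sparsity(R2 @ J2)
```

The third stage follows the same pattern, with $M_3 \circ R_2^{-1}$ as the function and the full composition up to the second nonlinearity as the evaluation point.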
The other way of doing this is to suppose the output is in a preferred basis, instead of the input.
Currently I’m doing this minimization using gradient descent (lr = 0.0001), and parameterizing my rotation matrices using the fact that if $A$ is an antisymmetric matrix[2], then $e^A$ is a rotation matrix, and that you can make an antisymmetric matrix by choosing any old matrix $B$, then setting $A := B - B^\top$. So we just figure out which $B$ gets us an $R = e^{B - B^\top}$ that has the properties we like.

There is probably a far, far better way of solving this, other than gradient descent. If you are interested in the specifics, you may know a better way. Please, please tell me a better way!
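Concretely, the parameterization-plus-gradient-descent loop might look something like this sketch (made-up shapes, a stand-in Jacobian, and my own variable names; the real version optimizes against the actual $J_1$):

```python
import torch

def sparsity(A):
    return -(1.0 / (A.abs() + 0.1)).sum()

def rotation_from(B: torch.Tensor) -> torch.Tensor:
    # B - B^T is antisymmetric, and the matrix exponential of an
    # antisymmetric matrix is a rotation (orthogonal with determinant +1).
    return torch.linalg.matrix_exp(B - B.T)

J1 = torch.rand(128, 784)                        # stand-in for the real Jacobian
B = torch.zeros(128, 128, requires_grad=True)    # R1 starts at the identity
opt = torch.optim.SGD([B], lr=1e-4)

for step in range(1000):
    opt.zero_grad()
    R1 = rotation_from(B)
    loss = sparsity(R1 @ J1)
    loss.backward()
    opt.step()
```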
An example of a shortcut: I really don’t want to find a rotation which minimizes average sparsity across every input directly. This sounds very computationally expensive! Does minimizing my sparsity metric on a particular input, or only a few inputs, generalize to minimizing the sparsity metric on many inputs?
Meaning a matrix whose top-right half is the opposite sign of its bottom-left half, i.e. $A^\top = -A$ (which also forces its diagonal to be zero).
I’ll put a $100 bounty on a better way that either saves Garrett at least 5 hours of research time, or is qualitatively better such that he settles on it.
What’s the motivation for that specific sparsity prior/regularizer? Seems interestingly different from the standard $L_n$ norms.
Empirically, it works better than all the $L_n$ norms for getting me zeros. Theoretically, it really likes zeros, whereas lots of other norms just like low numbers, which is a different thing when talking about sparsity. I want zeros. I don’t just want low numbers.
Work I’m doing at redwood involves doing somewhat similar things.
Some observations which you plausibly are already aware of:
You could use geotorch for the parametrization. geotorch has now been ‘upstreamed’ into PyTorch as well.
It’s also possible to use the Q from the QR decomposition to accomplish this. This has some advantages for me (specifically, you can orthogonalize arbitrary unfolded tensors which are parameterized in factored form); however, I believe the gradients via SGD will be less nice when using QR. (A rough sketch of both parameterizations appears below, after these observations.)
Naively, there probably isn’t a better way to learn than via gradient descent (possibly with better initialization, etc.). This is ‘just some random non-convex optimization problem’, so what could you hope for? If you minimize sparsity on a single input as opposed to on average, then it seems plausible to me that you could pick a sparsity criterion such that the problem can be optimized in a nicer way (but I’d also expect that minimizing sparsity on a single input isn’t really what you want).
You could hope for more, even for a random non-convex optimization problem, if you can set up a tight relaxation. E.g. this paper gives you optimality bounds via a semidefinite relaxation, though I am not sure if it would scale to the size of problems relevant here.
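For concreteness, here is a rough sketch of the two parameterizations mentioned above, assuming the upstreamed `torch.nn.utils.parametrizations.orthogonal` utility and the differentiable `torch.linalg.qr`; the shapes, loss, and training loop are the same illustrative placeholders as before.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

def sparsity(A):
    return -(1.0 / (A.abs() + 0.1)).sum()

J1 = torch.rand(128, 784)   # stand-in Jacobian

# Option 1: geotorch-style parametrization, now available in PyTorch itself.
# R.weight is kept orthogonal while SGD updates the unconstrained parameters
# underneath the parametrization.
R = orthogonal(nn.Linear(128, 128, bias=False))
opt = torch.optim.SGD(R.parameters(), lr=1e-4)
for step in range(1000):
    opt.zero_grad()
    loss = sparsity(R.weight @ J1)
    loss.backward()
    opt.step()

# Option 2: parameterize via QR. torch.linalg.qr is differentiable, so the
# orthogonal factor Q of an unconstrained matrix B can play the role of the
# rotation, though the gradients through QR can indeed be less well-behaved.
B = torch.randn(128, 128, requires_grad=True)
opt = torch.optim.SGD([B], lr=1e-4)
for step in range(1000):
    opt.zero_grad()
    Q, _ = torch.linalg.qr(B)
    loss = sparsity(Q @ J1)
    loss.backward()
    opt.step()
```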
Interesting QR decomposition idea. I’m going to try using the Q as the initialization point of the rotation matrix, and see if this has any effect.