We can then optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the output activations.
Would love to see more in this line of work.
Could you explain how you are formulating/solving this optimization problem in more detail?
Suppose our model has the following format:

$$\mathrm{Model}(\mathrm{input}) = (M_3 \circ NL \circ M_2 \circ NL \circ M_1)(\mathrm{input})$$

where $M_3, M_2, M_1$ are matrix multiplies, and $NL$ is our nonlinear layer.
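To make the setup concrete, here is a minimal PyTorch sketch of a model in this format; the layer shapes (784 → 128 → 64 → 10, i.e. MNIST-sized) are illustrative assumptions rather than the actual architecture.

```python
import torch.nn as nn

# Hypothetical MNIST-sized layers; only the overall shape of the model matters here.
M1 = nn.Linear(784, 128, bias=False)   # first matrix multiply
M2 = nn.Linear(128, 64, bias=False)    # second matrix multiply
M3 = nn.Linear(64, 10, bias=False)     # third matrix multiply
NL = nn.ReLU()                         # the nonlinear layer

def model(x):
    # Model(input) = (M3 ∘ NL ∘ M2 ∘ NL ∘ M1)(input)
    return M3(NL(M2(NL(M1(x)))))
```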
We also define a sparsity measure to minimize, chosen for the fun property that it really really really likes zeros compared to almost all other numbers.
$$\mathrm{Sparsity}(A) = -\sum_{i,j} \frac{1}{|a_{i,j}| + 0.1}$$

Note that lower sparsity according to this measure means more zeros.
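As a minimal sketch, the measure is a one-liner in PyTorch (the function name `sparsity` is mine):

```python
import torch

def sparsity(A: torch.Tensor) -> torch.Tensor:
    # An exact zero contributes -1/0.1 = -10 to the sum, while a large entry
    # contributes roughly 0, so the objective rewards producing exact zeros
    # far more than merely shrinking entries that are already large.
    return -(1.0 / (A.abs() + 0.1)).sum()
```

Because each per-entry term saturates for large values, the gradient pressure is concentrated on entries that are already near zero, pushing them the rest of the way there.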
There are two reasonable ways of finding the right rotations. I will describe one way in depth, and the other not so in depth. Do note that the specifics of all this may change once I run a few experiments to determine whether there are any shortcuts I'm able to take[1].
We know the input is in a preferred basis. In our MNIST case, it is just the pixels on the screen. These likely interact locally because the relevant first-level features are local. If you want to find a line in the bottom right, you don’t care about the existence of white pixels in the top left.
We choose our first rotation $R_1$ so as to minimize

$$\mathrm{Sparsity}(R_1 J_1)$$

where

$$J_1 := J_{\mathrm{input}}(NL \circ M_1)$$

Then the second rotation $R_2$ so as to minimize

$$\mathrm{Sparsity}(R_2 J_2)$$

where

$$J_2 := J_{(R_1 \circ NL \circ M_1)(\mathrm{input})}(NL \circ M_2 \circ R_1^{-1})$$

And finally choosing $R_3$ so as to minimize

$$\mathrm{Sparsity}(R_3 J_3)$$

where

$$J_3 := J_{(R_2 \circ NL \circ M_2 \circ NL \circ M_1)(\mathrm{input})}(M_3 \circ R_2^{-1}).$$
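Reading $J_x(f)$ as the Jacobian of $f$ evaluated at the point $x$, the first two stages might look roughly like the following PyTorch sketch (the layer shapes are the same illustrative assumptions as above, re-declared so the snippet stands on its own):

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

# Illustrative MNIST-sized layers, as in the earlier sketch.
M1, M2 = nn.Linear(784, 128, bias=False), nn.Linear(128, 64, bias=False)
NL = nn.ReLU()

def sparsity(A):
    return -(1.0 / (A.abs() + 0.1)).sum()

x = torch.rand(784)  # a single flattened input

# Stage 1: J1 is the Jacobian of NL ∘ M1 at the input; shape (128, 784).
J1 = jacobian(lambda inp: NL(M1(inp)), x)
R1 = torch.eye(128)                # placeholder for the learned rotation
stage1_loss = sparsity(R1 @ J1)

# Stage 2: the Jacobian of NL ∘ M2 ∘ R1^{-1}, evaluated at the rotated
# first-layer activations (R1 ∘ NL ∘ M1)(input); shape (64, 128).
a1 = R1 @ NL(M1(x))
J2 = jacobian(lambda a: NL(M2(R1.T @ a)), a1)   # R1^{-1} = R1^T for a rotation
R2 = torch.eye(64)
stage2_loss = sparsity(R2 @ J2)
```

The third stage follows the same pattern, with $M_3 \circ R_2^{-1}$ as the function and the full composition up to the second nonlinearity as the evaluation point.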
The other way of doing this is to suppose the output is in a preferred basis, instead of the input.
Currently I’m doing this minimization using gradient descent (lr = 0.0001), and parameterizing my rotation matrices using the fact that if $A$ is an antisymmetric matrix[2], then $e^A$ is a rotation matrix, and that you can make an antisymmetric matrix by choosing any old matrix $B$, then setting $A := B - B^\top$. So we just figure out which $B$ gets us an $R = e^{B - B^\top}$ that has the properties we like.

There is probably a far, far better way of solving this, other than gradient descent. If you are interested in the specifics, you may know a better way. Please, please tell me a better way!
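Concretely, the parameterization-plus-gradient-descent loop might look something like this sketch (made-up shapes, a stand-in Jacobian, and my own variable names; the real version optimizes against the actual $J_1$):

```python
import torch

def sparsity(A):
    return -(1.0 / (A.abs() + 0.1)).sum()

def rotation_from(B: torch.Tensor) -> torch.Tensor:
    # B - B^T is antisymmetric, and the matrix exponential of an
    # antisymmetric matrix is a rotation (orthogonal with determinant +1).
    return torch.linalg.matrix_exp(B - B.T)

J1 = torch.rand(128, 784)                        # stand-in for the real Jacobian
B = torch.zeros(128, 128, requires_grad=True)    # R1 starts at the identity
opt = torch.optim.SGD([B], lr=1e-4)

for step in range(1000):
    opt.zero_grad()
    R1 = rotation_from(B)
    loss = sparsity(R1 @ J1)
    loss.backward()
    opt.step()
```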
An example of a shortcut: I really don’t want to find a rotation which minimizes average sparsity across every input directly. This sounds very computationally expensive! Does minimizing my sparsity metric on a particular input, or only a few inputs, generalize to minimizing the sparsity metric on many inputs?
Meaning a matrix whose top-right half is the opposite sign of its bottom-left half, i.e. $A^\top = -A$ (which also forces its diagonal to be zero).
I’ll put a $100 bounty on a better way that either saves Garrett at least 5 hours of research time, or is qualitatively better such that he settles on it.
What’s the motivation for that specific sparsity prior/regularizer? Seems interestingly different from the standard $L_n$ norms.
Empirically, it works better than all the $L_n$ norms for getting me zeros. Theoretically, it really likes zeros, whereas lots of other norms just like low numbers, which is a different thing when talking about sparsity. I want zeros. I don’t just want low numbers.
Work I’m doing at redwood involves doing somewhat similar things.
Some observations which you plausibly are already aware of:
You could use geotorch for the parametrization. geotorch has now been ‘upstreamed’ into PyTorch as well.
It’s also possible to use the Q from the QR decomposition to accomplish this. This has some advantages for me (specifically, you can orthogonalize arbitrary unfolded tensors which are parameterized in factored form); however, I believe the gradients via SGD will be less nice when using QR. (A rough sketch of both parameterizations appears below, after these observations.)
Naively, there probably isn’t a better way to learn than via gradient descent (possibly with better initialization, etc.). This is ‘just some random non-convex optimization problem’, so what could you hope for? If you minimize sparsity on a single input as opposed to on average, then it seems plausible to me that you could pick a sparsity criterion such that the problem can be optimized in a nicer way (but I’d also expect that minimizing sparsity on a single input isn’t really what you want).
You could hope for more, even for a random non-convex optimization problem, if you can set up a tight relaxation. E.g. this paper gives you optimality bounds via a semidefinite relaxation, though I am not sure if it would scale to the size of problems relevant here.
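For concreteness, here is a rough sketch of the two parameterizations mentioned above, assuming the upstreamed `torch.nn.utils.parametrizations.orthogonal` utility and the differentiable `torch.linalg.qr`; the shapes, loss, and training loop are the same illustrative placeholders as before.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

def sparsity(A):
    return -(1.0 / (A.abs() + 0.1)).sum()

J1 = torch.rand(128, 784)   # stand-in Jacobian

# Option 1: geotorch-style parametrization, now available in PyTorch itself.
# R.weight is kept orthogonal while SGD updates the unconstrained parameters
# underneath the parametrization.
R = orthogonal(nn.Linear(128, 128, bias=False))
opt = torch.optim.SGD(R.parameters(), lr=1e-4)
for step in range(1000):
    opt.zero_grad()
    loss = sparsity(R.weight @ J1)
    loss.backward()
    opt.step()

# Option 2: parameterize via QR. torch.linalg.qr is differentiable, so the
# orthogonal factor Q of an unconstrained matrix B can play the role of the
# rotation, though the gradients through QR can indeed be less well-behaved.
B = torch.randn(128, 128, requires_grad=True)
opt = torch.optim.SGD([B], lr=1e-4)
for step in range(1000):
    opt.zero_grad()
    Q, _ = torch.linalg.qr(B)
    loss = sparsity(Q @ J1)
    loss.backward()
    opt.step()
```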
Interesting QR decomposition idea. I’m going to try using the Q as the initialization point of the rotation matrix, and see if this has any effect.