Thoughts, mostly on an alternative set of next experiments:
I find interpolations of effects to be a more intuitive way to study treatment effects, especially if I can modulate the treatment down to zero in a way that smoothly and predictably approaches the null case. It’s not exactly clear to me what the “nothing going on” case is here, but here are some possible experiments that interpolate between it and your treatment case.
alpha interpolation noise: A * noise + (1 − A) * MNIST, where the null case is the all-noise case (A = 1). Probably worth trying a bunch of different noise models, since MNIST doesn’t really look Gaussian at all.
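A minimal numpy sketch of the alpha mix, assuming images is an (N, 28, 28) float array of MNIST pixels in [0, 1]; the specific noise models here (Gaussian matched to the dataset mean/std, uniform) are just placeholder choices:

```python
import numpy as np

def alpha_noise_mix(images, alpha, rng, noise="gaussian"):
    # alpha = 1.0 is the all-noise null case; alpha = 0.0 is untouched MNIST.
    if noise == "gaussian":
        # Match the dataset's mean/std so only structure changes, not scale.
        n = rng.normal(images.mean(), images.std(), size=images.shape)
    elif noise == "uniform":
        n = rng.uniform(0.0, 1.0, size=images.shape)
    else:
        raise ValueError(f"unknown noise model: {noise}")
    return alpha * n + (1.0 - alpha) * images

rng = np.random.default_rng(0)
# e.g. rerun the treatment at each point of np.linspace(0.0, 1.0, 11)
```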
shuffle noise: Also worth looking at pixel/row/column shuffles, within an example or across the dataset, as a way of preserving some per-pixel statistics while still reducing the structure of the dataset to basically noise. Here the null case is again that fully shuffled data should be the “nothing interesting” case, but we don’t have to do any extra work to keep the per-pixel statistics constant.
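Here's one way that shuffling could look, again assuming an (N, H, W) float array; “pixels_across” preserves each pixel position’s marginal distribution exactly, while the other modes only preserve the per-example pixel-value histogram:

```python
import numpy as np

def shuffle_noise(images, mode, rng):
    out = images.copy()
    n, h, w = out.shape
    if mode == "pixels_within":      # shuffle pixels independently inside each example
        flat = out.reshape(n, h * w)
        for i in range(n):
            rng.shuffle(flat[i])
    elif mode == "pixels_across":    # shuffle each pixel position across the dataset
        flat = out.reshape(n, h * w)
        for j in range(h * w):
            flat[:, j] = rng.permutation(flat[:, j])
    elif mode == "rows":             # permute rows within each example
        for i in range(n):
            out[i] = out[i][rng.permutation(h)]
    elif mode == "cols":             # permute columns within each example
        for i in range(n):
            out[i] = out[i][:, rng.permutation(w)]
    else:
        raise ValueError(f"unknown shuffle mode: {mode}")
    return out
```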
data class interpolation: I think the simplest version of this is dropping digits, maybe starting with structurally similar ones (e.g. 1,7 vs 1,7,9). This doesn’t smoothly interpolate, but it still gives you a ton of different comparisons across different subsets of the digits. The assumption here is that more digits add more structure, so fewer digits move you toward the null case.
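Subsetting is just a mask over the labels; a tiny sketch, assuming images and labels arrays are already loaded (the particular subsets below are arbitrary examples):

```python
import numpy as np

def digit_subset(images, labels, digits):
    mask = np.isin(labels, list(digits))
    return images[mask], labels[mask]

subsets = [(1, 7), (1, 7, 9), (0, 8), tuple(range(10))]
# for digits in subsets: rerun the treatment on digit_subset(images, labels, digits)
```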
data size interpolation: downscaling the images, with or without added noise, should reduce the structure, so that the smaller the image (and the less data an example carries), the closer it resembles the null case.
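A dependency-free way to get this knob is block averaging (any resize method would do); this sketch assumes the image side lengths divide evenly by the factor:

```python
import numpy as np

def downscale(images, factor):
    n, h, w = images.shape
    assert h % factor == 0 and w % factor == 0
    # Average non-overlapping factor x factor blocks.
    return images.reshape(n, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

# e.g. downscale(images, 2) -> 14x14, downscale(images, 4) -> 7x7, downscale(images, 14) -> 2x2
```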
suboptimal initializations: neural networks are pretty hard to train (and can often degenerate) if initialized incorrectly. I think as you move away from a good initialization (of both model parameters and optimizer parameters), the results should approach the null / nothing-interesting case.
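One simple knob for this, sketched here assuming a PyTorch model (the layer type and init scheme are just illustrative): rescale a standard init by a factor, where scale = 1.0 is the usual case and very large or very small scales push toward the degenerate regime.

```python
import torch.nn as nn

def reinit_linear_layers(model, scale):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)   # standard init...
            m.weight.data.mul_(scale)           # ...then scaled away from it
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# for scale in (0.01, 0.1, 1.0, 10.0, 100.0): reinit_linear_layers(model, scale); train(model)
```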
model dimensionality reduction: similar to intrinsic dimensionality, you can artificially reduce the (linear) degrees of freedom of the model without significantly decreasing its expressivity by projecting into a smaller subspace. I think you’d need to get clever about this, because the naive version would just be a linear projection before your linear operation (and then basically a no-op).
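The non-naive version I have in mind is the intrinsic-dimensionality-style setup: keep the architecture fixed and only train a d-dimensional vector that gets mapped into the full parameter space through a fixed random projection. A rough PyTorch sketch (dense projection for simplicity; the paper uses sparse/Fastfood projections to make this cheap):

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class SubspaceModel(nn.Module):
    """Full params are theta0 + P @ z; only z (d-dimensional) is trained."""

    def __init__(self, model, d, seed=0):
        super().__init__()
        self.model = model
        self.theta0 = {k: v.detach().clone() for k, v in model.named_parameters()}
        n_total = sum(v.numel() for v in self.theta0.values())
        g = torch.Generator().manual_seed(seed)
        # Fixed (never trained) random projection from the subspace to all params.
        self.register_buffer("P", torch.randn(n_total, d, generator=g) / d ** 0.5)
        self.z = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        offset = self.P @ self.z
        params, i = {}, 0
        for k, v in self.theta0.items():
            n = v.numel()
            params[k] = v + offset[i:i + n].view_as(v)
            i += n
        return functional_call(self.model, params, (x,))

# Optimize only wrapper.z (e.g. torch.optim.Adam([wrapper.z], lr=1e-3)) and sweep d.
```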
I mostly say all this because I think it’s hard to evaluate a “something is up” claim (predictions don’t match empirical results) in ML from what amounts to a single experiment or an A/B test. It’s too easy (IMO) to get bugs/etc. A smoothly interpolating effect, with one side anchored to a well-established null / prior case and the other to the treatment case, varying smoothly with the treatment, is IMO much stronger evidence that “something is up”.
Hope there’s something in those that’s interesting and/or useful. If you haven’t already, I strongly recommend checking out the intrinsic dimensionality paper—you might get some mileage by swapping your cutoff point for their phase change measurement point.