Obvious thing I never thought of before:
Linear optimization where your model is of the form W_1 ⋅ W_2 ⋯ W_n, the W_i being matrices, will likely result in an effective model of low rank if you randomize the weights. Compared to just a single matrix (to which the problem is naively mathematically identical, but not computationally), this model won’t be able to learn the identity function, rotations, and so on when n is large.
Note: Someone else said this at a gathertown meetup. The context was that it is a bad idea to think about some ideal way of solving a problem and then assume a neural net (or indeed any learning algorithm) would learn it. Instead, focus on the concrete details of the model you’re training.
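To make the rank claim concrete, here is a small numerical sketch of my own (not from the post; the choice of d = 20 and the 1/√d scaling are arbitrary): multiply n random Gaussian matrices and watch the effective rank of the product, measured as the exponential of the entropy of its normalized singular values, collapse as n grows.

```python
# My own sketch (not from the post): effective rank of a product of
# random Gaussian matrices W_1 @ W_2 @ ... @ W_n as the depth n grows.
import numpy as np

rng = np.random.default_rng(0)
d = 20  # matrix size, chosen arbitrarily for illustration

def effective_rank(depth: int) -> float:
    """Exponential of the entropy of the product's normalized singular values."""
    product = np.eye(d)
    for _ in range(depth):
        # 1/sqrt(d) scaling keeps each individual factor roughly norm-preserving.
        product = product @ (rng.standard_normal((d, d)) / np.sqrt(d))
    s = np.linalg.svd(product, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

for n in [1, 2, 4, 8, 16, 32]:
    print(f"depth {n:2d}: effective rank ≈ {effective_rank(n):5.2f}")
```

Even though every factor is full rank, the singular value spectrum of the product spreads out exponentially with depth, so for large n the randomly initialized product behaves like a nearly rank-one map; that is the sense in which the effective model starts out low rank.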
Wow, I’m not convinced that won’t work. The only thing initializing with random weights should do is add a little noise. The naive mathematical interpretation should be the only possible interpretation up to your float errors, which, to be clear, will be real and will cause the model to be invisibly, slightly nonlinear. But as long as you’re using float32, you shouldn’t even notice.
[trying it, ETA 120 min to be happy with results… oops, I’m distractible...]
EDIT: Sorry, I tried something different. I used dense layers, each followed by batchnorm, then ReLU, and ended with a sigmoid, because I guess I just wanted to constrain things to the unit interval. I tried up to six layers. The difference in loss was not that large, but it was there. Also, the hidden layers were 30-dimensional.
I tried this, and the result was a monotonic decrease in performance after a single hidden layer. My dataset was 100k samples of a 20-dimensional vector for X, sampled randomly from [0, 1], with Y a copy of X. Loss was MSE, the optimizer was Adam with weight decay 0.05 and lr ≈ 0.001, the minibatch size was 32, and I trained for 100,000 steps.
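For concreteness, here is roughly how I would reconstruct that run (a sketch assuming PyTorch; the architecture and hyperparameters follow the description above, and anything not stated there, such as the seed, how minibatches are sampled, or the exact placement of BatchNorm, is my guess):

```python
# Reconstruction sketch of the identity-learning experiment described above
# (assuming PyTorch; details not stated in the comment are guesses).
import torch
from torch import nn

def make_model(n_hidden: int, d_in: int = 20, d_hidden: int = 30) -> nn.Sequential:
    """n_hidden blocks of Dense -> BatchNorm -> ReLU, then a sigmoid output in [0, 1]."""
    layers, d = [], d_in
    for _ in range(n_hidden):
        layers += [nn.Linear(d, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU()]
        d = d_hidden
    layers += [nn.Linear(d, d_in), nn.Sigmoid()]
    return nn.Sequential(*layers)

torch.manual_seed(0)
X = torch.rand(100_000, 20)  # 100k samples of 20-dim vectors in [0, 1) (assuming uniform)
Y = X.clone()                # the target is just a copy of the input

loss_fn = nn.MSELoss()
for depth in range(1, 7):    # one to six hidden layers
    model = make_model(depth)
    # "Adam with weight decay 0.05" per the comment; Adam's weight_decay is plain L2 here.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.05)
    for step in range(100_000):  # matches the comment, but slow; shrink for a quick check
        idx = torch.randint(0, X.shape[0], (32,))  # minibatch of 32
        loss = loss_fn(model(X[idx]), Y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{depth} hidden layers: final minibatch MSE ≈ {loss.item():.5f}")
```

The final print uses the last minibatch loss rather than a held-out evaluation, which is one of several places where the original run may well have differed.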
Also, I am doubtful that the mechanism (rank loss) is a big factor at such small depths. But I do think there’s something to the idea. If you multiply a long sequence of matrices, I expect the product to get extremely large, extremely small, or tend towards some kind of equilibrium. And then you have numerical stability issues and so on, which I think will ultimately make your big old matrix just sort of worthless.
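The blow-up/vanish point is also easy to check numerically (again my own sketch, not from the comment above; the depth and scalings are arbitrary): the norm of a running product of random matrices drifts exponentially with depth unless each factor is scaled to sit near the critical point.

```python
# My own sketch (not from the comment): norm drift of a long product of
# random Gaussian matrices under different per-factor scalings.
import numpy as np

rng = np.random.default_rng(1)
d, depth = 20, 64

for scale, label in [(1.0, "unscaled"),
                     (0.1, "scale 0.1"),
                     (1.0 / np.sqrt(d), "scale 1/sqrt(d)")]:
    product = np.eye(d)
    for _ in range(depth):
        product = product @ (scale * rng.standard_normal((d, d)))
    # ord=2 gives the spectral norm (largest singular value).
    print(f"{label:>15}: spectral norm after {depth} factors ≈ "
          f"{np.linalg.norm(product, 2):.3e}")
```

Roughly: too large a per-factor scale and the product explodes, too small and it vanishes, and only a scaling near 1/√d keeps it of order one, which is where the numerical stability worries come from.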
Oh, by “linear layer” you meant a nonlinear layer; oh my god, I hate terminology. I thought you literally just meant matrix multiplies.