I wonder if a weaker and more defensible thesis would be that deep learning models are mostly linear (and maybe the few non-linearities could be separated out and identified? Has anyone tried applying ReLUs to only some outputs, leaving the rest untouched?). It would seem really strange to me if they truly were linear, because that would mean:
activation functions are essentially unnecessary
forget SGD, you could just train them with one-shot linear regression (well, ok, they're still so big that you'd probably need gradient descent anyway, but fitting a linear function is a much more deterministic process, since the problem is convex)
You wouldn't even need multiple layers, just one big matrix, since a stack of linear layers collapses into a single linear map. It feels weird that an entire field would have overlooked such a trivial simplification.
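To make the last point concrete, here is a minimal sketch (the shapes and weights are arbitrary, made-up values) showing that two linear layers with no activation in between are exactly equivalent to one pre-multiplied matrix, which is why purely linear networks gain nothing from depth:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between.
W1 = rng.standard_normal((8, 4))   # first layer: 4 inputs -> 8 hidden units
W2 = rng.standard_normal((3, 8))   # second layer: 8 hidden units -> 3 outputs
x = rng.standard_normal(4)

# Applying the layers one after another...
deep = W2 @ (W1 @ x)

# ...gives the same result as applying a single collapsed matrix.
W_collapsed = W2 @ W1
shallow = W_collapsed @ x

print(np.allclose(deep, shallow))  # True
```

Inserting any non-linearity between the two matmuls (e.g. `np.maximum(0, W1 @ x)` for a ReLU) breaks this equivalence, which is exactly why activation functions are what make depth matter.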