ReLU activation is the stupidest ML idea I’ve ever heard; everyone knows sigmoid um somehow feels optimal you know it is a real function from like real math. (ReLU only survived because it got a ridiculous acronym word thing and sounds complicated so you feel smart.)
No, ReLU is great, because it induces semantically meaningful sparseness (for the same geometric reason which causes L1-regularization to induce sparseness)!
It’s a nice compromise between the original perceptron step function (which is incompatible with gradient methods) and the sigmoids, which have tons of problems (they saturate unpleasantly at both ends and don’t want to move from there).
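To make the comparison concrete, here is a rough sketch (plain Python, with function names made up for illustration) of the derivative each activation hands to a gradient method. The step function gives zero wherever its derivative is defined, the sigmoid's derivative shrinks toward zero at both ends, and ReLU keeps a derivative of 1 for any positive input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step_grad(x):
    # The step function is flat on both sides of 0, so its derivative is 0
    # wherever it is defined: gradient descent gets no signal at all.
    return 0.0


def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s * (1 - s).
    # It peaks at 0.25 (x = 0) and vanishes as |x| grows (saturation).
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # ReLU is max(0, x): slope 1 on the positive side, 0 on the negative side.
    # (The value at exactly x = 0 is a convention; 0 is used here.)
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    for x in (-10.0, -1.0, 1.0, 10.0):
        print(x, step_grad(x), sigmoid_grad(x), relu_grad(x))
```

At x = 10 the sigmoid derivative is already below 1e-4, while the ReLU derivative is still exactly 1.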
What’s dumb is that the goodness of ReLU was not discovered in the early 1970s. That would have been the natural timeline, given that ReLU was introduced in the late 1960s and is, in any case, very natural, being the integral of the step function. Instead, people only discovered the sparseness-inducing properties of ReLU in 2000 and published that in Nature of all places, and it was still ignored completely for another decade. Only after 3 papers of a more applied flavor came out in 2009-2011 was it adopted, and by 2015 it had overtaken sigmoids as the most popular activation function in use, because it worked so much better. (See https://en.wikipedia.org/wiki/Rectifier_(neural_networks) for references.)
It’s quite likely that without ReLU, AlexNet would not have been able to improve the SOTA as spectacularly as it did, triggering the “first deep learning revolution”.
That being said, it is better to use them in pairs, (relu(x), relu(-x)); this way you always get signal (e.g. TensorFlow has a crelu function, which is exactly this pair of ReLUs).
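A minimal sketch of the pairing idea in plain Python (this is not TensorFlow's implementation; the list-based crelu below is just an illustration): the output is relu(x) followed by relu(-x), so for any nonzero input exactly one of the two halves is active, and the original value can be recovered as relu(x) - relu(-x).

```python
def relu(x):
    return max(0.0, x)


def crelu(xs):
    # Concatenate the positive parts with the negated negative parts,
    # doubling the number of outputs.
    return [relu(x) for x in xs] + [relu(-x) for x in xs]


if __name__ == "__main__":
    xs = [-2.0, 0.0, 3.0]
    out = crelu(xs)
    print(out)

    # No information is lost: relu(x) - relu(-x) gives back x.
    n = len(xs)
    print([out[i] - out[n + i] for i in range(n)])
```

With a single ReLU, the input -2.0 would map to 0 and pass no gradient; in the paired version it shows up as 2.0 in the second half.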
Of course ReLU is great!! I was trying to say that if I were a 2009 ANN researcher (unaware of prior ReLU uses, like most people probably were at the time) and someone (who had not otherwise demonstrated expertise) came in and asked why we use this particular woosh instead of a bent line or something, then I would’ve thoroughly explained the thought out of them. It’s possible that I would’ve realized how it works, but very unlikely IMO. But a dumbworker is more likely to say “Go do it. Now. Go. Do it now. Leave. Do it.” as I see it.