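As an aside, could you elaborate on why you think “ReLU is better than sigmoid” is a “weird trick”, if that is what the question implies?
The reason I thought was commonly agreed on is that ReLU helps with the vanishing gradient problem: the sigmoid's derivative is at most 0.25 and approaches zero as |x| grows, while ReLU's derivative is exactly 1 for any positive input (this can be seen from the graphs of the two functions).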
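
To make the gradient argument concrete, here is a rough sketch rather than a real network. It assumes a chain of identical layers, ignores the weights entirely, and evaluates every layer's local derivative at the same input; the helper names (`sigmoid_grad`, `relu_grad`, `chained_grad`) are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s); its maximum is 0.25, reached at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU passes the gradient through unchanged for x > 0 and blocks it otherwise
    return 1.0 if x > 0 else 0.0

def chained_grad(grad_fn, x, depth):
    # Simplifying assumption: every layer sees the same pre-activation x and
    # weights are ignored, so backprop reduces to a product of `depth`
    # identical local derivatives.
    return grad_fn(x) ** depth

depth = 20
best_case_sigmoid = chained_grad(sigmoid_grad, 0.0, depth)  # 0.25 ** 20
relu_active = chained_grad(relu_grad, 1.0, depth)           # 1.0 ** 20

print(f"sigmoid, best case over {depth} layers: {best_case_sigmoid:.3e}")
print(f"ReLU, active units over {depth} layers: {relu_active}")
```

Even in the sigmoid's best case the product shrinks geometrically with depth, while it stays at 1 along a path of active ReLU units. For negative inputs the ReLU derivative is 0, so this only holds on paths where the units are active.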