Well now I feel kind of dumb (for misremembering how LayerNorm works). I’ve actually spent the past day since making the video wondering why information leakage of the form you describe doesn’t occur in most transformers, so it’s honestly kind of a relief to realize this.
It seems to me that ReLU is a reasonable approximation of GELU, even for networks that actually use GELU. So one can think of GELU(x) = xΦ(x) as just having a slightly messy mask function (Φ(x)) that is sort-of-well-approximated by the ReLU binary mask function.
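As a minimal sketch of that framing (the helper names phi/gelu/relu are just illustrative, and I'm using the exact Gaussian CDF rather than the tanh approximation some libraries use), both activations can be written as "input times mask" — GELU just swaps the hard 0/1 step for the smooth mask Φ(x):

```python
import math

def phi(x):
    """Standard normal CDF: the 'soft mask' in GELU(x) = x * Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    # Smooth mask times input.
    return x * phi(x)

def relu(x):
    # Hard binary mask times input.
    return x * (1.0 if x > 0 else 0.0)

for x in [-3.0, -1.0, -0.1, 0.1, 1.0, 3.0]:
    print(f"x={x:+.1f}  Phi(x)={phi(x):.3f}  gelu={gelu(x):+.3f}  relu={relu(x):+.3f}")
```

Away from zero the mask Φ(x) saturates to 0 or 1, so the two activations nearly coincide; they only really differ in a band around x = 0.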
Thanks! Enjoy your holidays!