Ah, I think we miscommunicated.
I meant “gelu(x) achieves its maximum curvature somewhere near x=0.”
People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|. In this sense gelu is like relu.
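To make that concrete, here is a quick numeric check of the curvature claim (a sketch using the exact erf-based gelu; the grid and step size are arbitrary, and the closed form is gelu''(x) = phi(x) * (2 - x^2), which peaks at x = 0 at sqrt(2/pi) ≈ 0.80):

```python
import numpy as np
from scipy.special import erf

# Exact (erf-based) gelu: gelu(x) = x * Phi(x), with Phi the standard normal CDF.
def gelu(x):
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

# Finite-difference second derivative; analytically gelu''(x) = phi(x) * (2 - x^2),
# which is maximized at x = 0 and decays toward zero for large |x|.
x = np.linspace(-4.0, 4.0, 8001)
h = x[1] - x[0]
g = gelu(x)
curvature = (g[2:] - 2.0 * g[1:-1] + g[:-2]) / h**2

print(x[1:-1][np.argmax(np.abs(curvature))])  # ≈ 0.0
print(np.abs(curvature).max())                # ≈ 0.798
```

The curvature decaying away from x=0 is what makes gelu look linear for large |x|.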
It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.
This is indeed unique to relu—I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops. Obviously you can’t simulate a differentiable function with that trick.
(OpenAI?)
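For what it's worth, the underlying effect is easy to see in a toy example (my own sketch, not the construction from that paper; the function name and the constant are just illustrative):

```python
import numpy as np

# Mathematically, x -> (x * c) / c is the identity, i.e. a purely linear op.
# In float32, though, x * c underflows to zero once it falls below the
# smallest subnormal (~1.4e-45), so tiny inputs are zeroed out while
# larger ones pass through roughly unchanged.
def dead_zone(x, c=np.float32(1e-38)):
    x = np.float32(x)
    return (x * c) / c

print(dead_zone(1.0))    # ~1.0
print(dead_zone(1e-8))   # 0.0  (the product underflowed)
print(dead_zone(-1e-8))  # -0.0 (small negatives are zeroed too)
```

Note this gives a dead zone around 0 rather than relu itself; the actual construction is more involved, but the nonlinearity comes from the same underflow behavior.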
Oh that’s not good. Looks like we’d need a version of float that keeps track of an interval of possible floats (via the two floats at the ends of the interval). Then we could simulate the behavior of infinite-precision floats so long as the network keeps the bounds tight, and we could train the network to keep the simulation in working order. Then we could see whether, in a network made genuinely linear at small numbers this way, every visibly large effect has a visibly large cause.
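Something like this, as a first sketch (the Interval class and the one-ulp outward widening via np.nextafter are just a guess at the primitive, not an existing library; proper directed rounding would give tighter bounds):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Interval:
    """A float32 interval [lo, hi] guaranteed to contain the
    infinite-precision value of the computation so far."""
    lo: np.float32
    hi: np.float32

    def __add__(self, other):
        # Round to nearest, then widen by one ulp in each direction so the
        # true sum is still inside the bounds.
        return Interval(
            np.nextafter(self.lo + other.lo, np.float32(-np.inf)),
            np.nextafter(self.hi + other.hi, np.float32(np.inf)),
        )

    def __mul__(self, other):
        # With signs unknown, any corner product can be the min or the max.
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(
            np.nextafter(min(corners), np.float32(-np.inf)),
            np.nextafter(max(corners), np.float32(np.inf)),
        )

    def width(self):
        # The quantity the network would be trained to keep small.
        return self.hi - self.lo

x = Interval(np.float32(0.1), np.float32(0.1))
y = x * x + Interval(np.float32(1.0), np.float32(1.0))
print(y.lo, y.hi, y.width())  # a tight bracket around 1.01
```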
By the way—have you seen what happens when you finetune GPT to reinforce this pattern that you’re observing, that every entry of the table, not just the top right one, predicts an input token?