[Question] Nonlinear limitations of ReLUs

A neural net of any size that uses rectified linear unit (ReLU) activation functions is unable to approximate the function sin(x) outside a compact interval.
I am reasonably confident that I can prove that any NN with ReLU activations computes a piecewise linear function, since it is a composition of affine maps and the piecewise linear ReLU. I believe the number of linear pieces that can be achieved is bounded above by 2^(L*D), where L is the number of nodes per layer and D is the number of layers, since each of the L*D units is either active or inactive on any given input.
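As a quick empirical check (a minimal sketch, assuming scikit-learn's MLPRegressor; the layer sizes, training interval, and test points are arbitrary choices of mine), a ReLU MLP fit to sin(x) on a bounded interval can only continue along a single affine piece outside it:

```python
# Minimal sketch: fit sin(x) on [-2*pi, 2*pi] with a ReLU MLP, then query it
# far outside the training interval, where a piecewise linear function can
# only continue along its last affine piece.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2000, 1))
y_train = np.sin(x_train).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                     max_iter=5000, random_state=0)
model.fit(x_train, y_train)

x_test = np.linspace(10 * np.pi, 12 * np.pi, 5).reshape(-1, 1)
print(model.predict(x_test))  # roughly affine in x, nothing like sin(x)
```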
This leads me to two questions:
1. Is the inability to approximate periodic functions of a single variable important?
1a. If not, why not?
1b. If so, is there practical data augmentation that can be used to improve performance at reasonable compute cost? E.g., naively, augment the input vector {x_i} with {sin(x_i)} whenever x_i is a scalar (a rough sketch of this appears just after the list).
2. Since the number of parameters of a NN scales as L^2*D (each of the D layers carries an LxL weight matrix) while the exponent in the trivial bound on the number of linear pieces scales as L*D, is this why neural nets go deep rather than going “wide”?
2a. Are there established scaling hypotheses for the growth of depth vs. layer size?
2b. Are there better (probabilistic) analytic or empirical bounds on the number of linear pieces achieved by NNs of a given size?
2c. Are there activation functions that would avoid this constraint? I imagine a similar analytic constraint replacing “piecewise linear” with “piecewise strictly increasing” for classic activations like sigmoid or arctan. Something something Fourier transform something something?
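Regarding 1b, here is a rough sketch of that augmentation (the helper name augment_with_sin is mine, and appending cos as well as sin at a fixed unit frequency is an extra assumption; in practice the relevant period would not be known in advance):

```python
# Hypothetical feature augmentation: append sin and cos of each scalar input
# column, so a downstream ReLU net sees periodic features directly.
import numpy as np

def augment_with_sin(X):
    """Concatenate sin and cos of each column onto the input matrix."""
    X = np.asarray(X, dtype=float)
    return np.concatenate([X, np.sin(X), np.cos(X)], axis=1)

X = np.linspace(-10.0, 10.0, 5).reshape(-1, 1)
print(augment_with_sin(X).shape)  # (5, 3): original column plus sin and cos
```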
Regarding (2a), empirically I found that while approximating sin(x) with small NNs in scikit-learn, increasing the width of the network caused catastrophic failure of learning (starting at approximately L=10 with D=4, at L=30 with D=8, and at L=50 with D=50).
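A sketch of that kind of sweep, not the exact setup; the widths, depth, and MLPRegressor settings below are illustrative, and where learning breaks down will depend on the solver, learning rate, and initialization:

```python
# Sketch of a width sweep: train ReLU MLPs of fixed depth but increasing width
# on sin(x) and report the in-sample error, to see where learning breaks down.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2000, 1))
y = np.sin(X).ravel()

depth = 4
for width in (5, 10, 20, 40):
    net = MLPRegressor(hidden_layer_sizes=(width,) * depth, activation="relu",
                       max_iter=5000, random_state=0)
    net.fit(X, y)
    print(f"width={width}: train MSE={mean_squared_error(y, net.predict(X)):.4f}")
```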
Regarding (1), naively this seems relevant to questions of out-of-distribution performance and especially the problem of what it means for an input to be out-of-distribution in large input spaces.
Is the inability to approximate periodic functions of a single variable important?
Periodic functions are already used as an important encoding in SOTA ANNs, from transformer LLMs to NeRFs in graphics. From the instant-ngp paper:
For neural networks, input encodings have proven useful in the attention components of recurrent architectures [Gehring et al. 2017] and, subsequently, transformers [Vaswani et al. 2017], where they help the neural network to identify the location it is currently processing. Vaswani et al. [2017] encode scalar positions x ∈ ℝ as a multiresolution sequence of L ∈ ℕ sine and cosine functions
enc(x) = (sin(2^0 x), sin(2^1 x), ..., sin(2^(L−1) x), cos(2^0 x), cos(2^1 x), ..., cos(2^(L−1) x)).    (1)
This has been adopted in computer graphics to encode the spatio-directionally varying light field and volume density in the NeRF algorithm [Mildenhall et al. 2020].
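For concreteness, a minimal sketch of the encoding in Eq. (1); the helper name frequency_encoding and the num_frequencies parameter are my own choices:

```python
# Map scalar inputs x to L sine and L cosine features at dyadic frequencies:
# enc(x) = (sin(2^0 x), ..., sin(2^(L-1) x), cos(2^0 x), ..., cos(2^(L-1) x)).
import numpy as np

def frequency_encoding(x, num_frequencies):
    """Return an (N, 2*num_frequencies) array of sin/cos features."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    freqs = 2.0 ** np.arange(num_frequencies)  # 2^0, 2^1, ..., 2^(L-1)
    angles = x * freqs                         # broadcasts to (N, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

print(frequency_encoding([0.5, 1.0], num_frequencies=4).shape)  # (2, 8)
```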