A key distinction is between linearity in the weights vs. linearity in the input data.
For example, the function f(a, b, x, y) = a·sin(x) + b·cos(y) is linear in the arguments a and b but nonlinear in the arguments x and y, since sin and cos are nonlinear.
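As a quick check (a minimal sketch using jax; the particular numbers are arbitrary), additivity and homogeneity hold exactly in (a, b) but fail in (x, y):

```python
import jax.numpy as jnp

def f(a, b, x, y):
    return a * jnp.sin(x) + b * jnp.cos(y)

x, y = 0.7, -1.3                      # fixed inputs
a1, b1, a2, b2, c = 2.0, -0.5, 1.5, 3.0, 4.0

# Linearity in (a, b): additivity and homogeneity hold exactly.
add_ok = jnp.allclose(f(a1 + a2, b1 + b2, x, y), f(a1, b1, x, y) + f(a2, b2, x, y))
hom_ok = jnp.allclose(f(c * a1, c * b1, x, y), c * f(a1, b1, x, y))
print(add_ok, hom_ok)                 # True True

# But not in (x, y): scaling the inputs does not scale the output.
print(jnp.allclose(f(a1, b1, 2 * x, 2 * y), 2 * f(a1, b1, x, y)))  # False
```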
Similarly, we have evidence that wide neural networks f(x; θ) are (almost) linear in the parameters θ, even though they are nonlinear in the input x (e.g. because of nonlinear activation functions such as ReLU). So nonlinear activation functions are not a counterargument to linearity with respect to the parameters.
If this is so, then neural networks are almost a type of kernel machine, performing linear learning in a space of features that are themselves a fixed nonlinear function of the input data.
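To make "almost linear in θ" concrete, here is a minimal sketch (assuming jax; the MLP architecture, widths, and perturbation size are illustrative choices, not taken from any particular paper): a first-order expansion f(x; θ₀) + J(θ₀)(θ − θ₀) of a wide ReLU network around its initialization closely tracks the true outputs for small parameter changes.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    # Random MLP parameters: a list of (W, b) pairs.
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (din, dout)) / jnp.sqrt(din),
                       jnp.zeros(dout)))
    return params

def mlp(params, x):
    # Nonlinear in x (ReLU activations); we study linearity in the parameters.
    h = x
    for W, b in params[:-1]:
        h = jax.nn.relu(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze()

key_x, key_init, key_delta = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(key_x, (8, 16))               # a small batch of inputs
theta0 = init_mlp(key_init, [16, 2048, 2048, 1])    # wide hidden layers

# First-order expansion around theta0: f_lin(theta) = f(theta0) + J(theta0) @ (theta - theta0).
f_at_theta0, f_lin_delta = jax.linearize(lambda p: mlp(p, x), theta0)

# A small random parameter perturbation, standing in for a short stretch of training.
delta = jax.tree_util.tree_map(
    lambda w: 1e-3 * jax.random.normal(key_delta, w.shape), theta0)
theta = jax.tree_util.tree_map(lambda w, d: w + d, theta0, delta)

exact = mlp(theta, x)                               # true network output
linear = f_at_theta0 + f_lin_delta(delta)           # output of the linearized model
print(jnp.max(jnp.abs(exact - linear)))             # small for wide layers and small delta
```

In this picture, the features of the implied kernel machine are the rows of the Jacobian J(θ₀): a fixed nonlinear function of the input, with learning acting linearly on θ.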