One might expect y = f(x, θ0 + Δθ) to be a more expressive model than its linear approximation f(x, θ0) + ∇_θ f(x, θ0)·Δθ, but the parameters of very large neural nets appear to change only by a small amount during training, which means the overall Δθ found by training is nearly a solution to the linearly-approximated equations.
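To make the claim concrete, here's a minimal sketch (the toy tanh MLP, random data, and the hand-picked perturbation standing in for the trained Δθ are all my own assumptions, not anything from the original discussion) that compares the full output f(x, θ0 + Δθ) against the first-order linearization f(x, θ0) + ∇_θ f(x, θ0)·Δθ, with the Jacobian-vector product computed via jax.jvp. The lazy-training/NTK picture is just the claim that the Δθ actually found by training is small enough that these two quantities stay close.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    # Simple tanh MLP with 1/sqrt(fan_in) init; widths here are arbitrary.
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, wkey = jax.random.split(key)
        params.append((jax.random.normal(wkey, (din, dout)) / jnp.sqrt(din),
                       jnp.zeros(dout)))
    return params

def mlp(params, x):
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

key = jax.random.PRNGKey(0)
theta0 = init_mlp(key, [16, 512, 512, 1])
x = jax.random.normal(jax.random.PRNGKey(1), (32, 16))

# A small, hypothetical parameter perturbation standing in for the Δθ
# that training would actually find.
dtheta = jax.tree_util.tree_map(lambda p: 1e-2 * jnp.ones_like(p), theta0)

# Exact output at the perturbed parameters: f(x, θ0 + Δθ).
theta1 = jax.tree_util.tree_map(lambda p, d: p + d, theta0, dtheta)
full = mlp(theta1, x)

# Linearized output: f(x, θ0) + ∇_θ f(x, θ0)·Δθ, via a Jacobian-vector product.
f0, jvp_out = jax.jvp(lambda p: mlp(p, x), (theta0,), (dtheta,))
linear = f0 + jvp_out

print("max |full - linear|:", float(jnp.max(jnp.abs(full - linear))))
```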
Note that this has changed over time, as network architectures change; I doubt that it applies to e.g. the latest LLMs. The point about pruning doing a whole bunch of optimization still applies independently of whether net training is linear-ish (though I don't know if anyone's reproduced the lottery-ticket-hypothesis-driven pruning experiments on the past couple years' worth of LLMs).
A bit of a side note, but I don't think you even need to appeal to new architectures: it looks like the NTK approximation performs substantially worse than ordinary (nonlinear) training even with plain MLPs (see this paper, among others).