“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.” -SwiGLU paper.
I think it varies. A few of these are people trying “random” things, but mostly they are educated guesses that are then validated empirically. Often there is a specific problem we want to solve, e.g. exploding gradients or O(n^2) attention, and then authors try things which may or may not solve or mitigate the problem.