Someone with better SLT knowledge might want to correct this, but more specifically:
Studying the “volume scaling” of near-min-loss parameters, as beren does here, is really core to SLT. The rate of change of this volume as you vary the loss tolerance ε is called the “density of states” (DOS) function, and much of SLT basically boils down to an asymptotic analysis of this function. SLT also relates the terms in the asymptotic expansion to things you care about, like generalization performance.
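To make the volume-scaling idea concrete, here is a minimal Monte Carlo sketch (my own toy example, not from beren's post); the toy loss, box size, sample counts, and tolerance grid are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_loss(w):
    # Toy "singular" loss: the minimum set {w0 * w1 = 0} is a pair of lines,
    # not a point, so the Hessian at most minima is degenerate.
    return (w[..., 0] * w[..., 1]) ** 2

def volume_fraction(eps, n_samples=200_000, box=1.0):
    # Monte Carlo estimate of the fraction of the [-box, box]^2 cube
    # whose loss is within eps of the minimum (here, min loss = 0).
    w = rng.uniform(-box, box, size=(n_samples, 2))
    return np.mean(toy_loss(w) <= eps)

# Volume scaling: fit log V(eps) ~ slope * log(eps) over a range of tolerances.
eps_grid = np.logspace(-6, -2, 9)
vols = np.array([volume_fraction(e) for e in eps_grid])
slope, _ = np.polyfit(np.log(eps_grid), np.log(vols), 1)
print(f"fitted volume-scaling exponent: {slope:.3f}")
# For this toy loss the true scaling is V(eps) ~ eps^(1/2) * log(1/eps)
# (RLCT 1/2, multiplicity 2), so the fitted slope lands a bit below 1/2,
# and well below the d/2 = 1 that a regular 2-parameter model would give.
```

The point of the sketch is just that the small-ε exponent of this volume is exactly the kind of quantity SLT's asymptotics are about, and that it can differ from the naive d/2 when the minimum is degenerate.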
You might wonder why SLT needs so much heavy machinery, since this sounds so simple. The answer is basically that SLT can handle the case where some eigenvalues of the Hessian at the minimum are zero, which is exactly where the usual quadratic (Laplace) approximation breaks down. This matters in practice, since IIRC real models often have around 90% zero eigenvalues in their Hessian. It also leads to substantially different theory; for instance, the “effective number of parameters” (the RLCT, or real log canonical threshold) can vary depending on the dataset.
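As a toy illustration of the degenerate-Hessian point (not a transformer-scale measurement; the architecture, data, training budget, and near-zero threshold are all made up for this sketch), one can train a small over-parameterized network and look at how much of its Hessian spectrum is near zero:

```python
import torch

torch.manual_seed(0)

# Tiny regression setup with more parameters (31) than data points (20),
# so we expect flat directions (near-zero Hessian eigenvalues) at a minimum.
x = torch.randn(20, 1)
y = torch.sin(3 * x)

n_hidden = 10
n_params = n_hidden + n_hidden + n_hidden + 1  # W1, b1, W2, b2

def loss_fn(flat_w):
    # Unpack a flat parameter vector into a 1-10-1 tanh network, return MSE.
    i = 0
    W1 = flat_w[i:i + n_hidden].view(n_hidden, 1); i += n_hidden
    b1 = flat_w[i:i + n_hidden]; i += n_hidden
    W2 = flat_w[i:i + n_hidden].view(1, n_hidden); i += n_hidden
    b2 = flat_w[i:i + 1]
    pred = torch.tanh(x @ W1.T + b1) @ W2.T + b2
    return ((pred - y) ** 2).mean()

# Train to (near) a minimum with plain Adam.
w = (0.5 * torch.randn(n_params)).requires_grad_()
opt = torch.optim.Adam([w], lr=1e-2)
for _ in range(5000):
    opt.zero_grad()
    loss_fn(w).backward()
    opt.step()

# Hessian eigenvalue spectrum at the trained point.
H = torch.autograd.functional.hessian(loss_fn, w.detach())
eigs = torch.linalg.eigvalsh(H)
near_zero = (eigs.abs() < 1e-5 * eigs.abs().max()).float().mean()
print(f"fraction of near-zero Hessian eigenvalues: {near_zero:.2f}")
# The exact fraction depends on the threshold and the run; the point is just
# that a non-trivial chunk of the spectrum is (near) zero at typical minima.
```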
Looks like I really need to study some SLT! I will say, though, that I haven’t seen many cases in transformer language models where 90% of the Hessian eigenvalues are zero; that seems extremely high.