SLT and phase transitions
The morphogenetic SLT story says that during training the Bayesian posterior concentrates around a series of subspaces $W_0^{(1)} \rightsquigarrow \dots \rightsquigarrow W_0^{(n)}$ with RLCTs $\lambda_1 < \dots < \lambda_n$ and losses $L_1 = L(w_1), \dots, L_n = L(w_n)$, where $w_i \in W_0^{(i)}$. As the size $N$ of the data sample is scaled, the Bayesian posterior makes transitions $W_0^{(i)} \rightsquigarrow W_0^{(i+1)}$, trading off higher complexity (higher $\lambda_{i+1} > \lambda_i$) for better accuracy (lower loss $L_{i+1} < L_i$).
This is the radical new framework of SLT: phase transitions happen in pure Bayesian learning as the data size is scaled.
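One way to make the trade-off concrete is through the standard free-energy asymptotics of SLT: to leading order, the free energy of a phase is $F_N(W_0^{(i)}) \approx N L_i + \lambda_i \log N$, and the posterior concentrates on whichever phase has the lowest free energy. Below is a minimal numerical sketch of the resulting sequence of transitions; the losses and RLCTs are made-up illustrative numbers, not taken from any real model.

```python
import math

# Hypothetical phases: (name, loss L_i, local RLCT lambda_i).
# Simpler phases have lower RLCT but higher loss.
phases = [
    ("W0^(1)", 1.0, 1.0),   # simple but inaccurate
    ("W0^(2)", 0.5, 20.0),  # more complex, more accurate
    ("W0^(3)", 0.4, 60.0),  # most complex, most accurate
]

def free_energy(N, loss, rlct):
    """Leading-order asymptotic free energy of a phase at sample size N."""
    return N * loss + rlct * math.log(N)

current = None
for N in range(10, 10_001):
    # The posterior concentrates on the phase with the lowest free energy.
    best = min(phases, key=lambda p: free_energy(N, p[1], p[2]))[0]
    if best != current:
        print(f"N = {N:>5}: posterior concentrates on {best}")
        current = best
```

Over the scanned range this prints $W_0^{(1)}$, then $W_0^{(2)}$, then $W_0^{(3)}$: at small $N$ the $\lambda_i \log N$ penalty keeps the posterior on the simplest phase, and as $N$ grows the $N L_i$ term takes over and the posterior migrates to lower-loss, higher-RLCT phases.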
N.B. The phase transition story actually requires a version of SLT for the nonrealizable case, even though most sources focus solely on the realizable case! The nonrealizable case makes everything more complicated, and the formulas from the realizable case have to be modified.
We think of the local RLCT $\lambda_w$ at a parameter $w$ as a measure of its inherent complexity. Side-stepping the subtleties of this point of view, let us take a look at Watanabe's formula for the Bayesian generalization error:
$$G_N(W) = L_N(w_0) + \frac{\lambda}{N} + o\!\left(\frac{1}{N}\right) \approx L(w_0) + \frac{\lambda}{N} + o\!\left(\frac{1}{N}\right)$$
where $W$ is a neighborhood of the local minimum $w_0$ and $\lambda$ is its local RLCT. In our case $W = W_0^{(i)}$.
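As a quick sanity check on how this formula encodes the accuracy/complexity trade-off (a sketch using only the leading terms and ignoring the $o(1/N)$ error): phase $i+1$ achieves a lower generalization error than phase $i$ precisely when

$$L_{i+1} + \frac{\lambda_{i+1}}{N} < L_i + \frac{\lambda_i}{N} \iff N > \frac{\lambda_{i+1} - \lambda_i}{L_i - L_{i+1}},$$

so paying for the extra complexity $\lambda_{i+1} - \lambda_i$ only becomes worthwhile once the sample size is large relative to the improvement in loss $L_i - L_{i+1}$.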