When you say “networks” do you mean simple multi-layer perceptrons?
Or is this a more general description which includes stuff like:
transformers, state-space models, diffusion models, recursive looping models, reservoir computing models, next-generation reservoir computing models, spiking neural nets, Kolmogorov-Arnold networks, FunSearch, etc.
Anything where you fit parametrised functions to data. So, all of these, except maybe FunSearch? I haven’t looked into what that actually does, but from a quick google it sounds more like an optimisation method than an architecture. Not sure learning theory will be very useful for thinking about that.
You can think of the learning coefficient as a sort of ‘effective parameter count’ in a generalised version of the Bayesian Information Criterion. Unlike the BIC, it’s also applicable to architectures where many parameter configurations can result in the same function, like the architectures used in deep learning.
This is why models with neural-network-style architectures can, for example, generalise beyond the training data even when they have more parameters than training data points. People used to think this made no sense, because they had BIC-based intuitions saying you’d inevitably overfit. But the BIC isn’t actually applicable to these architectures. You need the more general form, the WBIC, which has the learning coefficient λ in place of the parameter count in the penalty term.
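To make that concrete, here is a rough sketch of the two penalty terms, in my own notation (conventions vary, e.g. some sources write the BIC with an extra factor of 2; L_n is the average negative log-likelihood over n data points, d the parameter count):

```latex
% Classical BIC: the complexity penalty scales with the raw parameter count d
\mathrm{BIC} \;=\; n L_n(\hat{w}) + \frac{d}{2}\log n

% Watanabe's generalisation (WBIC): the learning coefficient \lambda takes the place of d/2
\mathrm{WBIC} \;\approx\; n L_n(w_0) + \lambda \log n

% For regular models \lambda = d/2 and the two agree; for singular models
% (including neural-network-style architectures) \lambda \le d/2, so the
% effective parameter count can be much smaller than the raw count d.
```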