It seems that any kind of learning we know how to do depends on the paths one can take to reach a solution, and that abstracting this away can give very different results from any high-dimensional optimization we actually know how to do.
This is where Mingard et al. come in. One of their main results is that SGD training on neural nets approximates just-randomly-sampling-an-optimal-point quite well. Turns out our methods are not actually very path-dependent in practice!
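For concreteness, here's a minimal toy sketch of what that comparison means (my own construction, not Mingard et al.'s actual setup; at realistic scale you can't rejection-sample zero-error networks, so the real comparison has to be done more cleverly): train one tiny net with plain full-batch gradient descent, get another by rejection-sampling random weights until the training set is fit exactly, and compare the two functions on held-out inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the label is the sign of the first input coordinate.
X_train = rng.normal(size=(8, 2))
y_train = np.sign(X_train[:, 0])
X_test = rng.normal(size=(500, 2))

def unpack(w):
    return w[:16].reshape(2, 8), w[16:]   # hidden weights (2x8), output weights (8,)

def predict(w, X):
    W1, W2 = unpack(w)
    return np.sign(np.tanh(X @ W1) @ W2)

def loss_and_grad(w):
    W1, W2 = unpack(w)
    h = np.tanh(X_train @ W1)              # hidden activations
    err = h @ W2 - y_train                 # residuals under squared loss
    dW2 = 2 * h.T @ err / len(err)
    dW1 = X_train.T @ ((2 * np.outer(err, W2) / len(err)) * (1 - h ** 2))
    return np.mean(err ** 2), np.concatenate([dW1.ravel(), dW2])

# (a) Plain full-batch gradient descent from a random init (standing in for SGD).
def gradient_descent(w, steps=3000, lr=0.05):
    for _ in range(steps):
        _, g = loss_and_grad(w)
        w = w - lr * g
    return w

# (b) Rejection-sample random parameter vectors until one fits the training set,
#     i.e. "just randomly sampling an optimal point".
def sample_until_fit():
    while True:
        w = rng.normal(size=24)
        if np.all(predict(w, X_train) == y_train):
            return w

w_gd = gradient_descent(rng.normal(size=24))
w_rs = sample_until_fit()
print("agreement on held-out inputs:",
      np.mean(predict(w_gd, X_test) == predict(w_rs, X_test)))
```

The claim being tested is that, averaged over runs, the functions found by (a) and (b) are distributed similarly; the sketch just shows the shape of the comparison, not the result.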
My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which would render them incredibly difficult to reach in practice… There is therefore a competing incentive towards minima which are easy to land on—probably flat minima surrounded by areas of relatively good performance.
There is a mismatch between your intuition and the implications of “flat minima surrounded by areas of relatively good performance”.
Remember, the whole point of the “highly compressed arrangements” is that we only need to lock in a few parameter values in order to get optimal behavior; once those few values are locked in, the rest of the parameters can mostly vary however they want without screwing stuff up. “Flat minimum surrounded by areas of relatively good performance” is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can’t-vary-without-losing-performance.
Now, your intuition is correct in the sense that info may be spread over many parameters; the relevant “ways to vary things” may not just be “adjust one param while holding others constant”. For instance, it might be more useful to look at parameter variation along local eigendirections of the Hessian. Then the claim would be something like “flat optimum = performance is flat along lots of eigendirections, therefore we can project the parameter-values onto the non-flat eigendirections and those projections are the ‘compressed info’”. (Tbc, I still don’t know what the best way is to characterize this sort of thing, but eigendirections are an obvious approximation which will probably work.)
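Here's a minimal numerical sketch of that picture (a toy quadratic loss; the dimensions, thresholds, and variable names are mine): construct a Hessian with only a few stiff directions, check that perturbations along the near-zero eigendirections barely move the loss, and read off the "compressed info" as the projection of the parameters onto the stiff eigendirections.

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 10
# Hessian with only 3 stiff directions: a rank-3 PSD matrix plus a tiny ridge,
# so 3 eigenvalues are O(1) and the other 7 are ~1e-8 (flat).
A = rng.normal(size=(dim, 3))
H = A @ A.T + 1e-8 * np.eye(dim)

theta_star = rng.normal(size=dim)          # the minimum itself

def loss(theta):
    d = theta - theta_star
    return 0.5 * d @ H @ d

eigvals, eigvecs = np.linalg.eigh(H)
flat = eigvals < 1e-3                      # split flat vs stiff eigendirections
print(flat.sum(), "flat directions,", (~flat).sum(), "stiff directions")

# Perturbing along flat eigendirections barely changes the loss...
print("flat perturbation: ", loss(theta_star + eigvecs[:, flat] @ rng.normal(size=flat.sum())))
# ...while a comparable perturbation along stiff eigendirections changes it a lot.
print("stiff perturbation:", loss(theta_star + eigvecs[:, ~flat] @ rng.normal(size=(~flat).sum())))

# The "compressed info": the projection of the parameters onto the stiff
# eigendirections only (3 numbers instead of 10).
compressed = eigvecs[:, ~flat].T @ theta_star
print("compressed representation:", compressed)
```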
Turns out our methods are not actually very path-dependent in practice!
Yeah, I get that's what Mingard et al. are trying to show, but the meaning of their empirical results isn't clear to me. I'll try to properly read the actual paper rather than the blog post before saying any more in that direction.
“Flat minimum surrounded by areas of relatively good performance” is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can’t-vary-without-losing-performance.
I get that a truly flat area is synonymous with compression—but I think being surrounded by areas of good performance is anti-correlated with compression because it indicates redundancy and less-than-maximal sensitivity.
I agree that viewing it as flat eigendirections in parameter space is the right way to think about it, but I still worry that the same concerns apply: maximal compression in this space is traded against ease of finding what would be a flat plain along many dimensions but a maximally steep ravine along all the other directions. I can imagine this could be investigated with some small experiments (or such experiments may well already exist). I can't promise I'll follow up, but if anyone is interested, let me know.
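For what it's worth, one possible shape for that kind of small experiment (entirely a toy of my own, not an existing study): a 2-d loss with two global minima, one sitting in a wide, nearly flat basin and one at the bottom of a steep, narrow ravine, and a count of how often plain gradient descent from random initializations lands in each. A 2-d cartoon obviously can't settle the high-dimensional trade-off, but the same idea scales up to counting near-zero Hessian eigenvalues at whatever minima gradient descent actually reaches.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(p):
    x, y = p
    # Minimum A at (-2, 0): wide, nearly flat basin.
    # Minimum B at (2, 0): steep, narrow ravine.
    basin_a = 0.05 * ((x + 2) ** 2 + y ** 2)
    basin_b = 5.0 * ((x - 2) ** 2 + y ** 2)
    return min(basin_a, basin_b)

def grad(p, eps=1e-5):
    # Central-difference gradient keeps the sketch short.
    g = np.zeros(2)
    for i in range(2):
        d = np.zeros(2); d[i] = eps
        g[i] = (loss(p + d) - loss(p - d)) / (2 * eps)
    return g

def descend(p, steps=500, lr=0.05):
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

hits = {"flat": 0, "ravine": 0}
for _ in range(200):
    p = descend(rng.uniform(-6, 6, size=2))
    hits["flat" if p[0] < 0 else "ravine"] += 1
print(hits)   # expect the wide flat basin to win by a large margin
```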