Qria comments on Information Loss --> Basin flatness

Qria 23 May 2022 0:11 UTC
LW: 0 AF: -1
Does this framework also explain the grokking phenomenon?
I haven’t yet fully understood your hypothesis, except that the behaviour gradient is useful for measuring something related to inductive bias, but the above paper seems to touch on a similar topic (generalization) with similar methods (experiments on fully known toy examples such as SO5).
- Vivek Hebbar 23 May 2022 5:11 UTC
  LW: 2 AF: 2
  I’m pretty sure my framework doesn’t apply to grokking. I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.
- Quintin Pope 23 May 2022 8:05 UTC
  1 point
  If you’re interested in grokking, I’d suggest my post on the topic.