While in theory ERM (empirical risk minimization) will result in a model utilising spurious correlations to get lower training loss, in practice this won’t be a big issue.
Specifically, dropout layers in neural networks are a form of regularization that should eliminate spurious correlations. Will there be a slightly improved version of regularization before we reach AGI? Probably. Does this seriously affect the credibility of the Scaling Hypothesis? No.
I am on team “Objection 1.”
> Specifically, dropout layers in neural networks are a form of regularization that should eliminate spurious correlations. Will there be a slightly improved version of regularization before we reach AGI? Probably. Does this seriously affect the credibility of the Scaling Hypothesis? No.
Regularization doesn’t eliminate spurious correlations.
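To make the disagreement concrete, here is a minimal sketch (my own toy construction in PyTorch, not from the original thread; the data-generating process, the 95%/50% correlation levels, the architecture, and the hyperparameters are all invented for illustration). A small MLP is trained by plain ERM with a dropout layer on data where a spurious feature tracks the label 95% of the time during training but is uninformative at test time:

```python
# Toy sketch: ERM + dropout still latches onto a spurious shortcut feature.
# All names and numbers here are illustrative assumptions, not from the post.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n, p_spurious_agrees):
    y = torch.randint(0, 2, (n,)).float()
    sign = y * 2 - 1                                    # labels as +/-1
    core = sign * 0.5 + torch.randn(n)                  # weakly predictive, stable across train/test
    agree = (torch.rand(n) < p_spurious_agrees).float()
    spur_sign = sign * agree - sign * (1 - agree)       # matches the label, or flips it
    spurious = spur_sign * 2.0 + 0.1 * torch.randn(n)   # strongly predictive when it matches
    X = torch.stack([core, spurious], dim=1)
    return X, y

X_train, y_train = make_data(5000, p_spurious_agrees=0.95)  # shortcut present at train time
X_test,  y_test  = make_data(5000, p_spurious_agrees=0.50)  # shortcut broken at test time

model = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Dropout(p=0.5),                  # the regularizer the objection appeals to
    nn.Linear(32, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

model.train()
for _ in range(500):                    # plain ERM: minimize average training loss
    opt.zero_grad()
    loss = loss_fn(model(X_train).squeeze(1), y_train)
    loss.backward()
    opt.step()

model.eval()
with torch.no_grad():
    def acc(X, y):
        return ((model(X).squeeze(1) > 0).float() == y).float().mean().item()
    print(f"train accuracy: {acc(X_train, y_train):.2f}")  # high: the shortcut works in-distribution
    print(f"test  accuracy: {acc(X_test,  y_test):.2f}")   # typically much lower: the model leaned on the shortcut
```

Typically the model leans on the shortcut because it is the strongest predictor on the training distribution, so train accuracy stays high while test accuracy falls sharply (exact numbers depend on seeds and hyperparameters). Dropout penalizes reliance on any single hidden unit, not reliance on a feature that genuinely lowers training loss, which is the gap the reply above is pointing at.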