Ten months later, which papers would you recommend for SOTA explanations of how generalisation works?
From my quick research:
- “Explaining grokking through circuit efficiency” seems great at explaining and describing grokking
- “Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task” proposes a plausible unified view of grokking and double descent (and a guess at a link with emergent capabilities and multi-task training). I especially like their summary plot: