Physics of Language Models (part 2.1)

Link post

This is perhaps the best interpretability work I’ve seen outside of Chris Olah’s team.