I was previously pretty dubious about interpretability results leading to capabilities advances. I've only really seen two papers which did this for LMs, and they came from the same lab in the past few months.
It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance.[1]
But in a conversation with Oliver Habryka and others, it was brought up that as AI models get larger and more expensive, this kind of tinkering will become more difficult and costly. This might push researchers to look for additional sources of capabilities insights, and one of the obvious places to find such insights might be interpretability research.
I’d be interested to hear if this isn’t actually how modern ML advances work.