This is a great reference for the importance of, and excitement around, Interpretability.
I just read this for the first time today. I’m currently learning about Interpretability in hopes I can participate, and this post solidified my understanding of how Interpretability might help.
The whole field of Interpretability is a test of this post. Some of the theories of change won’t pan out. Hopefully many will. Perhaps theories of change not listed here will be discovered.
One idea I’m surprised wasn’t mentioned is the potential for Interpretability to supercharge all of the sciences by allowing humans to extract the things that machine learning models discovered in order to make their predictions. I remember Chris Olah being excited about this possibility on the 80k Podcast, and that excitement meme has spread to me. Current AIs know so much about how the world works, but we can only use that knowledge indirectly, through their black-box interface. I want that knowledge for myself and for humanity! This is another incentive for Interpretability, and although it isn’t a development that clearly leads to “AI less likely to kill us”, it will make humanity wiser, more prosperous, and on more even footing with the AIs.
Nanda’s post probably deserves a spot in a compilation of Alignment plans.
Thanks for the kind words! I’d class “interp supercharging other sciences” under:
Microscope AI: Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves
This might just be semantics though

I’d call that “underselling it”! Your description of Microscope AI may be accurate, but I didn’t realize you meant “supercharging science”, even though I was looking for it in the list!