It depends somewhat on what you mean by 'near-term interpretability'. If you apply that term to research into, for example, improving the stability of, and our ability to access, the 'inner world models' held by large opaque language models like GPT-3, then there's a strong argument that ML-based 'interpretability' research might be one of the best ways of directly working on alignment research. For example:
Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.
Ajeya Cotra: Thanks! I’m also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I’m not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don’t feel scalable to situations when we don’t get the concepts the model is using, and I’m very interested in finding slogans to keep researchers focused on the superhuman stuff.
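To make the 'accessing inner world models' idea above a bit more concrete, here is a minimal sketch (an illustrative example, not something taken from the discussion quoted above) of one common interpretability workflow: pull hidden activations out of a small open language model (GPT-2 standing in for something like GPT-3) and fit a linear probe to test whether a simple property can be read off those activations. The texts, labels, and choice of layer are placeholder assumptions.

```python
# Minimal linear-probe sketch: can a simple property (toy sentiment here) be
# read off a language model's internal activations? GPT-2 stands in for a
# larger opaque model; the labelled examples are illustrative placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

texts = ["The movie was wonderful", "The movie was terrible",
         "I loved every minute", "I hated every minute"]
labels = [1, 0, 1, 0]  # toy positive/negative property to probe for

features = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Activation of the final token at a middle layer; which layer and
        # token carry the information is itself an empirical question.
        features.append(outputs.hidden_states[6][0, -1].numpy())

# If a simple linear classifier can recover the property from the activations,
# the model plausibly represents that property somewhere internally.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe accuracy on its own training data:", probe.score(features, labels))
```

Real interpretability work goes well beyond this (held-out probing data, causal interventions, scaling to much larger models), but this is the flavour of 'reading the model's internals' being pointed at.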
More broadly, there are several reasons why working on aligning systems ahead of time is valuable:
1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.
2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback (see the sketch after this list).
3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.
4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.
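Pro 2) mentions learning from human feedback. As an illustration of the kind of practical infrastructure that involves, here is a minimal sketch (my own illustrative example under simplifying assumptions, not code from any of the linked posts) of the reward-modelling step: fit a model to pairwise human preference comparisons so that preferred outputs score higher than rejected ones. The feature vectors and 'preferences' are synthetic placeholders; a real pipeline would score actual model outputs with a transformer.

```python
# Minimal reward-model sketch for learning from human feedback: fit a model to
# pairwise preference judgements so that preferred responses get higher reward.
# A tiny MLP over synthetic feature vectors stands in for a transformer over text.
import torch
import torch.nn as nn

torch.manual_seed(0)

feature_dim = 16
num_pairs = 256
# Placeholder data: in practice, each pair would be the representations of two
# model outputs that a human labeller compared.
preferred = torch.randn(num_pairs, feature_dim) + 0.5
rejected = torch.randn(num_pairs, feature_dim) - 0.5

reward_model = nn.Sequential(
    nn.Linear(feature_dim, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(200):
    r_pref = reward_model(preferred).squeeze(-1)
    r_rej = reward_model(rejected).squeeze(-1)
    # Bradley-Terry style loss: maximise the probability that the human-preferred
    # response receives the higher reward.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    print(f"final loss: {loss.item():.3f}")
    print("mean reward, preferred vs rejected:",
          reward_model(preferred).mean().item(),
          reward_model(rejected).mean().item())
```

The learned reward model is then what a reinforcement-learning step (or best-of-n sampling) would optimise against; the practical know-how mentioned in pro 2) includes making this kind of loop work reliably with real human labellers and real models.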
For more on the case for practicing alignment work on current models like GPT-3, see:
https://www.alignmentforum.org/posts/29QmG4bQDFtAzSmpv/an-141-the-case-for-practicing-alignment-work-on-gpt-3-and
And see this discussion for more:
https://www.lesswrong.com/posts/AyfDnnAdjG7HHeD3d/miri-comments-on-cotra-s-case-for-aligning-narrowly
So language model transparency/interpretability tools might be useful on the basis of pro 2), and to some extent pro 1), above, because they will help build tools for interpreting TAI systems and also help align them ahead of time.