successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress
Give one example of a substantial state-of-the-art advance that was decisively influenced by transparency; I ask since you said “known to be.” Saying that it’s conceivable isn’t evidence that they’re actually highly entangled in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.
In the DL paradigm you can’t easily separate capabilities and alignment
This is true for conceptual analysis. Empirically, they can be separated by measurement. Record general capabilities metrics (e.g., general downstream accuracy) and record safety metrics (e.g., trojan detection performance); then see whether an intervention improves a safety goal and whether it also improves general capabilities. For various safety research areas there aren’t externalities. (More discussion on this topic here.)
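To make that bookkeeping concrete, here is a minimal sketch of what “separate by measurement” can look like in practice. Everything here is a hypothetical placeholder (the `EvalResult` type, `externality_report`, and the accuracy/AUROC numbers), not a real benchmark API or results from any particular paper; the point is only that you score an intervention on a capabilities axis and a safety axis independently and then look at which axis moved.

```python
# Minimal sketch of separating capabilities and safety by measurement.
# All names and numbers are hypothetical placeholders for illustration.

from dataclasses import dataclass


@dataclass
class EvalResult:
    capabilities: float  # e.g., general downstream accuracy
    safety: float        # e.g., trojan detection AUROC


def externality_report(baseline: EvalResult, intervention: EvalResult,
                       eps: float = 0.005) -> str:
    """Compare an intervention against a baseline on both axes.

    An intervention counts as having (near-)zero capabilities externalities
    if it moves the safety metric up while leaving the capabilities metric
    roughly unchanged (within eps).
    """
    d_cap = intervention.capabilities - baseline.capabilities
    d_safe = intervention.safety - baseline.safety
    if d_safe > eps and abs(d_cap) <= eps:
        verdict = "improves safety with no meaningful capabilities externality"
    elif d_cap > eps:
        verdict = "capabilities externality present"
    else:
        verdict = "no clear safety gain"
    return f"Δcapabilities={d_cap:+.3f}, Δsafety={d_safe:+.3f} → {verdict}"


if __name__ == "__main__":
    # Hypothetical numbers purely for illustration.
    baseline = EvalResult(capabilities=0.712, safety=0.640)
    with_defense = EvalResult(capabilities=0.710, safety=0.810)
    print(externality_report(baseline, with_defense))
```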
forcing that separation seems to constrain us
I think the poor epistemics on this topic have encouraged risk-taking, reduced the pressure to find clear safety goals, and allowed researchers to get away with “trust me, I’m making the right utility calculations and have the right empirical intuitions,” which is a very unreliable standard of evidence in deep learning.
The probably-canonical example at the moment is Hyena Hierarchy, which cites a bunch of interpretability research, including Anthropic’s work on Induction Heads. If HH actually delivers what it promises in the paper, it might enable way longer context.
I don’t think you even need to cite that, though. If interpretability wants to be useful someday, I think it ultimately has to be aimed at helping steer and build more reliable DL systems. Like, that’s the whole point, right? Steer a reliable ASI.