I’m working on an in-depth analysis of interpretability research, largely focused on its impact as a safety research agenda. I think it would be a useful companion to your “Transparency” section in this post. I’m writing it up in this sequence of posts: Interpretability Research for the Most Important Century. (I’m glad I found your post and its “Transparency” section too, because now I can refer to it as I continue writing the sequence.)
The sequence isn’t finished yet, but a couple of the posts are done already. In particular, the second post Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios contains a substantial part of the analysis. The “Closing thoughts” section of that post gets at most of the cruxes for interpretability research as I see them so far, excerpted here:
In this post, we investigated whether interpretability has property #1 of High-leverage Alignment Research[1]. We discussed the four most important parts of AI alignment, and which of them seem to be the hardest. Then we explored interpretability’s relevance to these areas by analyzing seven specific scenarios focused on major interpretability breakthroughs that could have great impacts on the four alignment components. We also looked at interpretability’s potential relevance to deconfusion research and yet-unknown scenarios for solving alignment.
It seems clear that there are many ways interpretability will be valuable or even essential for AI alignment.[26] It is likely to be the best resource available for addressing inner alignment issues across a wide range of alignment techniques and proposals, some of which look quite promising from an outer alignment and performance competitiveness perspective.
However, it doesn’t look like it will be easy to realize the potential of interpretability research. The most promising scenarios analyzed above tend to rely on near-perfection of interpretability techniques that we have barely begun to develop. Interpretability also faces serious potential obstacles from things like distributed representations (e.g. polysemanticity), the likely-alien ontologies of advanced AIs, and the possibility that those AIs will attempt to obfuscate their own cognition. Moreover, interpretability doesn’t offer many great solutions for suboptimality alignment and training competitiveness, at least not that I could find yet.
Still, interpretability research may be one of the activities that most strongly exhibits property #1 of High-leverage Alignment Research[1]. This will become clearer if we can resolve some of the Further investigation questions above, such as developing more concrete paths to achieving the scenarios in this post and estimating the probabilities that we could achieve them. It would also help if, rather than considering interpretability just on its own terms, we could do a side-by-side comparison of interpretability with other research directions, as the Alignment Research Activities Question[5] suggests.
(Pasting in the most important/relevant footnotes referenced above:)
[1]: High-leverage Alignment Research is my term for what Karnofsky (2022)[6] defines as:
“Activity that is [1] likely to be relevant for the hardest and most important parts of the problem, while also being [2] the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”
[5]: The Alignment Research Activities Question is my term for a question posed by Karnofsky (2022)[6]. The short version is: “What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”
If any of this is confusing, please let me know; referencing the details in the post itself should also help clarify. Additionally, there are some useful sections in that post for thinking about the high-level impact of interpretability that aren’t fully expressed in the “Closing thoughts” above, for example the positive list of Reasons to think interpretability will go well with enough funding and talent and the negative list of Reasons to think interpretability won’t go far enough even with lots of funding and talent.