If it fails only once we are well past AlphaZero-level systems, or even just past more moderate superhuman AI research, that is good, as it means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, that is also good, because it forces AI firms to invest in alignment.
That assumes AI firms learn the needed lessons from the failures. Our experience shows that they don’t: they keep building systems that are predictably unsafe and exploitable, and they have no serious plans to change their deployments, much less to actually build a safety-oriented culture.
First, strongly agreed on the central point. I think that as a community, we’ve been investing too heavily in the tractable approaches (interpretability, testing, etc.) without putting the broader alignment issues front and center. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.
That said, I am concerned about what happens if interpretability is wildly successful, against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it gets past the issues you note under “miss things,” “measuring progress,” and “scalability,” partly for the reasons you discuss under obfuscation and reliability. Wildly successful and scalable interpretability, without solving the other parts of alignment, would very plausibly still leave us with a dangerously misaligned system, and the detection methods themselves arguably exacerbate the problem. I outlined my concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)