In my opinion, much of the value of interpretability lies not in AI alignment but in AI capabilities evaluation.
For example, the Othello paper shows that a transformer trained on next-token prediction of Othello moves learns a world model of the board rather than just surface statistics of the training sequences. This finding is useful because it suggests that transformer language models are more capable than they might initially seem.
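To make "learns a world model" concrete: that line of work tests the claim by training probes on the model's hidden activations to predict the board state, with follow-up work suggesting the state is even linearly decodable under the right encoding. Below is a minimal, illustrative sketch of such a linear probe. The activation and label arrays are random placeholders standing in for real cached hidden states, not data or code from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: suppose we have cached the transformer's hidden
# activations at some layer for N Othello positions, plus the ground-truth
# state of one board square for each position. Random data stands in for
# the real activations here, so this probe will score near chance; on real
# activations, high probe accuracy is the evidence for a board-state
# representation.
rng = np.random.default_rng(0)
N, d_model = 2000, 512
activations = rng.normal(size=(N, d_model))   # [N, d_model] cached hidden states
square_state = rng.integers(0, 3, size=N)     # 0 = empty, 1 = black, 2 = white

X_train, X_test, y_train, y_test = train_test_split(
    activations, square_state, test_size=0.2, random_state=0
)

# Linear probe: if a simple classifier can read the square's state out of the
# hidden state, the model plausibly encodes that part of the board internally.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```

In practice one probe is trained per board square, and the probe's simplicity matters: the weaker the probe that succeeds, the stronger the case that the information is explicitly represented rather than merely recoverable.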