Thank you so much, Marius, for writing this pertinent post! The question of whether a given interpretability tool will help us or hurt us in expectation is an extremely important one.
The answer, however, differs from situation to situation. The scientific benefit of an interpretability tool (more generally, of any information channel) is difficult to estimate a priori, but is likely tied to its informational efficiency. Roughly speaking, how much informational value can the interpretability tool/information channel yield per unit of bitrate?
The costs of an interpretability tool include the risk of capabilities gains (if AI capabilities orgs know about it). They also include the risk that a superintelligent AGI can use the interpretability tool/information channel for its own purposes, such as by hacking or socially engineering the alignment researchers (i.e., probably to escape into the Internet and then power-seek).
This perspective suggests that an interpretability tool/information channel will be disadvantageous for the defender if (1) the informational efficiency of the channel is low, (2) the interpreted AI has high capabilities (or the interpretability tool helps capabilities organizations increase the model’s capabilities), and (3) the AI’s box setup has weak cybersecurity.
Keeping interpretability research off the Internet (such as via an encrypted or airgapped library that coordinates between AI researchers) is low-hanging fruit that would make dual-use interpretability tools more likely to be helpful to humanity rather than counterproductive.
To put it bluntly, once an interpretability tool (the “battle plan”) is posted online, we should assume in the worst-case scenario that the battle plan has just been given to the superintelligent AGI. First, the AGI will be misaligned in the worst-case scenario, and this scenario may in fact occur with high probability. Second, the likely misaligned AGI will probably be trained on the whole Internet, and thus will already know how the tool works, what we expect to see from it, and how to manipulate the tool for the AGI’s own purposes.
As for the six proposed arguments why “interpretability has a defenders advantage,” I am uncertain that arguments 1-5 will hold true for the interpretability tools that will be ready by the time AGI emerges. I agree with argument 6, but in a war against misalignment, I wouldn’t unconditionally assume that any battle plan which cannot guarantee a complete victory is a bad plan. In fact, a plan that tries to guarantee a complete victory but makes the outcome worse in expectation is probably a bad plan. In contrast, a plan that cannot guarantee a complete victory but makes the outcome better in expectation is probably a good plan.
Personally, I am most optimistic about interpretability tools of high informational efficiency that have not yet been posted on the Internet. We should try really hard to develop and use high-quality interpretability tools. However, using a high quantity of low-quality (or publicly posted) interpretability tools may actually decrease the odds of human survival.
The audience of this post has signalled mixed responses to your comment, and no one has replied to you. I’m confused as to why, because your basic argument makes sense to me, so here’s an attempt to understand the situation.
The core thesis of Marius’ argument, it seems, is that, given a marginal increase in interpretability research, the marginal cost of aligning an AI model is less than that of increasing SOTA AI model capabilities. He refers to biorisk research arguments to imply that a similar situation arises in alignment research.
You claim, however, that this isn’t true broadly speaking, since what actually matters is the amount of information we get from an interpretability tool per bit of information transferred.
Marius’ threat model is alignment research also increasing capabilities and therefore shortening timelines. Your threat model seems to be that the uninhibited use of interpretability tools results in AI researchers (and by extension, the world) being taken over by a sufficiently capable AI.
If this is the case, then it seems that the two of you are talking past each other, and the readers’ responses (or the lack thereof) make sense.