This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment in which the expected scientific benefit of an interpretability tool is weighed against the expected cost it incurs by enabling AGI escape risk. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of the security setup, while the expected scientific benefit can be increased by prioritizing the informational efficiency of the interpretability tool.
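As a rough formalization of that tradeoff in my own notation here (the post's actual model may be set up differently), the decision of whether to run the tool comes down to something like:

E[net value of tool] = B(e) - p(c, s) * L

where e is the informational efficiency of the tool and B is increasing in e; c is the AGI's capability level and s is the quality of the security setup, with p (the probability that running the tool enables an escape) increasing in c and decreasing in s; and L is the loss incurred if an escape occurs. Under this framing, the tool is worth using only when the first term exceeds the second, and the levers mentioned above correspond to raising B(e) or shrinking p(c, s).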
Conditional on an organization being dead set on building a superintelligent AGI (which I would strongly oppose, but may be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence once the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.