(Sorry for the long delay here!) The post articulates a number of specific ways in which some AIs can help to supervise others (e.g., patching security holes, generating inputs for adversarial training, finding scary inputs/training processes for threat assessment), and these don’t seem to rely on the idea that an AI can automatically fully understand the internals/arguments/motivations/situation of a sufficiently close-in-capabilities other AI. The claim is not that a single supervisory arrangement of that type wipes out all risks, but that enough investment in AI checks and balances can significantly reduce them.
I also think it’s possible that interpretability research is going to make it easier over time to interpret an AI’s internals—the growth in “difficulty of interpretation” with model size/capability won’t necessarily outpace the growth in tools for interpretability. (I think this could go either way.)