If the AGI is substantially smarter than the interpretability tools, then it will probably have an easier time outmaneuvering them than it would outmaneuvering humans.
Close calls, e.g. catching an AGI just before it's too late, are possible. But that relies on luck, and at some point you'll need some really, really good tools anyway, such as tools that are smarter than the AGI (while somehow not being a significantly bigger threat themselves).
Why wouldn’t people (and maybe even AIs, at least up to a point) be applying these ever-advancing AI capabilities to developing better and better interpretability tools as well? I.e., what reason is there to expect an “interpretability gap” to develop (unless you believe interpretability is a fundamentally unsolvable problem, in which case no amount of AI power is going to help)?