My guess would be that they’re not saying that well-designed control evaluations become untrustworthy.
It’s a bit messy because we have some ability to check whether we should be able to evaluate things.
So, there are really three relevant “failure” states for well-done control:
1. We can’t find countermeasures such that our control evaluations indicate any real safety.
2. We no longer think that control evaluations work to evaluate safety, because models greatly exceed our planning capabilities and/or can sandbag sufficiently well.
3. We think control evaluations can work well, but we’re wrong and they actually don’t.
I think (1) or (2) will likely happen prior to (3) if you do a good job.
We discuss this more here.
(Really, the actual system is more complex because we can vary how conservative control evaluations are, and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs’ planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)
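To make that conservativeness tradeoff concrete, here is a minimal toy sketch (not from the original discussion; the function form, parameter names, and numbers are all made up for illustration). It assumes only that the safety level a control evaluation certifies falls as the planning ability we grant the untrusted model rises, so evaluating at the top of our uncertainty range gives a lower but easier-to-trust bound.

```python
# Toy illustration (hypothetical names and numbers, not from the original thread):
# a control evaluation certifies a safety level that depends on how much planning
# ability we conservatively grant the untrusted model / red team.

def certified_safety(assumed_planning_ability: float) -> float:
    """Made-up monotone tradeoff: granting the model more planning ability
    in the evaluation yields a lower (but more trustworthy) safety bound."""
    return max(0.0, 1.0 - 0.15 * assumed_planning_ability)

# Our best guess of the model's planning ability, plus an uncertainty margin.
best_guess = 3.0
margin = 2.0

# A non-conservative evaluation uses the best guess; a conservative one assumes
# the model plans as well as the top of our uncertainty range.
optimistic_bound = certified_safety(best_guess)             # higher, but easier to be wrong about
conservative_bound = certified_safety(best_guess + margin)  # lower, but easier to be confident in

print(f"safety bound at best-guess planning ability:   {optimistic_bound:.2f}")
print(f"safety bound at conservative planning ability: {conservative_bound:.2f}")
```

The point is just the shape of the tradeoff: the conservative bound certifies less safety, but trusting it requires fewer assumptions about the model, so confidence and certified safety trade off continuously rather than in the discrete way the list above suggests.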
I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above. Another relevant “failure” state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.
@ryan_greenblatt can you say more about what you mean by this one?
“We can’t find countermeasures such that our control evaluations indicate any real safety.”
Our evaluations indicate that we have very poor levels of safety (e.g. the AI would probably be able to escape if it wanted to) and we can’t find countermeasures which suffice to ensure any level of safety (without basically giving up on using this model).