It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.
Here’s one (somewhat handwavy) reason for optimism w.r.t. automated AI safety research: most safety research has probably come from outside the big labs (see e.g. https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/) and thus has likely mostly used significantly sub-SOTA models. It seems quite plausible then that we could have the vast majority of (controlled) automated AI safety research done on much smaller and less dangerous (e.g. trusted) models only, without this leading to intolerably-large losses in productivity; and perhaps have humans only/strongly in the loop when applying the results of that research to SOTA, potentially untrusted, models.
Here’s one (somewhat handwavy) reason for optimism w.r.t. automated AI safety research: most safety research has probably come from outside the big labs (see e.g. https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/) and thus has likely mostly used significantly sub-SOTA models. It seems quite plausible then that we could have the vast majority of (controlled) automated AI safety research done on much smaller and less dangerous (e.g. trusted) models only, without this leading to intolerably-large losses in productivity; and perhaps have humans only/strongly in the loop when applying the results of that research to SOTA, potentially untrusted, models.