Ok. So I haven’t thought through these proposals in much detail, and I don’t claim any confident take, but my first response is “holy fuck, that’s a lot of complexity. It really seems like there will be some flaw in our control scheme that we don’t notice, if we’re stacking a bunch of clever ideas like this one on top of each other.”
This is not at all to be taken as a disparagement of the authors. I salute them for their contribution. We should definitely explore ideas like these, and test them, and use the best ideas we have at AGI time.
But my intuitive first order response is “fuck.”
To the extent that complexity makes successfully red-teaming our approaches too difficult, we could try to mostly depend on simple approaches and only use more complex approaches if we are confident that, at the very least, they can't make the situation worse.
My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or very robust theoretical feedback, e.g. proofs). In the case of control, so long as we do a good job with red teaming[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security, and I think computer security often involves complex measures[2].
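To make the kind of feedback loop I have in mind a bit more concrete, here's a minimal sketch of a control evaluation: a red team supplies attack policies, a candidate protocol is run against them, and measured safety/usefulness is the signal you iterate on. Everything here (the `ControlProtocol` and `AttackPolicy` types, `evaluate_protocol`, `iterate_on_protocol`, the safety floor) is an illustrative stand-in rather than any real framework or the authors' actual setup:

```python
# Hypothetical sketch of a control-evaluation feedback loop. A red team supplies
# attack policies, the blue team supplies a control protocol, and protocols are
# selected using measured safety/usefulness as the feedback signal. All names
# are illustrative stand-ins, not an existing API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EpisodeResult:
    task_solved: bool   # did the protocol still get useful work done?
    catastrophe: bool   # did the attack policy slip a bad action through?

# In this toy model, an attack policy maps a task to a (possibly malicious)
# action, and a control protocol maps a task plus an attack policy to an outcome.
AttackPolicy = Callable[[str], str]
ControlProtocol = Callable[[str, AttackPolicy], EpisodeResult]

def evaluate_protocol(protocol: ControlProtocol,
                      attack_policies: List[AttackPolicy],
                      tasks: List[str]) -> dict:
    """Estimate safety and usefulness of a protocol against a red-team suite."""
    episodes = [protocol(task, attack)
                for attack in attack_policies
                for task in tasks]
    n = len(episodes)
    return {
        "safety": 1 - sum(e.catastrophe for e in episodes) / n,
        "usefulness": sum(e.task_solved for e in episodes) / n,
    }

def iterate_on_protocol(candidates: List[ControlProtocol],
                        attack_policies: List[AttackPolicy],
                        tasks: List[str],
                        safety_floor: float = 0.99) -> ControlProtocol:
    """Pick the most useful candidate that clears the safety floor.

    This selection step is where added complexity has to pay its way: each
    extra mechanism in a candidate protocol must survive the red team before
    it gets adopted.
    """
    scored = [(evaluate_protocol(p, attack_policies, tasks), p) for p in candidates]
    safe = [(s, p) for s, p in scored if s["safety"] >= safety_floor]
    if not safe:
        raise ValueError("no candidate protocol met the safety floor")
    return max(safe, key=lambda sp: sp[0]["usefulness"])[1]
```

The point of the sketch is just that the loop only gives trustworthy feedback on complex protocols to the extent the red team's attack policies are actually representative of what a scheming model could do, which is the caveat in the footnote.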
As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of the models we're worried about, and thus it will be hard to effectively iterate on complex and unprincipled measures.
An alternative approach for empirics is to focus on highly principled measures where we have a story, which we can test and iterate against, for why they should generalize to scheming (e.g. "we'll understand how the model works", which we can test by seeing whether we can understand models in general).
Which isn’t clearly possible given that the relevant AIs could have very different capability profiles from humans. I think this will be fine in practice, but this is non-obvious.
I don’t think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me.