I do think that Apollo themselves were clear about what this showed: the model has the mental wherewithal for deception, and if you apply absolutely no mitigations, then deception happens. That’s what I said in my recent discussion of what this does and doesn’t show.
Therefore I described the 4o case as an engineered toy model of a failure at level 4–5 on my alignment difficulty scale (i.e. it exhibits the dynamics of strategically faking performance on tests to pursue a large-scale goal), but it is not an actual example of such a failure.
In contrast, the AI scientist case was a genuine alignment failure, but a much simpler one: non-deceptive, non-strategic reward hacking on a sloppy goal produced by bad RLHF, just in a more sophisticated system than, say, CoinRun (level 2–3).
The hidden part that Zvi etc. skim over is the assumption that ‘of course’, in real life and in the near future, we’ll be in a situation where an o1-like model has instrumental incentives because it is pursuing an adversarial large-scale goal, and that the mitigations they could have applied (like prompting it better, doing better RLHF, doing process oversight on the chain of thought, etc.) won’t work. But that’s the entire contentious part of the argument!
One can make arguments that these oversight methods will break down, e.g. when the system becomes generally superhuman at predicting what feedback its overseers will provide. However, those arguments were theoretical when they were made years ago, and they remain theoretical now.
This does count against naive views that assume alignment failures can’t possibly happen: there probably are those out there who believe that you have to give an AI system an “unreasonably malicious” rather than just “somewhat unrealistically single-minded” prompt to get it to engage in deceptive behavior, or who irrationally think AIs will always know what we want and therefore can’t possibly be deceptive.