Within the first section (prompting/RLHF/Constitutional):
I’d guess that Constitutional AI would work only in the very easiest worlds
RLHF would work in slightly less-easy worlds
Prompting would work in worlds where alignment is easy, but too hard for RLHF or Constitutional AI
The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness—e.g. that thing with the hand which looks like it’s grabbing a ball but isn’t is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness. Among the three options, “Constitutional” AI applies the most optimization pressure toward deceiving humans (IIUC), RLHF the next most, whereas prompting alone provides zero direct selection pressure for deception; it is by far the safest option of the three. (Worlds Where Iterative Design Fails talks more broadly about the views behind this.)
Next up, I’d put “Experiments with Potentially Catastrophic Systems to Understand Misalignment” as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn’t notice when it’s in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn’t reveal), then we don’t really need oversight tools in the first place. Just test the thing and see if it misbehaves.
The oversight stuff would be the next three hardest worlds (5th-7th). As written I think they’re correctly ordered, though I’d flag that “AI research assistance” as a standalone seems far safer than using AI for oversight. The last three seem correctly-ordered to me.
I’d also add that all of these seem very laser-focused on intentional deception as the failure mode, which is a reasonable choice for limiting scope, but sure does leave out an awful lot.
The phenomenon that a ‘better’ technique is actually worse than a ‘worse’ technique if both are insufficient is something I talk about in a later section of the post and I specifically mention RLHF. I think this holds true in general throughout the scale, e.g. Eliezer and Nate have said that even complex interpretability-based oversight with robustness testing and AI research assistance is also just incentivizing more and better deception, so this isn’t unique to RLHF.
But I tend to agree with Richard’s view in his discussion with you under that post that while if you condition on deception occurring by default RLHF is worse than just prompting (i.e. prompting is better in harder worlds), RLHF is better than just prompting in easy worlds. I also wouldn’t call non-strategically aware pursuit of inaccurate proxies for what we want ‘deception’, because in this scenario the system isn’t being intentionally deceptive.
In easy worlds, the proxies RLHF learns are good enough in practice and cases like the famous thing with the hand which looks like it’s grabbing a ball but isn’t just disappear if you’re diligent enough with how you provide feedback. In that world, not using RLHF would get systems pursuing cruder and worse proxies for what we want that fail often (e.g. systems just overtly lie to you all the time, say and do random things etc.). I think that’s more or less the situation we’re in right now with current AIs!
If the proxies that RLHF ends up pursuing are in fact close enough, then RLHF works and will make systems behave more reliably and be harder to e.g. jailbreak or provoke into random antisocial behavior than with just prompting. I did flag in a footnote that the ‘you get what you measure’ problem that RLHF produces could also be very difficult to deal with for structural or institutional reasons.
Next up, I’d put “Experiments with Potentially Catastrophic Systems to Understand Misalignment” as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn’t notice when it’s in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn’t reveal), then we don’t really need oversight tools in the first place.
I’m assuming you meant fourth-easiest here not fourth hardest. It’s important to note that I’m not here talking about testing systems to see if they misbehave in a sandbox and then if they don’t assuming you’ve solved the problem and deploying. Rather, I’m talking about doing science with models that exhibit misaligned power seeking, with the idea being that we learn general rules about e.g. how specific architectures generalize, why certain phenomena arise etc. that are theoretically sound and we expect to hold true even post deployment with much more powerful systems. Incidentally this seems quite similar to what the OpenAI superalignment team is apparently planning.
So it’s basically, “can we build a science of alignment through a mix of experimentation and theory”. So if e.g. we study in a lab setting a model that’s been fooled into thinking it’s been deployed, then commits a treacherous turn, enough times we can figure out the underlying cause of the behavior and maybe get new foundational insights? Maybe we can try to deliberately get AIs to exhibit misalignment and learn from that. It’s hard to anticipate in advance what scientific discoveries will and won’t tell you about systems, and I think we’ve already seen cases of experiment-driven theoretical insights, like simulator theory, that seem to offer new handles for solving alignment. How much quicker and how much more useful will these be if we get the chance to experiment on very powerful systems?
I would order these differently.
Within the first section (prompting/RLHF/Constitutional):
I’d guess that Constitutional AI would work only in the very easiest worlds
RLHF would work in slightly less-easy worlds
Prompting would work in worlds where alignment is easy, but too hard for RLHF or Constitutional AI
The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness—e.g. that thing with the hand which looks like it’s grabbing a ball but isn’t is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness. Among the three options, “Constitutional” AI applies the most optimization pressure toward deceiving humans (IIUC), RLHF the next most, whereas prompting alone provides zero direct selection pressure for deception; it is by far the safest option of the three. (Worlds Where Iterative Design Fails talks more broadly about the views behind this.)
Next up, I’d put “Experiments with Potentially Catastrophic Systems to Understand Misalignment” as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn’t notice when it’s in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn’t reveal), then we don’t really need oversight tools in the first place. Just test the thing and see if it misbehaves.
The oversight stuff would be the next three hardest worlds (5th-7th). As written I think they’re correctly ordered, though I’d flag that “AI research assistance” as a standalone seems far safer than using AI for oversight. The last three seem correctly-ordered to me.
I’d also add that all of these seem very laser-focused on intentional deception as the failure mode, which is a reasonable choice for limiting scope, but sure does leave out an awful lot.
The phenomenon that a ‘better’ technique is actually worse than a ‘worse’ technique if both are insufficient is something I talk about in a later section of the post and I specifically mention RLHF. I think this holds true in general throughout the scale, e.g. Eliezer and Nate have said that even complex interpretability-based oversight with robustness testing and AI research assistance is also just incentivizing more and better deception, so this isn’t unique to RLHF.
But I tend to agree with Richard’s view in his discussion with you under that post that while if you condition on deception occurring by default RLHF is worse than just prompting (i.e. prompting is better in harder worlds), RLHF is better than just prompting in easy worlds. I also wouldn’t call non-strategically aware pursuit of inaccurate proxies for what we want ‘deception’, because in this scenario the system isn’t being intentionally deceptive.
In easy worlds, the proxies RLHF learns are good enough in practice and cases like the famous thing with the hand which looks like it’s grabbing a ball but isn’t just disappear if you’re diligent enough with how you provide feedback. In that world, not using RLHF would get systems pursuing cruder and worse proxies for what we want that fail often (e.g. systems just overtly lie to you all the time, say and do random things etc.). I think that’s more or less the situation we’re in right now with current AIs!
If the proxies that RLHF ends up pursuing are in fact close enough, then RLHF works and will make systems behave more reliably and be harder to e.g. jailbreak or provoke into random antisocial behavior than with just prompting. I did flag in a footnote that the ‘you get what you measure’ problem that RLHF produces could also be very difficult to deal with for structural or institutional reasons.
I’m assuming you meant fourth-easiest here not fourth hardest. It’s important to note that I’m not here talking about testing systems to see if they misbehave in a sandbox and then if they don’t assuming you’ve solved the problem and deploying. Rather, I’m talking about doing science with models that exhibit misaligned power seeking, with the idea being that we learn general rules about e.g. how specific architectures generalize, why certain phenomena arise etc. that are theoretically sound and we expect to hold true even post deployment with much more powerful systems. Incidentally this seems quite similar to what the OpenAI superalignment team is apparently planning.
So it’s basically, “can we build a science of alignment through a mix of experimentation and theory”. So if e.g. we study in a lab setting a model that’s been fooled into thinking it’s been deployed, then commits a treacherous turn, enough times we can figure out the underlying cause of the behavior and maybe get new foundational insights? Maybe we can try to deliberately get AIs to exhibit misalignment and learn from that. It’s hard to anticipate in advance what scientific discoveries will and won’t tell you about systems, and I think we’ve already seen cases of experiment-driven theoretical insights, like simulator theory, that seem to offer new handles for solving alignment. How much quicker and how much more useful will these be if we get the chance to experiment on very powerful systems?