Yes, we are currently planning to continue pursuing these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training and testing will also be core pieces, and I expect we’ll add more over time.
I don’t really understand why many people updated so heavily on the obfuscated arguments problem. I don’t think there was ever good reason to believe that IDA/debate/RRM would scale indefinitely, and I personally don’t think that problem will be a big blocker for a while for some of the tasks that we’re most interested in (alignment research). My understanding is that many people at DeepMind and Anthropic remain optimistic about debate variants and have been running a number of preliminary experiments (see e.g. this Anthropic paper).
My best guess for why you haven’t heard much about it is that people weren’t that interested in running more toy tasks or more human-only experiments, and LLMs haven’t been good enough to do much beyond critique-writing (we tried this a little bit in the early days of GPT-4). Most people who’ve been working on this recently don’t really post much on LW/AF.
Thanks for engaging with my questions here. I’ll probably have more questions later as I digest the answers and (re)read some of your blog posts. In the meantime, do you know what Paul meant by “it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents” in the other subthread?
I’m not entirely sure but here is my understanding:
I think Paul pictures relying heavily on process-based approaches where you trust the output a lot more because you closely held the system’s hand through the entire process of producing the output. I expect this will sacrifice some competitiveness, but as long as the sacrifice is not too large it shouldn’t be much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time.
Personally I am more bullish on understanding how we can get agents to help with evaluation of other agents’ outputs such that you can get them to tell you about all of the problems they know about. The “offense-defense” balance to understand here is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code) past a motivated human supervisor empowered with a similarly smart AI system trained to help them.
is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code)
Thanks for engaging with people’s comments here.
I have a somewhat related question on the subject of malicious elements in models. Does OAI’s Superalignment effort also intend to cover and defend against s-risks? A well-known path to hyperexistential risk, for example, is sign-flipping, where in some variations a third-party actor (often unwillingly) inverts an AGI’s goal. It seems this has already happened with a GPT-2 instance, too! The usual proposed solutions are building more and more defenses and safety procedures both inside and outside. Could you tell me how you guys view these risks and if/how the effort intends to investigate and cover them?