I’m not entirely sure but here is my understanding:
I think Paul pictures relying heavily on process-based approaches, where you trust the output a lot more because you closely held the system’s hand through the entire process of producing it. I expect this will sacrifice some competitiveness, and as long as it’s not too much, it shouldn’t be that much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time.
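To make the contrast concrete, here’s a toy sketch (made-up helper names, not anything from a real training stack) of the difference between outcome-based scoring, which only checks the final answer, and process-based scoring, which checks every intermediate step:

```python
# Toy sketch only: `grade_step` and `grade_answer` are hypothetical
# supervisor callbacks, standing in for whatever checking process is used.

from typing import Callable, List


def outcome_score(answer: str, grade_answer: Callable[[str], float]) -> float:
    # All trust rests on a single check of the final output.
    return grade_answer(answer)


def process_score(steps: List[str], grade_step: Callable[[str], float]) -> float:
    # Trust is built up step by step: every intermediate step must pass the
    # supervisor's check, which is what costs the extra supervision time.
    return min(grade_step(s) for s in steps)  # one bad step sinks the trajectory


if __name__ == "__main__":
    steps = ["restate the problem", "derive the bound", "state the answer"]
    print(process_score(steps, grade_step=lambda s: 1.0))   # 1.0
    print(outcome_score("42", grade_answer=lambda a: 1.0))  # 1.0
```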
Personally, I am more bullish on understanding how we can get agents to help with evaluating other agents’ outputs, such that you can get them to tell you about all of the problems they know about. The “offense-defense” balance to understand here is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code) past a motivated human supervisor empowered with a similarly smart AI system trained to help them.
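Here is a similarly rough sketch of that evaluation-assistance setup, with hypothetical `critique_model` and `human_judgement` callbacks standing in for the trained assistant and the human supervisor:

```python
# Toy sketch only: the assistant is asked to surface every problem it knows
# about in another agent's artifact, and the human decides with those flags
# in hand. All names here are made up for illustration.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    accepted: bool
    problems: List[str]


def assisted_review(
    artifact: str,
    critique_model: Callable[[str], List[str]],        # hypothetical: returns known problems
    human_judgement: Callable[[str, List[str]], bool],  # hypothetical: human decision given the flags
) -> Review:
    problems = critique_model(artifact)
    accepted = human_judgement(artifact, problems)
    return Review(accepted=accepted, problems=problems)


if __name__ == "__main__":
    review = assisted_review(
        artifact="def transfer(amount): ...",
        critique_model=lambda a: ["no auth check on transfer"],
        human_judgement=lambda a, probs: len(probs) == 0,
    )
    print(review)  # rejected, because the assistant flagged a problem
```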
Thanks for engaging with people’s comments here.
I have a somewhat related question on the subject of malicious elements in models. Does OAI’s Superalignment effort also intend to cover and defend against s-risks? A well-known path to hyperexistential risk, for example, is sign-flipping, where in some variations a third-party actor (often unwittingly) inverts an AGI’s goal. It seems this has already happened with a GPT-2 instance, too! The usual proposed solutions are building more and more defenses and safety procedures, both inside and outside the system. Could you tell me how you guys view these risks, and if/how the effort intends to investigate and cover them?
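To illustrate what I mean by sign-flipping, here is a toy numeric sketch (not tied to any real system): the same optimizer that climbs toward the intended objective will, if the reward’s sign is accidentally negated, climb just as eagerly toward its opposite.

```python
# Toy sketch only: a single flipped sign turns "maximize the intended reward"
# into "maximize its negation", and the optimizer pursues that just as hard.

def reward(x: float) -> float:
    # Intended objective: peaks at x = 3.
    return -(x - 3.0) ** 2


def hill_climb(objective, x: float = 0.0, lr: float = 0.1, steps: int = 50) -> float:
    # Plain finite-difference gradient ascent on whatever objective it is given.
    for _ in range(steps):
        grad = (objective(x + 1e-4) - objective(x - 1e-4)) / 2e-4
        x += lr * grad
    return x


if __name__ == "__main__":
    print(round(hill_climb(reward), 2))                # ~3.0, the intended target
    print(round(hill_climb(lambda x: -reward(x)), 2))  # diverges far away from 3 under the sign flip
```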