is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code)
Thanks for engaging with people’s comments here.
I have a somewhat related question on the subject of malicious elements in models. Does OAI’s Superalignment effort also intend to cover and defend against cases of s-risks? A well-known path to hyperexistential risk, for example, is sign-flipping, where in some variations a third-party actor (often unwittingly) inverts an AGI’s goal. It seems this has already happened with a GPT-2 instance! The usual proposed solutions involve building more and more defenses and safety procedures, both inside and outside the system. Could you tell me how you guys view these risks and whether/how the effort intends to investigate and cover them?
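
To make what I mean by sign-flipping concrete, here's a toy Python sketch (purely illustrative; the function names and numbers are mine, not anyone's actual training code). It just shows how a single flipped sign in a reward signal turns an optimizer that prefers the intended outputs into one that prefers exactly the outputs we wanted to avoid:

```python
# Toy illustration of a sign-flip bug in a reward pipeline.
# All names and values here are hypothetical.

def intended_reward(output_quality: float) -> float:
    # Higher-quality outputs should earn higher reward.
    return output_quality

def buggy_reward(output_quality: float) -> float:
    # A single sign flip (introduced by a bug or a third party):
    # the optimizer now ranks outputs in exactly the reverse order.
    return -output_quality

# Hypothetical candidate outputs with quality scores the designers assigned.
candidates = {"helpful": 0.9, "neutral": 0.5, "harmful": 0.1}

best_intended = max(candidates, key=lambda k: intended_reward(candidates[k]))
best_buggy = max(candidates, key=lambda k: buggy_reward(candidates[k]))

print(best_intended)  # "helpful" -- what we wanted
print(best_buggy)     # "harmful" -- same optimizer, inverted goal
```

The worry, of course, is that in a real pipeline the flip can be buried deep in reward modeling or fine-tuning code, and the system keeps optimizing competently, just toward the opposite target.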