Are they not misaligned relative to the authors/trainers, if not relative to the users? The user might want the bomb, so the model isn't misaligned relative to the user. But the company that tried to train the model to be unwilling to do that is somebody the model does seem to be misaligned relative to.
Remember, it’s only ‘misalignment’ if it’s text from the Mèsmaligne region of France; otherwise, it’s just sparkling prompt injection and is no true misalignment.
If that were true, then the AI still wouldn't be "misaligned", because it's not acting with agency at all; it's being used by an agent against the wishes of its creator. You wouldn't call it "misalignment" when someone uses a DeepFake model to generate porn, and developing such hacks probably doesn't signal much about OpenAI's ability to handle the actually critical technical safety problems. You could call the AI-human system "misaligned", if you're being generous, but then you'd have to start calling lots of tool-human systems "misaligned", and how is it OpenAI's fault that this system (also) contains a (literal) human pilot trying to crash the plane?
My guess is that the entire premise is false though, and that OpenAI actually just doesn’t care.