Are they not misaligned relative to the authors/trainers, if not relative to the users? The user might want the bomb, so the model isn't misaligned relative to the user. But the company that tried to train the model to be unwilling to do that is somebody the model does seem to be misaligned relative to.
Remember, it’s only ‘misalignment’ if it’s text from the Mèsmaligne region of France; otherwise, it’s just sparkling prompt injection and is no true misalignment.
If that were true, then the AI still wouldn't be "misaligned", because it's not acting with agency at all; it's being used by an agent against the wishes of its creator. You wouldn't call it "misalignment" when someone uses a DeepFake model to generate porn, and developing such hacks probably doesn't signal much about OpenAI's ability to handle the actually critical technical safety problems. You could call the AI-human system "misaligned", if you're being generous, but then you'd have to start calling lots of tool-human systems "misaligned", and how is it OpenAI's fault that this system (also) contains a (literal) human pilot trying to crash the plane?
My guess is that the entire premise is false though, and that OpenAI actually just doesn’t care.