What do these examples really show, other than that ChatGPT's completion abilities match what we associate with a well-behaved AI? The practice of jailbreaking/prompt-injection attacks shows that the capability for harm in LLMs is always there; it's never removed through alignment/RLHF, which just makes it harder to access. Doesn't RLHF simply improve a large majority of outputs so that they conform to what we consider acceptable outcomes? To me it feels a bit like convincing a blindfolded person that he's blind, hoping he won't figure out how to take the blindfold off.
Slightly unrelated question: why is conjuring up personas in prompts so effective?
To use your analogy, I think this is like a study showing that wearing a blindfold does reduce sight. It's a proof of concept that you can make that change, even though the subject isn't truly made blind, could possibly remove their own blindfold, etc.
I think this is notable because it highlights that LLMs (as they exist now) are not the expected-utility-maximizing agents to which all the negative results apply. It's a very different landscape if we can make our AI act corrigible (even if only in narrow ways that might be undone by prompt injection, etc.) versus if we're infinitely far away from an AI having an intuitive sense of "understanding that it might be flawed".