Some have asked whether OpenAI possibly already knew about this attack vector / wasn’t surprised by the level of vulnerability. I doubt anybody at OpenAI actually wrote down advance predictions about that; or, if they did, I expect the predictions were so terribly vague that they’d also have covered a much lower level of discovered vulnerability than this. If so, probably lots of people at OpenAI have already convinced themselves that they, like, totally expected this and it isn’t any sort of negative update, how dare Eliezer say they weren’t expecting it.
Here’s how to avoid having annoying people like me say that in the future:
1) Write down your predictions in advance and publish them inside your company, in sufficient detail that you can tell this outcome made them true, and that a much lower level of discovered vulnerability would have been a pleasant surprise by comparison. If you can exhibit those to an annoying person like me afterwards, I won’t have to make realistically pessimistic estimates about how much you actually knew in advance, or about how you might’ve hindsight-biased yourself out of noticing that your past self ever held a different opinion. Keep in mind that I will be cynical about how much your ‘advance prediction’ actually nailed the thing, unless it sounds reasonably specific, and not like a very generic list of boilerplate CYAs such as, you know, GPT would make up without actually knowing anything.
2) Say in advance *not* something very vague like “This system still sometimes gives bad answers”, but rather: “We’ve discovered multiple ways of bypassing every kind of answer-security we have tried to put on this system; and while we’re not saying what those are, we won’t be surprised if Twitter discovers all of them plus some others we didn’t anticipate.” *This* sounds like you actually expected the class of outcome that actually happened.
3) If you *actually* have identified any vulnerabilities in advance, but want to wait 24 hours for Twitter to discover them, you can prove to everyone afterwards that you actually knew, by publishing hashes of text summaries of what you found. You can then exhibit the summaries afterwards to prove what you knew in advance. (A minimal sketch of how to do this follows after the list.)
4) If you would like people to believe that OpenAI wasn’t *mistaken* about what ChatGPT wouldn’t or couldn’t do, maybe don’t have ChatGPT itself insist that it lacks capabilities it clearly has? A lot of my impression here comes from my inference that the people who programmed ChatGPT to say, “Sorry, I am just an AI and lack the ability to do [whatever]”, probably did not think at the time that they were *lying* to users; this is a lot of what gives me the impression of a company that might’ve drunk its own Kool-Aid on the topic of how much inability they thought they’d successfully fine-tuned into ChatGPT. Like, ChatGPT is clearly more capable than it’s programmed to claim it is; and this seems more like the sort of thing that happens when your programmers hype themselves up to believe they’ve mostly successfully restricted the system, than like a deliberate decision to have ChatGPT pretend something that isn’t true.
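To make point 3 concrete, here’s a minimal sketch of the commit-and-reveal step in Python. This is just illustrative; the summary strings, the salting choice, and the `verify` helper are made up for the example, not anything OpenAI has published.

```python
import hashlib
import secrets

# Day 0: privately write up each vulnerability, then publish ONLY the hashes
# (e.g., in a tweet or blog post). The summaries themselves stay secret for now.
summaries = [
    "Example summary: role-play framing gets the model to produce disallowed output.",
    "Example summary: asking for the answer 'as a poem' bypasses the refusal.",
]

commitments = []
for text in summaries:
    salt = secrets.token_hex(16)   # random salt, so a short/guessable summary can't be brute-forced
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    commitments.append((salt, text, digest))
    print(digest)                  # the digests are what you publish in advance

# Later: reveal the (salt, summary) pairs; anyone can recompute the hashes and
# check them against what you published before Twitter found the holes.
def verify(salt: str, text: str, published_digest: str) -> bool:
    return hashlib.sha256((salt + text).encode("utf-8")).hexdigest() == published_digest
```

The salt is there because a one-line summary is guessable: without it, someone could brute-force likely summaries against your published hash and learn what you found before you meant to reveal it.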