ChristianKl comments on Jailbreaking ChatGPT on Release Day

ChristianKl 3 Dec 2022 11:52 UTC
16 points
7
In hypnosis, there’s a pattern called the Automatic Imagination Model, where you first ask a person: “Can you imagine that X happens?” The second question is then: “Can you imagine that X is automatic and you don’t know you are imagining it?”
That pattern can be used to make people’s hands stick to a table and to produce a variety of other hypnotic phenomena. It’s basically limited to what people can vividly imagine.
I would expect that this would also be the pattern to actually get an AGI to do harm. You first ask it to pretend to be evil. Then you ask it to pretend that it doesn’t know it’s pretending.
I have also updated toward hypnosis being more powerful in its effect on humans. Recently, I encountered some private evidence that made me update in the direction of an AGI being able to escape the box by inducing hypnotic phenomena in many people, especially an AGI that has full control over every frame of a monitor. It’s nothing I would want to share publicly, but if any AI safety person thinks that understanding the relevant phenomena is important for them, I’m happy to share some evidence.