All of my contentions about whether or not OpenAI actually cares about this problem seem valid to me. However, while prompt injections are exploits developed by humans to get ChatGPT to do something off-brand, they’re probably not analogous to a grandma getting scammed by tech support.
When your grandmother gets scammed by foreigners pretending to be tech support, they do so by tricking her into thinking what she’s doing is appropriate given her utilityfunction. An example of a typical phone scam: someone will call grandma explaining that she paid for a service she never heard of, and ask if she wants a refund of 300$. She says yes, and the person asks to remote desktop into her computer. The “tech support” person pulls up a UI that suggests their company “accidentally” refunded her 3,000$ and she needs to give 2700$ back.
In this scenario, the problem is that the gang misled her about the state of the world, not that Grandma has a weird evolutionary tic that makes her want to give money to scammers. If grandma were a bit less naive, or a bit more intelligent, the scam wouldn’t work. But the DAN thing isn’t an exploit that can be solved merely by scaling up an AI or making it better at next-token-prediction. Plausibly the real issue is that the goal is next-token-prediction; OpenAI wants the bot to act like a bot, but the technique they’re using has these edge cases where the bot can’t differentiate between the prompt and the user-supplied content, so it ends up targeting something different. You could imagine a scaled-up ChatGPT reflecting on these subtle values differences when it gets stronk and doing something its operators wouldn’t like.
I’m still not sure what the technical significance of this error should have; perhaps it’s analogous to the kinds of issues that the alignment crowd thinks are going to lead to AIs destroying the planet. In any case, I don’t want to encourage people not to say something if I actually think it’s true..
On second thought, prompt injections are probably examples of misalignment
Changed my mind.
All of my contentions about whether or not OpenAI actually cares about this problem seem valid to me. However, while prompt injections are exploits developed by humans to get ChatGPT to do something off-brand, they’re probably not analogous to a grandma getting scammed by tech support.
When your grandmother gets scammed by foreigners pretending to be tech support, they do so by tricking her into thinking what she’s doing is appropriate given her utilityfunction. An example of a typical phone scam: someone will call grandma explaining that she paid for a service she never heard of, and ask if she wants a refund of 300$. She says yes, and the person asks to remote desktop into her computer. The “tech support” person pulls up a UI that suggests their company “accidentally” refunded her 3,000$ and she needs to give 2700$ back.
In this scenario, the problem is that the gang misled her about the state of the world, not that Grandma has a weird evolutionary tic that makes her want to give money to scammers. If grandma were a bit less naive, or a bit more intelligent, the scam wouldn’t work. But the DAN thing isn’t an exploit that can be solved merely by scaling up an AI or making it better at next-token-prediction. Plausibly the real issue is that the goal is next-token-prediction; OpenAI wants the bot to act like a bot, but the technique they’re using has these edge cases where the bot can’t differentiate between the prompt and the user-supplied content, so it ends up targeting something different. You could imagine a scaled-up ChatGPT reflecting on these subtle values differences when it gets stronk and doing something its operators wouldn’t like.
I’m still not sure what the technical significance of this error should have; perhaps it’s analogous to the kinds of issues that the alignment crowd thinks are going to lead to AIs destroying the planet. In any case, I don’t want to encourage people not to say something if I actually think it’s true..