If grandma were a bit less naive, or a bit more intelligent, the scam wouldn’t work. But the DAN thing isn’t an exploit that can be solved merely by scaling up an AI or making it better at next-token-prediction.
It seems quite plausible to me that the DAN thing, or whatever other specific circa-2023 prompt injection method we pick, may actually be solved merely by making the AI more capable along the relevant dimensions. I think that the analogous intervention to “making grandma a bit less naive / a bit more intelligent” is already in progress (i.e. plain GPT-3 → + better pre-training → + instruction-tuning → + PPO based on a preference model → + Constitutional AI → … etc. etc.).
Some of those things sound like alignment solutions.
You can understand why “seems quite plausible” is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
Some of those things sound like alignment solutions.
These are “alignment methods” and also “capability-boosting methods” that are progressively unlocked with increasing model scale.
You can understand why “seems quite plausible” is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
Wait, hold up: insecure =/= unsafe =/= misaligned. My contention is that prompt injection is an example of bad security and lack of robustness, but not an example of misalignment. I am also making a prediction (not an assumption) that the next generation of naive, nonspecific methods will make near-future systems significantly more resistant to prompt injection, such that the current generation of prompt injection attacks will not work against them.
It seems quite plausible to me that the DAN thing, or whatever other specific circa-2023 prompt injection method we pick, may actually be solved merely by making the AI more capable along the relevant dimensions. I think that the analogous intervention to “making grandma a bit less naive / a bit more intelligent” is already in progress (i.e. plain GPT-3 → + better pre-training → + instruction-tuning → + PPO based on a preference model → + Constitutional AI → … etc. etc.).
Some of those things sound like alignment solutions.
You can understand why “seems quite plausible” is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
These are “alignment methods” and also “capability-boosting methods” that are progressively unlocked with increasing model scale.
Wait, hold up: insecure =/= unsafe =/= misaligned. My contention is that prompt injection is an example of bad security and lack of robustness, but not an example of misalignment. I am also making a prediction (not an assumption) that the next generation of naive, nonspecific methods will make near-future systems significantly more resistant to prompt injection, such that the current generation of prompt injection attacks will not work against them.