cfoster0 comments on On second thought, prompt injections are probably examples of misalignment

cfoster0 21 Feb 2023 0:20 UTC
6 points
4

If grandma were a bit less naive, or a bit more intelligent, the scam wouldn’t work. But the DAN thing isn’t an exploit that can be solved merely by scaling up an AI or making it better at next-token-prediction.

It seems quite plausible to me that the DAN thing, or whatever other specific circa-2023 prompt injection method we pick, may actually be solved merely by making the AI more capable along the relevant dimensions. I think that the analogous intervention to “making grandma a bit less naive / a bit more intelligent” is already in progress (i.e. plain GPT-3 → + better pre-training → + instruction-tuning → + PPO based on a preference model → + Constitutional AI → … etc. etc.).
- lc 21 Feb 2023 17:29 UTC
  4 points
  1
  Parent
  1. Some of those things sound like alignment solutions.
  2. You can understand why “seems quite plausible” is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
  - cfoster0 21 Feb 2023 18:24 UTC
    1 point
    0
    Parent
    Some of those things sound like alignment solutions.
    These are “alignment methods” and also “capability-boosting methods” that are progressively unlocked with increasing model scale.
    You can understand why “seems quite plausible” is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
    Wait, hold up: insecure =/= unsafe =/= misaligned. My contention is that prompt injection is an example of bad security and lack of robustness, but not an example of misalignment. I am also making a prediction (not an assumption) that the next generation of naive, nonspecific methods will make near-future systems significantly more resistant to prompt injection, such that the current generation of prompt injection attacks will not work against them.