Some of those things sound like alignment solutions.
You can understand why “seems quite plausible” is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
These are “alignment methods” and also “capability-boosting methods” that are progressively unlocked with increasing model scale.
Wait, hold up: insecure ≠ unsafe ≠ misaligned. My contention is that prompt injection is an example of bad security and a lack of robustness, but not an example of misalignment. I am also making a prediction (not an assumption) that the next generation of naive, nonspecific methods will make near-future systems significantly more resistant to prompt injection, such that the current generation of prompt injection attacks will not work against them.