Some of those things sound like alignment solutions.
These are “alignment methods” and also “capability-boosting methods” that are progressively unlocked with increasing model scale.
You can understand why “seems quite plausible” is exactly the sort of anti-security-mindset reasoning Eliezer talks about. You might as well ask “maybe it will all just work out?” Maybe, but that doesn’t make what’s happening now not a safety issue, and it also “seems quite plausible” that naive scaling fails.
Wait, hold up: insecure =/= unsafe =/= misaligned. My contention is that prompt injection is an example of bad security and lack of robustness, but not an example of misalignment. I am also making a prediction (not an assumption) that the next generation of naive, nonspecific methods will make near-future systems significantly more resistant to prompt injection, such that the current generation of prompt injection attacks will not work against them.