If we only solve deceptive alignment, we still lose, because we haven’t solved the problem of how to use a non-deceptive AI to do good things but not bad things.
Some people are real big on the “humans will just look at its plans and stop it if we think they’re bad” strategy, but humans sign off on plans that sound good and turn out to be bad on reflection all the time.
It’s also nontrivial to generate any good plans in the first place: if you’re trying to generalize human preferences over policies to superhumanly competent policies, then either you’re already solving the non-deceptive parts of the alignment problem or you’re not generating many good policies for humans to vet.
In short, making a misaligned AI not deceive you is very much not the same thing as making an aligned AI.