I don’t think {not noticing that mindcrime is a problem} is a violation of alignment: the AI is trying to do what you want but makes a moral error.
I do think {if the AI is too weak, it secretly plots to kill everyone} is a violation of alignment: the AI isn’t trying to do what you want. It knows that you don’t want it to kill everyone; that’s why it’s trying to keep it secret.
(It’s technically possible for an AI to kill everyone, and even to secretly kill everyone, while trying to do what you want, because it makes a mistake. This seems like an inevitable feature of any sensible definition of alignment. I expect we would now have an involved discussion about where exactly the difference lies.)
(Wei Dai and I discussed my definition of alignment offline, leading to this post which hopefully clarifies things a little bit in addition to summarizing the takeaways from this thread.)