There are at least two meanings of alignment: 1. do what I want, and 2. don't have catastrophic failures.
I think that "alignment is easy" works only for the first requirement. But there could be many catastrophic failure modes; e.g., even humans can rebel or drift away from their initial goals.
Yes, I think this objection captures something important.
I have proven that aligned AI must exist and also that it must be practically implementable.
But some kind of failure, i.e. a "near miss" on achieving the desired goal, can happen even if success was possible.
I will address these near misses in future posts.