The issue is that without perfect alignment, we really can't be sure of directing an AI toward even one of your three examples, which you consider easy; they probably aren't. Not without "perfect alignment," because an advanced AI will probably be constantly rewriting its own code (recursive self-improvement), so there are no certainties. We need formal proof of safety/control.
If an advanced AI is editing its own code, it would only do so in service of its internal utility function, which it will want to keep stable (since changing its utility function would make achieving it much less likely). Therefore, at least as far as I can tell, we only need to worry about the initial utility function we assign it.
I'd place significant probability on us living in a world where a large chunk of alignment failures end up looking vaguely like one of the three examples I brought up, or at least converging on a relatively small number of "attractors," if you will.