I think recursive self-improvement is probably the wrong frame, but I'll make a best-effort attempt to answer the questions.
I think corrigibility basically does that. Properly corrigible agents should remain corrigible under amplification, and competent corrigible agents should design/instantiate corrigible successors.
Another way this could be attained is if values are robust, or if we get alignment by default.
It does seem to me like the kernel is aligned (as much as is feasible at its limited capability level).
No, it does not. Your scenario only works because we've already solved the hard parts of alignment: we've either succeeded at making the AI corrigible in a way that's robust to scale/capability amplification, or succeeded in targeting it at "human values" in a way that's robust to scale/capability amplification.
Of course, if you solve alignment, more capable systems will be more competent at acting in accordance with their alignment target(s).