Hey yrimon, thanks for the post! Overall, I think it captures some of the core concerns of AI safety well, although I disagree with some of the claims made. One that particularly stuck out is quoted below.
> In order for this to occur, the AGI must trust that it will retain its identity or goals through the self improvement process, or that the new improved AGI it will build (or you will build) will share its goals. In other words, the AGI must be able to solve at least some form of the alignment problem.
I place low probability on this being true. Although an AGI might be helpful in solving the alignment problem, I doubt it would refuse further capability research just because it was unsure it could retain the same goals. This seems especially true if AGI arrives through the current paradigm of scaling up LLM assistants, which, as you note, are trained to be locally purposeful. It seems entirely possible that a locally purposeful AI assistant could self-improve (possibly very rapidly) without ever solving the alignment problem.