FWIW, I think (a somewhat different form of) recursive self-improvement is still incredibly important to consider in alignment. E.g., an RL system’s actions can influence the future situations it encounters, and those future situations can influence which cognitive patterns the outer optimizer reinforces. Thus, there’s a causal pathway by which the AI’s actions can influence its own cognitive patterns. This holds for any case where the system can influence its own future inputs.
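To make that causal pathway concrete, here’s a minimal toy sketch (purely illustrative; the environment, names, and update rule are all made up rather than taken from any real system) of an RL loop in which the agent’s current actions determine which states it visits next, and therefore which (state, action) patterns a simple REINFORCE-style outer update ends up reinforcing:

```python
# Toy illustration of the feedback loop described above: the policy's actions
# determine which states it visits, and the visited states determine which
# behaviors the outer optimizer (a crude REINFORCE-style update) reinforces.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
logits = np.zeros((n_states, n_actions))  # the agent's "cognitive patterns"

def policy(state):
    p = np.exp(logits[state])
    p /= p.sum()
    return rng.choice(n_actions, p=p), p

def env_step(state, action):
    # The action chosen now determines the next state the agent sees,
    # and hence which (state, action) pairs can get reinforced later.
    next_state = (state + action + 1) % n_states
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

state, lr = 0, 0.1
for _ in range(1000):
    action, probs = policy(state)
    next_state, reward = env_step(state, action)
    # Outer optimizer: reinforce whatever the agent happened to do in the
    # situations its own past actions steered it into.
    grad = -probs
    grad[action] += 1.0
    logits[state] += lr * reward * grad
    state = next_state
```

The point isn’t the particular update rule; it’s that the data the outer optimizer trains on is downstream of the agent’s own behavior, which closes the loop from the AI’s actions back to its own cognition.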
It seems clear to me that any alignment solution must be robust to at least some degree of self-modification on the part of the AI.
I totally agree. I quite like Paul Christiano’s post “Mundane solutions to exotic problems”, which lays out how he thinks about this from a prosaic alignment perspective.