I disagree with your framing of the post. I do not think that this is wishful thinking.
> The first and most obvious issue here is that an AI that “solves alignment” sufficiently well to not fear self-improvement is not the same as an AI that’s actually aligned with humans. So there’s actually no protection there at all.
It is not certain that, upon deployment, the first intelligence capable of RSI will also be capable of solving alignment. Such a gap seems improbable under the more classic takeoff scenarios (e.g. Yudkowsky’s hard takeoff), but the likelihood of those outcomes has been the subject of great debate. I think someone could reasonably argue for the claim “it is more likely than not that there will be a period of time in which AI is capable of RSI but not of solving alignment”. The arguments in this post seem to me quite compatible with e.g. Jacob Cannell’s soft(er) takeoff model, or many of Paul Christiano’s takeoff writings.
> In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!
Even under your model, in which alignment is solved before or at the same time as RSI becomes feasible, I do not think this holds up well. As far as I can tell, the simplicity of the utility function a general intelligence could be imbued with doesn’t obviously determine the difficulty of alignment. My intuition is that aligning an intelligence whose utility function depends on 100 desiderata is probably not that much easier than aligning one whose utility function depends on 1000. Sure, the latter is likely somewhat harder, but is utility function complexity realistically anywhere near as large a hurdle as, say, robust delegation?
> Last, but far from least, self-improvement of the form “get faster and run on more processors” is hardly challenging from an alignment perspective. And it’s far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.
This, in my opinion, is the strongest claim, and it is in essence quite similar to this post, my response to which was: “I question the probability of a glass-box transition of the type ‘AGI RSIs toward a non-DL architecture that results in it maximizing some utility function in a pre-DL manner’ being more dangerous than simply ‘AGI RSIs’. If behaving like an expected utility maximizer were optimal, would the AGI not have done so without the architecture transition? If not, then you need to make the case for why glass-box architectures are better ways of building cognitive systems. I think this argument is at odds with the universal learning hypothesis (ULH) and seems more in line with evolved modularity, which has a notoriously poor mapping to post-DL thinking. ULH seems to suggest that modular approaches might actually be inferior, efficiency-wise, to universal learning approaches, which undercuts the primary motive a general intelligence might have to RSI toward a glass-box architecture.”
In summary: although algorithmic approaches are probably superior for some tasks, ULH would imply that the majority of tasks are best learned by a universal learning algorithm.