The first and most obvious issue here is that an AI that “solves alignment” sufficiently well to not fear self-improvement is not the same as an AI that’s actually aligned with humans. So there’s actually no protection there at all.
In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!
Last, but far from least, self-improvement of the form “get faster and run on more processors” is hardly challenging from an alignment perspective. And it’s far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.
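To make the "get faster" case concrete, here is a minimal sketch (hypothetical functions, not from the post): a speed-only rewrite whose input/output behavior is identical to the reference implementation, so any utility function defined over outputs is automatically preserved, and the equivalence is the kind of thing that can actually be proved rather than merely hoped for.

```python
import random

# Hypothetical illustration of a behavior-preserving "self-improvement":
# the fast version computes exactly the same function as the slow one,
# so any utility defined over its outputs is unchanged by the swap.

def count_inversions_slow(xs):
    """Reference implementation: O(n^2) pairwise comparison."""
    return sum(1 for i in range(len(xs))
                 for j in range(i + 1, len(xs)) if xs[i] > xs[j])

def count_inversions_fast(xs):
    """Optimized implementation: O(n log n) merge-sort inversion count."""
    def merge_count(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, l_inv = merge_count(a[:mid])
        right, r_inv = merge_count(a[mid:])
        merged, inv, i, j = [], l_inv + r_inv, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                inv += len(left) - i   # right[j] inverts with all remaining left items
                merged.append(right[j]); j += 1
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, inv
    return merge_count(list(xs))[1]

# Spot-check the equivalence (a real proof would cover all inputs):
for _ in range(1000):
    xs = [random.randint(0, 50) for _ in range(random.randint(0, 30))]
    assert count_inversions_slow(xs) == count_inversions_fast(xs)
```

Nothing about this check depends on what the utility function is, which is exactly what makes speed-and-scale improvements the easy case.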
In short, the overall approach seems like wishful thinking of the form, “maybe if it’s smart enough it won’t want to kill us.”
The argument in this post does provide a harder barrier to takeoff, though. In order for a self-improving AI to be dangerous, you would first have to make an AI that could scale to uncontrollability using scaling techniques that are 'safe' relative to its reward function, where 'safe' means it can prove to itself that they are safe (or that each subsequent round of self-improvement will in turn be judged 'safe' by the round before it, and so on).
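Spelled out in notation of my own (not the post's), that condition is an inductive chain: each successive version must be judged safe, with respect to the reward function U, by whatever proof ability its predecessor actually has.

```latex
A_0 \vdash \mathrm{Safe}_U(A_1), \qquad
A_1 \vdash \mathrm{Safe}_U(A_2), \qquad \ldots, \qquad
A_n \vdash \mathrm{Safe}_U(A_{n+1})
```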
Regardless, I think self-improving AI is more likely to come from humans designing a self-improving AI, which might render this kind of motive argument moot. And anyway, the AI might not be this uber-rational creature that has to prove to itself that its self-improved version won't change its reward function; it might just try it anyway (as humans are doing now).
What I was pointing out is that the barrier is asymmetrical: it's biased towards AIs with more easily aligned utility functions. A paperclipper is more likely to be able to create an improved paperclipper that it is sufficiently certain will massively increase its utility, while a more human-aligned AI would have to be more conservative.
In other words, this paper seems to say, “if we can create human-aligned AI, it will be cautious about self-improvement, but dangerously unaligned AIs will probably have no issues.”
the simpler the utility function the easier time it has guaranteeing the alignment of the improved version
If we are talking about a theoretical argmax_a E(U|a) AI, where E(U|a) (the expectation of utility U given action a) somehow points to the external world, then sure. If we are talking about a real AI with the aspiration to become the physical embodiment of that theoretical concept (with the said aspiration somehow encoded outside of U, because U is simple), then things get more hairy.
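For concreteness, the theoretical agent being contrasted here is the standard expected-utility maximizer (standard notation, not taken from the comment):

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \; \mathbb{E}\!\left[\, U \mid a \,\right]
```

The distinction being drawn seems to be that a real system only approximates this ideal: the machinery pushing it toward maximizing U lives outside U itself, and it is that machinery, not the simple U, that has to survive self-modification intact.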
I disagree with your framing of the post. I do not think that this is wishful thinking.
The first and most obvious issue here is that an AI that “solves alignment” sufficiently well to not fear self-improvement is not the same as an AI that’s actually aligned with humans. So there’s actually no protection there at all.
It is not certain that, upon deployment, the first intelligence capable of RSI will also be capable of solving alignment. Although such a gap seems improbable under more classic takeoff scenarios (e.g. Yudkowsky's hard takeoff), the likelihood of those outcomes has been the subject of great debate. I feel as though someone could argue for the claim "it is more likely than not that there will be a period of time in which AI is capable of RSI but not of solving alignment". The arguments in this post seem to me quite compatible with e.g. Jacob Cannell's soft(er) takeoff model, or many of Paul Christiano's takeoff writings.
In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!
Even with your model of solving alignment before or at the same time as RSI becomes feasible, I do not think that this holds up well. As far as I can tell, the simplicity of the utility function a general intelligence could be imbued with doesn't obviously determine the difficulty of alignment. My intuition is that attempting to align an intelligence with a utility function dependent on 100 desiderata is probably not that much easier than trying to align one with a utility function dependent on 1000. Sure, the more complex function is likely somewhat harder, but is utility-function complexity realistically anywhere near as large a hurdle as, say, robust delegation?
Last, but far from least, self-improvement of the form “get faster and run on more processors” is hardly challenging from an alignment perspective. And it’s far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.
This, in my opinion, is the strongest claim, and it is in essence quite similar to this post, my response to which was: "I question the probability of a glass-box transition of the type 'AGI RSIs toward a non-DL architecture that results in it maximizing some utility function in a pre-DL manner' being more dangerous than simply 'AGI RSIs'. If behaving like an expected utility maximizer were optimal, would not AGI have done so without the architecture transition? If not, then you need to make the case for why glass-box architectures are better ways of building cognitive systems. I think that this argument is at odds with the universal learning hypothesis (ULH) and seems more in line with evolved modularity, which has a notoriously poor mapping to post-DL thinking. ULH seems to suggest that modular approaches might actually be inferior, efficiency-wise, to universal learning approaches, which undercuts the primary motive a general intelligence might have to RSI in the direction of a glass-box architecture."
In summary: although it seems probable to me that algorithmic approaches are superior for some tasks, ULH would imply that the majority of tasks are best learned by a universal learning algorithm.