The argument in this post does provide a harder barrier to takeoff, though. For a self-improving AI to become dangerous, you would first have to build an AI that could scale to uncontrollability using scaling techniques that are ‘safe’ relative to its reward function, where ‘safe’ means the AI can prove to itself that the change is safe (or that the next round of self-improvements it considers ‘safe’ after the first round is safe, and so on).
Regardless, I think a self-improving AI is more likely to come from humans deliberately designing one, which might render this kind of motive argument moot. And the AI might not be the uber-rational creature that has to prove to itself that its self-improved version won’t change its reward function; it might just try it anyway (as humans are doing now).
What I was pointing out is that the barrier is asymmetrical: it favors AIs with more easily aligned utility functions. A paperclipper is more likely to be able to build an improved paperclipper that it is sufficiently certain will massively increase its utility, while a more human-aligned AI would have to be more conservative.
In other words, this paper seems to say, “if we can create a human-aligned AI, it will be cautious about self-improvement, but dangerously unaligned AIs will probably face no such barrier.”