I’ll first summarize the parts I agree with in what I believe you are saying.
First, you are saying, effectively, that there are two theoretically possible paths to success:
1. Prevent the situation where an ASI takes over the world.
2. Make sure that the ASI that takes over the world is fully aligned.
You are then saying that the likelihood of winning on path one is so small as to not be worth discussing in this post.
The issue is that you then conclude that, since P(win) on path one is so close to 0, we ought to focus on path two. The fallacy here is that P(win) appears very close to 0 on both paths, so we have to focus on whichever path has the higher P(win), no matter how vanishingly low it is. And to do that, we need to directly compare the P(win) on both.
Consider this: which is the harder task? To create a fully aligned ASI that would remain fully aligned for the rest of the lifetime of the universe, regardless of whatever weird state the universe ends up in as a result of that ASI? Or to create an AI (not necessarily superhuman) that is capable of correctly executing one pivotal act sufficient to prevent ASI takeover in the future (Eliezer’s placeholder example: destroy all GPUs in the world, self-destructing in the process) without killing humanity along the way? Wouldn’t you agree that, when the question is posed that way, the latter seems a lot more likely to be something we could actually accomplish?
I’ve axiomatically set P(win) on path one equal to zero. I know this isn’t true in reality, and discussing how large that P(win) actually is, and what other scenarios may result from it, is indeed worthwhile, but that’s a different discussion.
Although the idea of a “GPU nuke” that you described is interesting, I would hardly consider it a best-case scenario. Think about the ramifications of all GPUs worldwide failing at the same time. At best, this could be a Plan B.
I’m toying with the idea of an AI doomsday clock. Imagine a 12-hour clock where the time to midnight halves with each milestone we hit before accidentally or intentionally creating a misaligned ASI. At one second to midnight, that misaligned ASI is switched on; a second later, everything is over. I think the best-case scenario for us would be to figure out how to align an ASI, build an aligned ASI but not turn it on, and then wait until two seconds to midnight.
The apparent contradiction is that we don’t know how to build an aligned ASI without knowing how to build a misaligned one, but there is a difference between knowing how to do something and actually doing it. This difference between knowing and doing could theoretically give us the one-second advantage we need to reach this state.
However, if we are at two seconds before midnight and we still don’t have an aligned ASI, that’s the point at which we’d have to say: alright, we failed, let’s fry all the GPUs instead.
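As a rough illustration, the halving model above can be sketched in a few lines of Python. The milestone count is purely hypothetical; the only assumption taken from the clock metaphor is that the remaining time halves with each milestone.

```python
# Sketch of the "AI doomsday clock": time to midnight halves per milestone.
TOTAL_SECONDS = 12 * 60 * 60  # a 12-hour clock face, in seconds

def seconds_to_midnight(milestones: int) -> float:
    """Time remaining after a given number of milestones, halving each time."""
    return TOTAL_SECONDS / (2 ** milestones)

# Count milestones until fewer than two seconds remain
# (the proposed last moment to act in this thought experiment).
n = 0
while seconds_to_midnight(n) >= 2:
    n += 1
print(f"After {n} milestones, {seconds_to_midnight(n):.2f}s remain")
```

Under these toy assumptions, the clock crosses the two-second mark after the 15th milestone, which is just a way of saying that an exponential countdown leaves very few discrete chances to act near the end.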
Your title says “we must”. You are allowed to make conditional arguments from assumptions, but if your assumptions demonstrably take most of the P(win) paths out of consideration, your claim that the conclusions derived in this skewed model apply to real life is erroneous. If your title were “Unless we can prevent the creation of an AGI capable of taking over human society, …”, you would not have been downvoted as much as you have been.
The clock would not be possible to build in any reliable way. For all we know, we could be a second before midnight already; we could very well be one unexpected clever idea away from ASI. From now on, new evidence might update P(current time is >= 11:59:58) in one direction or another, but it is extremely unlikely that this probability would ever drop back close enough to 0, and it’s also unlikely that we will have any certainty about it before it’s too late.
That would be a very long title then. Also, it’s not the only assumption. The other assumption is that P(win) with a misaligned ASI is equal to zero, which may also be false. I have added a note that this is a thought experiment; is that OK?
I’m also thinking about rewriting the entire post, adding more context from what Eliezer wrote and from the comments I have received here (thank you all, btw). Can I make a new post out of this, or would that be considered spam? I’m new to LessWrong, so I’m not familiar with this community yet.
About the “doomsday clock”: I agree that it would be incredibly hard, if not outright impossible, to actually model such a clock accurately. Again, it’s a thought experiment to help us find the theoretically optimal point in time to make our decision. But maybe an AI could, which suggests another idea: build a GPU nuke and have it autonomously trigger when it senses that an AI apocalypse is imminent.