Letting Loose a Rogue AI is not a Pivotal Act
It seems likely that governments that witnessed an “almost AI takeover” would try to develop their own AGI. This could happen even if they understood the risks. They might think that other people are rushing it, and that they could do a better job.
But if they don’t really understand the risks, which is the situation right now, then they are even more likely to do it. So I don’t count this as a pivotal act. If you could get the outcome you described, then it would be a pivotal act, but the actions you propose would not produce that outcome with high probability. I would guess the probability is much less than 50%, and probably less than 10%.
There might be a version of this, with a much more concrete plan, such that we can see that the outcome would actually follow from executing the plan.
On Having an AI explain how Alignment is Hard
I think your second suggestion is interesting. I’d like to see you write a post about it exploring this idea further.
If we build a powerful AI and then have it tell us about all the things that can go wrong with an AI, then we might be able to generate enough scientific evidence about how hard alignment is, and how likely it is that we will die (e.g. in the current paradigm), that people would stop.
I am not talking about conceptual arguments. To me, at least, the current best conceptual arguments already point strongly in that direction. I am talking about extremely concrete, rigorous mathematical arguments, or specific experimental setups that show how specific phenomena do in fact arise. For example, imagine an experiment showing that Eliezer’s arguments are correct: that when you train hard enough on a general enough objective, you do in fact get more and more general cognitive algorithms, up to and including AGI. If the system also worked out rigorous formalisms to present these results correctly, that could be valuable.
The reason this seems good to me, at first sight, is that false positives are not that big of an issue. If an AI finds all the things that can go wrong, but 50% of them are false positives, in the sense that they would not be a problem in practice, we may still be saved, because we are aware of all the ways things can actually go wrong. When solving alignment, by contrast, false positives (thinking that something is safe when it is not) kill you.
Intuitively it also seems that evaluating whether something describes a failure case is a lot easier than evaluating whether something can’t fail.
When doing this, you are much less prone to the temptation of deploying a system in practice with the insights you got. Understanding failure modes does not necessarily imply that you know how to solve them (though it is the first step, and it definitely can lead there).
Pitfalls
That being said, there are a lot of potential pitfalls with this idea, though I don’t think they disqualify it:
An AI that could tell you how alignment is hard might already be so capable that it is dangerous.
When telling you how things are dangerous, it might formulate concepts that are also very useful for advancing capabilities.
If the AI is very smart it could probably trick you. It could present to you a set of problems with AI, such that it looks like if you solved all the problems you would have solved alignment. But in fact, you would still get a misaligned AGI that would then reward the AI that deceived you.
E.g. if the AI roughly understands its own cognitive reasoning process, and notices how it is not really aligned, it would give the AI information about what parts of alignment the humans have figured out already.
Can we make the AI figure out the useful failure modes? There are tons of failure modes, but ideally we would like to discover new ones, such that eventually we can paint a convincing picture of the hardness of the problem. An even better outcome would be a list of problems that correspond to important new insights (though this would go beyond what this proposal tries to do).
Preferred Planning
Let’s go back to your first pivotal act proposal. I think I might have figured out where you misstepped.
Missing-step plans are a fallacy, and thinking about them made me realize that you probably committed another type of planning fallacy here. I think you generated a plan and then assumed that some preferred outcome would occur. That outcome might be possible in principle, but it is not what would happen in practice. This seems closely related to The Tragedy of Group Selectionism.
This fallacy probably shows up when generating plans, because if there are no other people involved and the situation is not that complex, it is probably a very good heuristic. When you are generating a plan, you don’t want to fill in all the details of the plan. You want to make the planning problem as easy as possible. So our brain might implicitly make the assumption that we are going to optimize for the successful completion of the plan. That means that the plan can be useful as long as it roughly points in the correct direction. Mispredicting an outcome is fine, because later on when you realize that the outcome is not what you wanted, you can just apply more optimization pressure, changing the plan, such that now the plan again has the desired outcome. As long as you were walking roughly in the right direction, and things you have been doing so far don’t turn out to be completely useless, this heuristic is great for reducing the computational load of the planning task.
Details can be filled in later, and corrections can be made later, at least as long as you reevaluate your plan later on. You could do this by reevaluating the plan when:
A step is completed.
You notice a failure when executing the current step.
You notice that the next step has not been filled in yet.
A specific amount of time has passed.
Sidenote: Making an abstract step more concrete might seem like a different operation from regenerating the plan after you notice that it does not work. But both could involve the same planning procedure, in one case with a different starting point, and in the other with a different set of constraints.
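To make the reevaluation triggers concrete, here is a minimal Python sketch of a loop that executes a plan and reevaluates it whenever one of the four triggers fires. Everything in it is a hypothetical illustration: the `Step` class, the functions `trigger`, `run`, `toy_replan`, and `demo_execute`, the status strings, and the idea of measuring time in ticks are all assumptions, not part of the original proposal. Note that `toy_replan` is a single procedure that handles both refinement and repair, in line with the sidenote.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    concrete: bool = True  # False marks an abstract step whose details are not filled in yet


def trigger(completed: bool, failed: bool, next_step: Optional[Step],
            elapsed: int, review_interval: int) -> Optional[str]:
    """Return the reason to reevaluate the plan, or None if no trigger fires."""
    if failed:
        return "failure"            # a failure was noticed in the current step
    if next_step is not None and not next_step.concrete:
        return "unfilled"           # the next step has not been filled in yet
    if completed:
        return "step_completed"     # a step was completed
    if elapsed >= review_interval:
        return "timeout"            # a specific amount of time has passed
    return None


def run(plan: List[Step],
        execute: Callable[[Step], str],
        replan: Callable[[List[Step], str], List[Step]],
        review_interval: int = 3,
        max_ticks: int = 100) -> List[str]:
    """Execute the plan and return the list of reevaluation reasons that fired."""
    remaining = list(plan)
    log: List[str] = []
    elapsed = 0  # ticks since the last reevaluation
    for _ in range(max_ticks):
        if not remaining:
            break
        current = remaining[0]
        if current.concrete:
            status = execute(current)  # assumed to return "done", "failed", or "in_progress"
            elapsed += 1
            if status == "done":
                remaining = remaining[1:]
        else:
            status = "in_progress"     # nothing executable yet
        next_step = remaining[0] if remaining else None
        reason = trigger(status == "done", status == "failed",
                         next_step, elapsed, review_interval)
        if reason is not None:
            log.append(reason)
            remaining = replan(remaining, reason)
            elapsed = 0
    return log


def toy_replan(remaining: List[Step], reason: str) -> List[Step]:
    """One procedure for both refinement and repair; only the inputs differ."""
    if not remaining:
        return remaining
    head, rest = remaining[0], remaining[1:]
    if reason == "unfilled":
        # Refinement: plan again, starting from the abstract step.
        return [Step(f"{head.name}/a"), Step(f"{head.name}/b")] + rest
    if reason == "failure":
        # Repair: plan again, with the added constraint of avoiding what just failed.
        return [Step(f"{head.name}-alt")] + rest
    return remaining  # the plan still looks fine; keep going


def demo_execute(step: Step) -> str:
    """Toy world in which only the step named "build" fails."""
    return "failed" if step.name == "build" else "done"


if __name__ == "__main__":
    plan = [Step("gather"), Step("build"), Step("ship", concrete=False)]
    print(run(plan, demo_execute, toy_replan))
```

In this toy setup, a rough plan with one failing step and one abstract step still completes, because each trigger hands control back to the planner. That only works because the planner also controls execution. It does not carry over to plans whose outcome depends on how other people react.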
I expect part of the failure mode here is that you generate a plan and then, to evaluate its consequences, you implicitly plug yourself into the role of the people who would be impacted by the plan in order to predict their reaction. Without words, you think “What would I do if I observed a rogue AI almost taking over the world, if I were China?”, probably without realizing that this is what you are doing. But the resulting prediction is wrong.