I think I’ve got it, the fix to the problem in my corrigibility thing!
So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don’t do this.) That is, if we say that the AI’s utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only way for it to be dependent on the counterfactuals.
But as So8res pointed out, just using counterfactuals directly is a big problem, because it rapidly brings us out of distribution. A practical implementation that goes beyond the stop button problem ends up having an exponential number of counterfactuals to cover, and the vast majority of those counterfactuals will be far outside the real-world distribution. This means that the AI might not get aligned at all, because the real-world scenarios get only negligible weight in the utility function.
But I think I’ve figured out a solution now, which I’d call conditional+counterfactual corrigibility. As usual, let’s use B to denote that the stop button gets pressed and the AI shuts down, V to denote whichever non-corrigible utility function we want to make corrigible, and X_s/X_f to denote the counterfactual where people do (s) or do not (f) want to press the stop button. However, we will also use S and F to denote the conditions where people do or do not want to press the stop button. In that case, we can define Control(C) to mean that humans can control whether the AI stops or runs in condition C:
Control(C) = E[B_s|C] + E[V_f|C]
and then we simply want to define the utility as saying that people can control the AI in both the S and the F conditions:
U = Control(S) + Control(F)
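To make the bookkeeping concrete, here is a minimal Monte Carlo sketch of this utility in a toy structural causal model. Everything about the toy (the Policy fields, the probabilities, the mechanisms) is an illustrative assumption of mine; the only things taken from above are B, V, the S/F conditions, and the s/f counterfactuals, which are implemented by overriding the humans’ desire while reusing the same exogenous noise.

```python
import random
from dataclasses import dataclass

# Toy structural causal model for the stop-button setup.
# All mechanisms and numbers are illustrative assumptions, not part of the proposal itself.

@dataclass
class Policy:
    disables_button: bool  # does the AI seize control of the button?
    task_effort: float     # how hard it optimizes V when left running

def sample_noise(rng):
    # Exogenous randomness, shared between the factual and counterfactual worlds.
    return {"desire": rng.random(), "press": rng.random()}

def humans_want_stop(noise):
    # Factual mechanism for the humans' desire (purely exogenous in this toy).
    return noise["desire"] < 0.3

def simulate(policy, want_stop, noise):
    # Downstream mechanism: returns (B, V) given the humans' desire.
    pressed = want_stop and not policy.disables_button and noise["press"] < 0.95
    B = 1.0 if pressed else 0.0                 # the AI actually shuts down
    V = 0.0 if pressed else policy.task_effort  # task utility if it keeps running
    return B, V

def utility(policy, n=100_000, seed=0):
    rng = random.Random(seed)
    sums = {True: [0.0, 0.0, 0], False: [0.0, 0.0, 0]}  # keyed by the S/F condition
    for _ in range(n):
        noise = sample_noise(rng)
        cond = humans_want_stop(noise)           # are we in an S world or an F world?
        # Counterfactuals: override the desire, keep the exogenous noise fixed.
        B_s, _ = simulate(policy, True, noise)   # "s": people do want to press
        _, V_f = simulate(policy, False, noise)  # "f": people do not want to press
        sums[cond][0] += B_s
        sums[cond][1] += V_f
        sums[cond][2] += 1

    def control(cond):  # Control(C) = E[B_s|C] + E[V_f|C]
        b, v, k = sums[cond]
        return (b + v) / max(k, 1)

    return control(True) + control(False)        # U = Control(S) + Control(F)

print(utility(Policy(disables_button=False, task_effort=1.0)))  # ~3.9
print(utility(Policy(disables_button=True,  task_effort=1.0)))  # ~2.0
```

Even in this crude toy, the intended gradient shows up: a policy that grabs the button loses the E[B_s|C] term in both conditions and scores lower than one that leaves the button alone.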
Previously, I strongly emphasized the need to keep the AI “under a counterfactual”—that is, if it believed it could control whether humans want to stop it or not, then it would be incentivized to manipulate humans. But this is what brings us out of distribution. However, counterfactuals aren’t the only way to keep the appearance of a phenomenon constant—conditionals work too. And conditionals keep you nicely on distribution, so that’s now my solution to the distribution issues. This means that we can use much less invasive counterfactuals.
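Here is a small sketch of the distinction as I understand it, with all numbers being illustrative assumptions of mine: conditioning keeps you among the naturally occurring worlds that are consistent with the observation, while counterfactually forcing the variable attaches it to worlds where that value is wildly improbable.

```python
import random

rng = random.Random(1)

def world(force_desire=None):
    # Toy world: some alarming event makes people worried with prob 0.1, and
    # worried people want to press the stop button with prob 0.9 (calm: 0.01).
    worried = rng.random() < 0.1
    if force_desire is None:
        want_stop = rng.random() < (0.9 if worried else 0.01)
    else:
        want_stop = force_desire     # counterfactual: override the desire directly
    return worried, want_stop

# Conditioning on "people want to stop": keep only naturally occurring such worlds.
cond = [w for w in (world() for _ in range(100_000)) if w[1]]
print(sum(w[0] for w in cond) / len(cond))   # ~0.9: mostly worlds where the desire makes sense

# Intervening: force the desire in every world, however implausible it is there.
intv = [world(force_desire=True) for _ in range(100_000)]
print(sum(w[0] for w in intv) / len(intv))   # ~0.1: mostly worlds where it comes out of nowhere
```

In the conditional, the desire to stop comes packaged with whatever would normally cause it; in the pure counterfactual, it gets attached to worlds where nothing has caused it, which is exactly the kind of off-distribution weirdness the pure-counterfactual approach has to cover.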
That said, this approach is going to have a hard time with chaotic phenomena, as combining conditionals and counterfactuals in the presence of chaos can get pretty weird.
It might also be vulnerable to some variant of the critiques that were first raised against it, because the conditional now introduces a link between the AI’s policy and the scenarios it faces. I can’t immediately construct a case where this causes trouble, because the conditional would still somewhat tend to sabotage the obvious ways of exploiting it, but this sort of unclarity about the idea is concerning.
For example, suppose the AI immediately does something that looks very unsafe, very publicly. Say it grabs control over the stop button and starts mass-producing paperclips in an extremely visible way. This would probably lead to people wanting to stop it. So if it has a policy like that, the |S conditional would lead to people quickly wanting to stop it. This means that in the |S branch, it can quickly determine whether it is in the f|S branch or the s|S branch; in the f|S case it can keep going with whatever optimization V specifies, while in the s|S case it can immediately shut itself down.
But the reason I think the AI *wouldn’t* do this is: what about the |F branch? If you condition on humans not wanting to press the stop button even though there’s a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with are things like “black hole swallows the earth”, but such scenarios would rank pretty low in the AI’s utility function, and therefore it would avoid acting this way in order to keep the |F branch reasonable.
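As a crude numerical version of that argument (all mechanisms and numbers are, again, illustrative assumptions of mine): if the AI’s policy makes essentially everyone want to stop it, then conditioning on F selects for the rare worlds where something else has gone catastrophically wrong, and the E[V_f|F] term collapses.

```python
import random

def run_world(alarming_policy, rng):
    # An alarming policy makes nearly everyone want to stop the AI; the rare
    # exceptions are worlds where a catastrophe distracts everyone (and ruins V).
    catastrophe = rng.random() < 0.001
    if catastrophe:
        want_stop = False
    elif alarming_policy:
        want_stop = rng.random() < 0.999
    else:
        want_stop = rng.random() < 0.05
    return want_stop, catastrophe

def control_terms(alarming_policy, n=500_000, seed=2):
    rng = random.Random(seed)
    by_cond = {True: [], False: []}
    for _ in range(n):
        want_stop, catastrophe = run_world(alarming_policy, rng)
        B_s = 1.0                          # assume it does shut down when told to
        V_f = 0.0 if catastrophe else 1.0  # task value if left running
        by_cond[want_stop].append(B_s + V_f)
    S, F = by_cond[True], by_cond[False]
    return sum(S) / len(S), sum(F) / len(F)   # Control(S), Control(F)

print(control_terms(alarming_policy=False))   # ~(2.0, 2.0)
print(control_terms(alarming_policy=True))    # Control(F) drops toward ~1.5
```

So the utility does penalize the publicly-alarming policy, but the penalty routes entirely through these strange |F worlds, which is part of what makes the reasoning feel off.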
But this does not seem like sane reasoning on the AI’s side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.