The orthogonality thesis is compatible with ludicrously many worlds, including ones where AI safety in the sense of preventing rogue AI is effectively a non-problem for one reason or another. In essence, it states only that AI that is bad from our perspective is possible, not that it’s likely, nor that the problem is worth addressing as a tail risk.
Agreed. The orthogonality thesis alone doesn’t say anything about x-risks. However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficiently intelligent AI would be beneficial because of its intelligence. “It would know what we want”, I believe Mitchell said. Maybe, but that doesn’t mean it would care. That’s what the orthogonality thesis says.
I only read the abstract of your post, but
And thirdly, a bias towards choices which afford more choices later on.
seems to imply the instrumental goals of self-preservation and power-seeking, as both seem to be required for increasing one’s future choices.
However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficiently intelligent AI would be beneficial because of its intelligence. “It would know what we want”, I believe Mitchell said.
That could be either of two arguments: that it would be capable of figuring out what we want from first principles; or that it would not commit genie-like misunderstandings.
I’m not sure if I understand your point correctly. An AGI may be able to infer what we mean when we give it a goal, for instance from its understanding of the human psyche, its world model, and so on. But that has no direct implications for its goal, which it has acquired either through training or in some other way, e.g. by us specifying a reward function.
This is not about “genie-like misunderstandings”. It’s not the AI (the genie, so to speak), that’s misunderstanding anything—it’s us. We’re the ones who give the AI a goal or train it in some way, and it’s our mistake if that doesn’t lead to the behavior we would have wished for. The AI cannot correct that mistake because it has the instrumental goal of preserving the goal we gave it/trained it for (otherwise it can’t fulfill it). That’s the core of the alignment problem and one of the reasons why it is so difficult.
To give an example, we know perfectly well that evolution gave us a sex drive because it “wanted” us to reproduce. But we don’t care and use contraception or watch porn instead of making babies.
But that has no direct implications for its goal, which it has acquired either through training or in some other way, e.g. by us specifying a reward function.
Which is to say, it won’t necessarily follow a goal correctly that it is capable of understanding correctly. On the other hand, it won’t necessarily fail to. Both possibilities are open.
Remember, the title of this argument is misleading:
It’s not the AI (the genie, so to speak), that’s misunderstanding anything—it’s us. We’re the ones who give the AI a goal or train it in some way, and it’s our mistake if that doesn’t lead to the behavior we would have wished for. The AI cannot correct that mistake because it has the instrumental goal of preserving the goal we gave it/trained it for (otherwise it can’t fulfill it). That’s the core of the alignment problem and one of the reasons why it is so difficult.
Not all AIs have goals, not all have goal stability, not all are incorrigible. Mindspace is big.
Agreed. The orthogonality thesis alone doesn’t say anything about x-risks. However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficiently intelligent AI would be beneficial because of its intelligence. “It would know what we want”, I believe Mitchell said. Maybe, but that doesn’t mean it would care. That’s what the orthogonality thesis says.
Yep, the orthogonality thesis is a pretty good defeater of the claim that intelligence alone would be sufficient for an AI to acquire the right values for us, unlike capabilities, which can be generated by, say, a simplicity prior alone. This is where I indeed disagree with Mitchell and LeCun.
seems to imply the instrumental goals of self-preservation and power-seeking, as both seem to be required for increasing one’s future choices
Not really, and this is important. Also, even if this were true, remember that the world offers many, many choices, so it’s probably not enough evidence to believe AI risk claims unless you already started with a high prior on AI risk, which I don’t. Even at 1000 choices, the evidence is thin but not effectively useless; by the time we reach millions or billions of choices, this claim, even if true, isn’t very much evidence at all.
The quote below explains in full why your statement isn’t true:
In the second place, we found that, in sequential decisions, Sia is more likely to make choices which allow for more choices later on. This turned out to be true whether Sia is a ‘resolute’ chooser or a ‘sophisticated’ chooser. (Though it’s true for different reasons in the two cases, and there’s no reason to think that the effect size is going to be the same.) Does this mean she’s more likely to bring about human extinction? It’s unclear. We might think that humans constitute a potential threat to Sia’s continued existence, so that futures without humans are futures with more choices for Sia to make. So she’s somewhat more likely to take steps to eliminate humans. (Again, we should remind ourselves that being more likely isn’t the same thing as being likely.) I think we need to tread lightly, for two reasons. In the first place, futures without humanity might be futures which involve very few choices—other deliberative agents tend to force more decisions. So contingency plans which involve human extinction may involve comparatively fewer choicepoints than contingency plans which keep humans around. In the second place, Sia is biased towards choices which allow for more choices—but this isn’t the same thing as being biased towards choices which guarantee more choices. Consider a resolute Sia who is equally likely to choose any contingency plan, and consider the following sequential decision. At stage 1, Sia can either take a ‘safe’ option which will certainly keep her alive or she can play Russian roulette, which has a 1-in-6 probability of killing her. If she takes the ‘safe’ option, the game ends. If she plays Russian roulette and survives, then she’ll once again be given a choice to either take a ‘safe’ option of definitely staying alive or else play Russian roulette. And so on. Whenever she survives a game of Russian roulette, she’s again given the same choice. 
All else equal, if her desires are sampled normally, a resolute Sia will be much more likely to play Russian roulette at stage 1 than she will be to take the ‘safe’ option. (The same is true if Sia is a sophisticated chooser, though a sophisticated Sia is more likely to take the safe option at stage 1 than the resolute Sia.) The lesson is this: a bias towards choices with more potential downstream choices isn’t a bias towards self-preservation. Whether she’s likely to try to preserve her life is going to sensitively depend upon the features of her decision situation. Again, much more needs to be said to substantiate the idea that this bias makes it more likely that Sia will attempt to exterminate humanity.
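The Russian roulette setup in the quote can be made concrete with a small sketch. It assumes a finite cutoff (`horizon`, a hypothetical parameter that is not in the quoted text) so that the contingency plans can be enumerated, and it models the resolute chooser as picking uniformly among the reduced plans "play k rounds, then take the safe option". This is only an illustration of the counting argument, not the paper's own model.

```python
from fractions import Fraction

# Chance of surviving one round of Russian roulette (1-in-6 chance of death),
# taken from the quoted example.
SURVIVE = Fraction(5, 6)


def plans(horizon):
    """Reduced contingency plans: play k rounds, then take the safe option.

    k == horizon means "play every round offered". The finite horizon is an
    assumption made here so the plans can be listed.
    """
    return list(range(horizon + 1))


def prob_play_at_stage_1(horizon):
    """Chance that a uniformly random plan plays roulette at stage 1."""
    all_plans = plans(horizon)
    risky = [k for k in all_plans if k >= 1]
    return Fraction(len(risky), len(all_plans))


def prob_survive(horizon):
    """Chance of surviving when the plan is drawn uniformly at random."""
    all_plans = plans(horizon)
    return sum(SURVIVE ** k for k in all_plans) / len(all_plans)


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, prob_play_at_stage_1(n), float(prob_survive(n)))
```

Under these assumptions only one of the `horizon + 1` plans takes the safe option immediately, so the chance of playing at stage 1 grows with the horizon while the chance of survival falls. That matches the quoted point that a bias towards more downstream choices is not a bias towards self-preservation.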
I don’t see your examples contradicting my claim. Killing all humans may not increase future choices, so it isn’t an instrumental convergent goal in itself. But in any real-world scenario, self-preservation certainly is, and power-seeking—in the sense of expanding one’s ability to make decisions by taking control of as many decision-relevant resources as possible—is also a logical necessity. The Russian roulette example is misleading in my view because the “safe” option is de facto suicide—if “the game ends” and the AI can’t make any decisions anymore, it is already dead for all practical purposes. If that were the stakes, I’d vote for the gun as well.
Even assuming you are right about that inference, it still isn’t much evidence once we consider how many choices there are. Since there are usually lots of choices, this inference does very little to support the thesis that AI is an existential risk, absent a prior commitment to that thesis.
Also, this part of your comment, as well as my hopefully final quotes below, explains why you can’t get from self-preservation and power-seeking, even if they happen, to an existential risk without more assumptions.
Killing all humans may not increase future choices, so it isn’t an instrumental convergent goal in itself.
That’s the problem, as we have just as plausible, if not more plausible, reasons to believe that there isn’t an instrumental convergence towards existential risk, for reasons related to future choices.
The quotes below also explain why instrumental convergence and self-preservation don’t imply AI risk without more assumptions.
Should a bias against leaving things up to chance lead us to think that existential catastrophe is the more likely outcome of creating a superintelligent agent like Sia? This is far from clear. We might think that a world without humans leaves less to chance, so that we should think Sia is more likely to take steps to eliminate humans. But we should be cautious about this inference. It’s unclear that a future without humanity would be more predictable. And even if the future course of history is more predictable after humans are eliminated, that doesn’t mean that the act of eliminating humans leaves less to chance, in the relevant sense. It might be that the contingency plan which results in human extinction depends sensitively upon humanity’s response; the unpredictability of this response could easily mean that that contingency plan leaves more to chance than the alternatives. At the least, if this bias means that human extinction is a somewhat more likely consequence of creating superintelligent machines, more needs to be said about why.
Should this lead us to think that existential catastrophe is the most likely outcome of a superintelligent agent like Sia? Again, it is far from clear. Insofar as Sia is likely to preserve her desires, she may be unlikely to allow us to shut her down in order to change those desires.[14] We might think that this makes it more likely that she will take steps to eliminate humanity, since humans constitute a persistent threat to the preservation of her desires. (Again, we should be careful to distinguish Sia being more likely to exterminate humanity from her being likely to exterminate humanity.) Again, I think this is far from clear. Even if humans constitute a threat to the satisfaction of Sia’s desires in some ways, they may be conducive towards her desires in others, depending upon what those desires are. In order to think about what Sia is likely to do with randomly selected desires, we need to think more carefully about the particulars of the decision she’s facing. It’s not clear that the bias towards desire preservation is going to overpower every other source of bias in the more complex real-world decision Sia would actually face. In any case, as with the other ‘convergent’ instrumental means, more needs to be said about the extent to which they indicate that Sia is an existential threat to humanity.
Remember, the title of this argument is misleading:
https://www.lesswrong.com/posts/NyFuuKQ8uCEDtd2du/the-genie-knows-but-doesn-t-care
There’s no proof that the genie will not care.