That is not what the orthogonality thesis is about. It just claims that intelligence is independent of the goal one has. This is obviously true in my opinion—it is absolutely possible that a very intelligent system may pursue a goal that we would call “stupid”. The paperclip example Bostrom gave may not be the best choice, as it sounds too ridiculous, but it illustrates the point. To claim that the orthogonality thesis is “too weak” would require proof that a paperclip maximizer cannot exist even in theory.
I accept the orthogonality thesis, in the sense that a paperclip maximizer can exist, at least in theory: it is logically and physically possible. The reason I view it as too weak as evidence is that the orthogonality thesis is compatible with ludicrously many worlds, including ones where AI safety in the sense of preventing rogue AI is effectively a non-problem for one reason or another. In essence, it only states that bad AI from our perspective is possible, not that it’s likely or that it’s worth addressing the problem due to it being a tail risk.
Imagine if someone wrote an article in the New York Times claiming that halting oracles are possible, and that this would be very bad news for us, amounting to extinction, solely because it’s possible for us to go extinct this way.
The correct response here is that you should ignore the evidence and go with your priors. I see the orthogonality thesis a lot like this: It’s right, but the implied actions require way more evidence than it presents.
Given that the orthogonality thesis, even if true, shouldn’t shift our priors much, because it is very, very weak evidence, the fact that the orthogonality thesis is true doesn’t mean that LeCun is wrong unless something else is assumed.
IMO, this is also why I don’t find AI risk arguments that depend on instrumental convergence compelling: even if instrumental convergence is true, it does not, contra Bostrom, get you to existential risk without further assumptions that need to be tested empirically. It is still very compatible with a world where instrumental convergence is a non-problem in any number of ways, which means that, without more assumptions, LeCun could still be right that in practice instrumental convergence does not lead to existential risk or even leave us in a bad future.
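To make "very weak evidence shouldn't shift our priors much" concrete, here is a minimal sketch of Bayes' rule in odds form. The prior of 1/100 and the likelihood ratios are hypothetical numbers chosen only for illustration; nothing in this thread supplies actual values, and the function name `posterior` is my own.

```python
from fractions import Fraction

def posterior(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical numbers, for illustration only.
prior = Fraction(1, 100)
weak = posterior(prior, Fraction(11, 10))  # evidence that barely favors the hypothesis
strong = posterior(prior, Fraction(100))   # evidence that strongly favors it

print(float(prior), float(weak), float(strong))
```

Under these assumed numbers the weak evidence moves a 1/100 prior to only 1/91, while the strong evidence moves it to roughly one half. Whether the orthogonality thesis really has a likelihood ratio close to 1 is the point under dispute, not something the sketch settles.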
Some choice quotes to illustrate why instrumental convergence doesn’t buy us much evidence at all:
A few things to note. Firstly, when I say that there’s a ‘bias’ towards a certain kind of choice, I just mean that the probability that a superintelligent agent with randomly sampled desires (Sia) would make that choice is greater than 1/N, where N is the number of choices available. So, just to emphasize the scale of the effect: even if you were right about that inference, you should still assign very low probability to Sia taking steps to eliminate other agents.
It’s also worth emphasising that this bias only tells us that Sia is more likely to perform acts that leave less to chance than she is to perform acts which leave more to chance. It doesn’t tell us that she is overall likely to perform any particular act. Ask me to pick a number between one and one billion, and I’m more likely to select 500,000,000 than I am to select 456,034---humans have a bias towards round numbers. But that doesn’t mean I’m at all likely to select 500,000,000. So even if this tells us that Sia is somewhat more likely to exterminate humanity than she is to dedicate herself to dancing the Macarena, or gardening, or what-have-you, that doesn’t mean that she’s particularly likely to exterminate humanity.
So my point is that even accepting the orthogonality thesis, and now instrumental convergence as defined in the post above, isn’t enough to lead to the conclusion that AI existential risk is very probable, without more assumptions. In particular, Bostrom’s telling of the instrumental convergence story is mostly invalid. In essence, even if LWers are right, the evidence it buys them is far less than they think, and most of the worrisome conclusions aren’t supported unless you already have a high prior on AI risk.
So my point is that even accepting the orthogonality thesis, and now instrumental convergence as defined in the post above, isn’t enough to lead to the conclusion that AI existential risk is very probable, without more assumptions.
Strong agree. The OT itself is not an argument for AI danger: it needs to be combined with other claims.
The random potshot version of the OT argument is one way of turning possibilities into probabilities.
Many of the minds in mindspace are indeed weird and unfriendly to humans, but that does not make it likely that the AIs we will construct will be. You can argue for the likelihood of eldritch AI on the assumption that any attempt to build an AI is a random potshot into mindspace, in which the chance of building an eldritch AI is high, because there are a lot of them, and a random potshot hits any individual mind with the same likelihood as any other. But the random potshot assumption is obviously false. We don’t want to take a random potshot, and couldn’t if we wanted to, because we are constrained by our limitations and biases.
To reply in Stuart Russell’s words: “One of the most common patterns involves omitting something from the objective that you do actually care about. In such cases … the AI system will often find an optimal solution that sets the thing you do care about, but forgot to mention, to an extreme value.”
There are vastly more possible worlds that we humans can’t survive in than those we can, let alone live comfortably in. Agreed, “we don’t want to make a random potshot”, but making an agent that transforms our world into one of the rare ones we want to live in is hard because we don’t know how to describe that world precisely.
Eliezer Yudkowsky’s rocket analogy also illustrates this very vividly: If you want to land on Mars, it’s not enough to point a rocket in the direction where you can currently see the planet and launch it. You need to figure out all kinds of complicated things about gravity, propulsion, planetary motions, solar winds, etc. But our knowledge of these things is about as detailed as that of the ancient Romans, to stay in the analogy.
I agree with that, and I also agree with Yann LeCun’s intention of “not being stupid enough to create something that we couldn’t control”. I even think not creating an uncontrollable AI is our only hope. I’m just not sure whether I trust humanity (including Meta) to be “not stupid”.
the orthogonality thesis is compatible with ludicrously many worlds, including ones where AI safety in the sense of preventing rogue AI is effectively a non-problem for one reason or another. In essence, it only states that bad AI from our perspective is possible, not that it’s likely or that it’s worth addressing the problem due to it being a tail risk.
Agreed. The orthogonality thesis alone doesn’t say anything about x-risks. However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficiently intelligent AI would be beneficial because of its intelligence. “It would know what we want”, I believe Mitchell said. Maybe, but that doesn’t mean it would care. That’s what the orthogonality thesis says.
I only read the abstract of your post, but
And thirdly, a bias towards choices which afford more choices later on.
seems to imply the instrumental goals of self-preservation and power-seeking, as both seem to be required for increasing one’s future choices.
However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficiently intelligent AI would be beneficial because of its intelligence. “It would know what we want”, I believe Mitchell said.
That could be either of two arguments: that it would be capable of figuring out what we want from first principles; or that it would not commit genie-like misunderstandings.
I’m not sure if I understand your point correctly. An AGI may be able to infer what we mean when we give it a goal, for instance from its understanding of the human psyche, its world model, and so on. But that has no direct implications for its goal, which it has acquired either through training or in some other way, e.g. by us specifying a reward function.
This is not about “genie-like misunderstandings”. It’s not the AI (the genie, so to speak) that’s misunderstanding anything; it’s us. We’re the ones who give the AI a goal or train it in some way, and it’s our mistake if that doesn’t lead to the behavior we would have wished for. The AI cannot correct that mistake because it has the instrumental goal of preserving the goal we gave it/trained it for (otherwise it can’t fulfill it). That’s the core of the alignment problem and one of the reasons why it is so difficult.
To give an example, we know perfectly well that evolution gave us a sex drive because it “wanted” us to reproduce. But we don’t care and use contraception or watch porn instead of making babies.
But that has no direct implications for its goal, which it has acquired either through training or in some other way, e.g. by us specifying a reward function.
Which is to say, it won’t necessarily follow a goal correctly that it is capable of understanding correctly. On the other hand, it won’t necessarily fail to. Both possibilities are open.
Remember, the title of this argument is misleading:
It’s not the AI (the genie, so to speak) that’s misunderstanding anything; it’s us. We’re the ones who give the AI a goal or train it in some way, and it’s our mistake if that doesn’t lead to the behavior we would have wished for. The AI cannot correct that mistake because it has the instrumental goal of preserving the goal we gave it/trained it for (otherwise it can’t fulfill it). That’s the core of the alignment problem and one of the reasons why it is so difficult.
Not all AIs have goals, not all have goal stability, not all are incorrigible. Mindspace is big.
Agreed. The orthogonality thesis alone doesn’t say anything about x-risks. However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficiently intelligent AI would be beneficial because of its intelligence. “It would know what we want”, I believe Mitchell said. Maybe, but that doesn’t mean it would care. That’s what the orthogonality thesis says.
Yep, the orthogonality thesis is a pretty good defeater of the claim that intelligence alone would be sufficient for an AI to gain the right values for us, unlike capabilities, which can be generated by, say, a simplicity prior alone. This is where I indeed disagree with Mitchell and LeCun.
seems to imply the instrumental goals of self-preservation and power-seeking, as both seem to be required for increasing one’s future choices
Not really, and this is important. Also, even if this were true, remember that given the world has many, many choices, it’s probably not enough evidence to believe AI risk claims unless you already started with a high prior on AI risk, which I don’t. At 1000 choices, the evidence is thin but not effectively useless; by the time we reach millions or billions of choices, this claim, even if true, isn’t very much evidence at all.
The quote below explains in full why your statement isn’t true:
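As a rough illustration of the scale argument, here is a small sketch of how little a bias buys once the number of choices is large. It assumes a deliberately crude model in which a "bias" simply multiplies the uniform baseline 1/N by some factor; the factor of 10 and the function name `biased_choice_probability` are hypothetical, not taken from the post.

```python
from fractions import Fraction

def biased_choice_probability(n_choices, bias_factor):
    """Probability of one particular choice when a bias multiplies the
    uniform baseline 1/N by bias_factor (capped at 1)."""
    uniform = Fraction(1, n_choices)
    return min(Fraction(1), bias_factor * uniform)

# Hypothetical bias factor of 10, for illustration only.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, float(biased_choice_probability(n, 10)))
```

Under this toy model, a tenfold bias gives a 1-in-100 chance at 1000 choices but only 1-in-100-million at a billion choices. "More likely than 1/N" and "likely" come apart quickly as N grows.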
In the second place, we found that, in sequential decisions, Sia is more likely to make choices which allow for more choices later on. This turned out to be true whether Sia is a ‘resolute’ chooser or a ‘sophisticated’ chooser. (Though it’s true for different reasons in the two cases, and there’s no reason to think that the effect size is going to be the same.) Does this mean she’s more likely to bring about human extinction? It’s unclear. We might think that humans constitute a potential threat to Sia’s continued existence, so that futures without humans are futures with more choices for Sia to make. So she’s somewhat more likely to take steps to eliminate humans. (Again, we should remind ourselves that being more likely isn’t the same thing as being likely.) I think we need to tread lightly, for two reasons. In the first place, futures without humanity might be futures which involve very few choices—other deliberative agents tend to force more decisions. So contingency plans which involve human extinction may involve comparatively fewer choicepoints than contingency plans which keep humans around. In the second place, Sia is biased towards choices which allow for more choices—but this isn’t the same thing as being biased towards choices which guarantee more choices. Consider a resolute Sia who is equally likely to choose any contingency plan, and consider the following sequential decision. At stage 1, Sia can either take a ‘safe’ option which will certainly keep her alive or she can play Russian roulette, which has a 1-in-6 probability of killing her. If she takes the ‘safe’ option, the game ends. If she plays Russian roulette and survives, then she’ll once again be given a choice to either take a ‘safe’ option of definitely staying alive or else play Russian roulette. And so on. Whenever she survives a game of Russian roulette, she’s again given the same choice. 
All else equal, if her desires are sampled normally, a resolute Sia will be much more likely to play Russian roulette at stage 1 than she will be to take the ‘safe’ option. (The same is true if Sia is a sophisticated chooser, though a sophisticated Sia is more likely to take the safe option at stage 1 than the resolute Sia.) The lesson is this: a bias towards choices with more potential downstream choices isn’t a bias towards self-preservation. Whether she’s likely to try to preserve her life is going to sensitively depend upon the features of her decision situation. Again, much more needs to be said to substantiate the idea that this bias makes it more likely that Sia will attempt to exterminate humanity.
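The Russian roulette example can be sketched by counting contingency plans. This is only a toy reconstruction under two assumptions of mine: the game is capped at a finite number of stages (`horizon`, a hypothetical parameter), and a resolute Sia picks uniformly among plans, as in the "equally likely to choose any contingency plan" setup quoted above. A plan is identified by how many times she plays before taking the safe option.

```python
from fractions import Fraction

def enumerate_plans(horizon):
    """Plan k means: play roulette k times (while still alive), then take the
    safe option. k == horizon means playing at every available stage."""
    return list(range(horizon + 1))

def p_safe_at_stage_one(horizon):
    """Chance that a uniformly chosen plan takes the safe option immediately."""
    plans = enumerate_plans(horizon)
    safe_first = [k for k in plans if k == 0]
    return Fraction(len(safe_first), len(plans))

def survival_probability(k):
    """Chance of surviving k rounds, each with a 1-in-6 chance of death."""
    return Fraction(5, 6) ** k

print(p_safe_at_stage_one(9))   # only one of the ten plans is 'safe first'
print(survival_probability(2))  # chance of surviving two rounds
```

In this sketch only one plan out of `horizon + 1` starts with the safe option, so a uniform choice over plans mostly picks roulette even though every extra round lowers the survival probability. That is the sense in which a bias towards more downstream choices is not a bias towards self-preservation. The post's own result concerns randomly sampled desires rather than a literal uniform draw over plans, so treat this as an intuition pump, not a derivation.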
I don’t see your examples contradicting my claim. Killing all humans may not increase future choices, so it isn’t an instrumentally convergent goal in itself. But in any real-world scenario, self-preservation certainly is, and power-seeking (in the sense of expanding one’s ability to make decisions by taking control of as many decision-relevant resources as possible) is also a logical necessity. The Russian roulette example is misleading in my view because the “safe” option is de facto suicide: if “the game ends” and the AI can’t make any decisions anymore, it is already dead for all practical purposes. If that were the stakes, I’d vote for the gun as well.
Even assuming you are right about that inference, once we consider how many choices there are, it still isn’t much evidence at all. Given that there are usually lots of choices, this inference does very little to hold up the thesis that AI is an existential risk, unless you are already committed to AI being an existential risk.
Also, this part of your comment, as well as my hopefully final quotes below, explains why you can’t get from self-preservation and power-seeking, even if they happen, to an existential risk without more assumptions.
Killing all humans may not increase future choices, so it isn’t an instrumentally convergent goal in itself.
That’s the problem: we have just as plausible, if not more plausible, reasons to believe that there isn’t an instrumental convergence towards existential risk, for reasons related to future choices.
The quotes below also explain why instrumental convergence and self-preservation don’t imply AI risk without more assumptions.
Should a bias against leaving things up to chance lead us to think that existential catastrophe is the more likely outcome of creating a superintelligent agent like Sia? This is far from clear. We might think that a world without humans leaves less to chance, so that we should think Sia is more likely to take steps to eliminate humans. But we should be cautious about this inference. It’s unclear that a future without humanity would be more predictable. And even if the future course of history is more predictable after humans are eliminated, that doesn’t mean that the act of eliminating humans leaves less to chance, in the relevant sense. It might be that the contingency plan which results in human extinction depends sensitively upon humanity’s response; the unpredictability of this response could easily mean that that contingency plan leaves more to chance than the alternatives. At the least, if this bias means that human extinction is a somewhat more likely consequence of creating superintelligent machines, more needs to be said about why.
Should this lead us to think that existential catastrophe is the most likely outcome of a superintelligent agent like Sia? Again, it is far from clear. Insofar as Sia is likely to preserve her desires, she may be unlikely to allow us to shut her down in order to change those desires. We might think that this makes it more likely that she will take steps to eliminate humanity, since humans constitute a persistent threat to the preservation of her desires. (Again, we should be careful to distinguish Sia being more likely to exterminate humanity from her being likely to exterminate humanity.) Again, I think this is far from clear. Even if humans constitute a threat to the satisfaction of Sia’s desires in some ways, they may be conducive towards her desires in others, depending upon what those desires are. In order to think about what Sia is likely to do with randomly selected desires, we need to think more carefully about the particulars of the decision she’s facing. It’s not clear that the bias towards desire preservation is going to overpower every other source of bias in the more complex real-world decision Sia would actually face. In any case, as with the other ‘convergent’ instrumental means, more needs to be said about the extent to which they indicate that Sia is an existential threat to humanity.
Post below:
https://www.lesswrong.com/posts/w8PNjCS8ZsQuqYWhD/instrumental-convergence-draft
It’s difficult to create an aligned Sovereign, but easy not to create a Sovereign at all.
Remember, the title of this argument is misleading:
https://www.lesswrong.com/posts/NyFuuKQ8uCEDtd2du/the-genie-knows-but-doesn-t-care
There’s no proof that the genie will not care.