Contra Nora Belrose on the Orthogonality Thesis Being Trivial
I think this is wrong. For instance, without the orthogonality thesis, one might think that in order to improve the world, one should charge ahead in creating an AGI, which would automatically use its superior intelligence to figure out the meaning of life and achieve it.
You might think this is stupid, but some significant people believe it. For example, some guy named E. Yudkowsky who used to run an organization for creating AGI wrote an argument like this, but thanks to Eliezer Y.’s reasoning about the orthogonality thesis and related topics, he eventually changed his mind. Clearly the orthogonality thesis is nontrivially useful for cases like this.
The real issue with the orthogonality thesis, at least in its defensible version, is that it makes a claim that I think is true if we quantify over all possible intelligences, but this has basically no implications for AI safety, even if rationalists are right, because it is far too weak to support AI risk claims.
It only says that an AI can have arbitrary goals, not that it is likely to have or will have arbitrary goals. You can justify any probability of AI having bad goals, from 0 to 1 and everything in between.
Outside of its historical interest, it is completely useless, since it doesn’t constrain your anticipations very much.
I feel like you are not really addressing my argument. Don’t you agree that the orthogonality thesis shoots down E. Yudkowsky’s argument?
I agree, specifically, that the argument is bad, but someone might have other reasons for accelerating AI progress that do not depend on the orthogonality thesis being false.
I agree that the orthogonality thesis doesn’t by itself prove most things related to AI safety.
It feels way too easy to flip the sign:
« I think the orthogonality thesis is wrong. For instance, without rejecting the orthogonality thesis, one might think we should stop constructing AGI!
You might think this is stupid, but some significant people believe it. Clearly the orthogonality thesis is nontrivially confusing for cases like this. »
I think the sign flip would be:
If Nora Belrose wants to make that argument, then she can just do so.
Thanks to Eliezer Y. pushing the orthogonality thesis in rationalist circles, I don’t think anyone wants to make that argument, and that’s why I didn’t address it but instead just showed how it used to be believed.
So what’s the probability of a misaligned AI killing us? Is it close to 1.0? Is it enough to justify nuking chip fabs?
The OT doesn’t tell you—it doesn’t quantify probabilities.
Inasmuch as it doesn’t quantify probability, it is Useless.
An AI can non-automatically do something like that, i.e. constitutional AI. Edit: The OT is a valid argument against a particularly strong form of let-the-AI-figure-it-out, but not against all forms.
The orthogonality thesis tells you that attempting a logical proof that it is a good idea to make an AI is doomed to fail, because whether it’s a good idea depends on the goals you give it. This sort of logical proof might seem absurd, but I refer you to E. Yudkowsky’s argument to show how, prior to the popularization of the orthogonality thesis, it apparently seemed plausible to some people.
I’m not claiming that the orthogonality thesis is a knockdown argument with respect to everything, only that it’s important for directing the conversation towards productive topics like “what do we want the AI to do, and how will its goals come to be that?” rather than unproductive ones.
Current constitutional AI doesn’t do any backchaining, and is therefore limited in potential.
It seems easy to imagine that one could expand it to do backchaining using chain-of-thought prompting etc. In that case, the values of the AI would presumably be determined by human-like moral reasoning. The issue with human-like moral reasoning is that when humans do it, they tend to come up with wild and sketchy ideas like tiling the universe with hedonium (if utilitarian) or various other problematic things (if non-utilitarian). Given this track record, I’m not convinced constitutional AI scales to superintelligences.
It depends on whether it will be safe, in general. Not having goals, or having corrigible goals, are forms of safety, so it doesn’t all depend on the goals you initially give it.
As widely (mis)understood, it smuggles in the idea that AIs are necessarily goal-driven, that their goals are necessarily stable and incorrigible, etc. It’s not widely recognised that there is a wider orthogonality thesis: that mindspace also contains many combinations of capability and goal stability/instability. That weakens the MIRI/Yudkowsky argument that goal alignment has to be got right the first time.
>I’m not convinced constitutional AI scales to superintelligences.
I’m not convinced that ASI will happen overnight.
You can obviously create AIs that don’t try to achieve anything in the world, and sometimes they are useful for various reasons, but some people who are trying to achieve things in the world find it to be a good idea to make AIs that also try to achieve things in the world, and the existence of these people is sufficient to create existential risk.
But it’s not the OT telling you that.
The orthogonality thesis is indeed not sufficient to derive everything of AI safety, but that doesn’t mean it’s trivial.
I think there are two parts that together make one important point.
The first part is that an intelligence can have any goal (a.k.a. the orthogonality thesis).
The second part is that most arbitrarily selected goals are bad. (Not “bad from the selfish, short-sighted perspective of the puny human, but good from the viewpoint of a superior intellect”, but bad in a sense similar to how randomly rearranging the atoms in your body could hypothetically cure your cancer, but will almost certainly kill you instead.)
The point is, as a first approximation, “don’t build a super-intelligence with random goals, expecting that as it gets smart enough it will spontaneously converge towards good”.
If you have a way to reliably specify good goals, then you obviously don’t have to worry about the orthogonality thesis. As far as I know, that is not the case right now.
*
I wrote this before seeing the actual tweets. Now that I saw them, it seems to me that Yudkowsky basically agrees with my interpretation of him, which is based on things he wrote a decade ago, so at least the accusation about “rewriting history” is false.
(The part about “trivial, false, or unintelligible”, uhm, seems like it could just as well be said about e.g. natural selection. “What survives, survives” is trivial if you don’t also consider mutations. So do we have a motte and bailey of natural selection, where the motte just assumes selection without discussing the mechanism of mutations, and the bailey is the version that includes mutations? Does this constitute a valid argument against evolution?)
I agree these two things together make an important point, but I think the orthogonality thesis by itself also makes an important point as it disproves E. Yudkowsky’s style of argument in the doc I linked to in the OP.
In fact it seems that the linked argument relies on a version of the orthogonality thesis instead of being refuted by it:
Nothing about the argument contradicts “the true meaning of life”—which seems in that argument to be effectively defined as “whatever the AI ends up with as a goal if it starts out without a goal”—being e.g. paperclips.
The quoted section seems more like instrumental convergence than orthogonality to me?
In a sense, that’s its flaw: it’s supposed to be an argument that building a superintelligence is desirable because it will let you achieve the meaning of life, but since nothing contradicts “the meaning of life” being paperclips, you can substitute “convert the world into paperclips” into the argument without losing validity. Yet the argument that we should build a superintelligence because it lets us convert the world into paperclips is of course wrong, so one can go back and say that the original argument was wrong too.
But in order to accept that, one needs to accept the orthogonality thesis. If one doesn’t consider “maximize the number of charged-up batteries” to be a sufficiently plausible outcome of a superintelligence that it’s even worth consideration, then one is going to be stuck in this sort of reasoning.
The second part of the sentence, yes. The bolded one seems to acknowledge AIs can have different goals, and I assume that version of EY wouldn’t count “God” as a good goal.
Another more relevant part:
Presumably this goal object can be anything.
I agree that EY rejected the argument because he accepted OT. I very much disagree that this is the only way to reject the argument. In fact, all four positions seem quite possible:
1. Accept OT, accept the argument: sure, AIs can have different goals, but this (starting an AI without explicit goals) is how you get an AI which would figure out the meaning of life.
2. Reject OT, reject the argument: you can think “figure out the meaning of life” is not a possible AI goal.
3 and 4: EY’s positions at different times.
In addition, OT can itself be a reason to charge ahead with creating an AGI: since it says an AGI can have any goal, you “just” need to create an AGI which will improve the world. It says nothing about setting an AGI’s goal being difficult.
To the extent that the Orthogonality Thesis considers goals to be static or immutable, I think it is trivial.
I’ve advocated a lot for trying to consider goals to be mutable, as well as value functions being definable on other value functions. And not just that it will be possible or a good idea to instantiate value functions this way, but also that they will probably become mutable over time anyway.
All of that makes the Orthogonality Thesis—not false, but a lot easier to grapple with, I’d say.
FWIW I strongly agree with what I thought was the original form of the orthogonality thesis, i.e. that a highly intelligent agent could have arbitrary goals, but was confused by Eliezer’s tweet linked by Nora Belrose’s post. Eliezer:
This seems to me to be trivially correct if what is meant by “calculate which actions” means calculating down to every last detail.
On the other hand, if the claim were something like that we can’t know anything about what an AI would do without actually running it (which I do not think is the claim, just stating an extreme interpretation), then it seems obviously false.
And if it’s something in between, or something else entirely, then what specifically is the claim?
It’s also not clear to me why that trivial-seeming claim is supposed to be related to (or even a stronger form than) other varieties of the “weak” form of the orthogonality thesis.

Actually that’s obvious, I should have bothered thinking about it, like, at all. So, yeah, I can see why this formulation is useful: it’s both obviously correct and does imply that you can have something make a galaxy full of paperclips.

I think the orthogonality thesis holds strongly enough to shoot down E. Yudkowsky’s argument?
Explain what you mean? What are you calling the orthogonality thesis (note that part of my point is that different alleged versions of it don’t seem to be related), what are you calling “E. Yudkowsky’s argument”, and how does the orthogonality thesis shoot it down?

Ah I see, you are talking about the “meaning of life” document. Yes, the version of the orthogonality thesis that says that goals are not specified by intelligence (and also the version presented in Eliezer’s tweet) does shoot down that document. (My comment was about that tweet, so was largely unrelated to that document.)