I shared a blog post I wrote in Astral Codex, and someone suggested I try sharing it with the rationalist community (though it's somewhat polemical in tone). So here goes! The blog post is a little wider in scope, but I think the part relevant to the rationalist community is the rejection of the orthogonality thesis. The TL;DR is that the orthogonality thesis is often presented as a fact, but it seems to me that it's mostly a series of assertions, namely:
There is a large mind design space. Do we have any actual reasons for thinking so? Sure, one can argue everything has a large design space, but in practice, there’s often an underlying unique mechanism for how things work.
Ethics are not an emergent property of intelligence. But again, that's just an assertion; there's no reason to believe or disbelieve it. It's possible that self-reflection (and hence ethics and the ability to question one's goals and motivations) is a prerequisite for general cognition. We don't know whether this is true, because we don't really understand intelligence yet.
The previous two are assertions that could be true, but reflective stability is definitely not true—it’s paradoxical. To quote from my post:
This line of reasoning is absurd: it assumes an agent knows in advance the precise effects of self-improvement, but that's not how learning works! If you knew exactly how an alteration in your understanding of the world would impact you, you wouldn't need the alteration: to make that judgement, you'd have to be able to reason as though you had already undergone it. (Of course, you can predict some of the effects of a particular course of study or self-improvement. You know, for instance, that if you take a course in accounting, you'll become better at reading financial statements. But you have no idea what other effects that course will have on your worldview; it might, for instance, cause you to hate reading financial statements. If you could think exactly as you would after the course, you wouldn't need the course.)
So if the argument the OT proponents are making is that AI will not self-improve out of fear of jeopardising its commitment to its original goal, then the entire OT is moot, because AI will never risk self-improving at all.
(To tackle the Gandhi analogy head on: obviously Gandhi wouldn't take a pill if it were sold to him as 'if you take this, you'll start killing people'. But if he were told 'this pill will lead to enlightenment', and it turned out that an enlightened being is OK with murder, then he'd have to take it; otherwise, he'd be admitting that his injunction against murder is not enlightened. And ultimately, Gandhi's agenda wasn't simply non-violence; that was one aspect of a wider worldview and philosophy. To be logically consistent, AI doomers would need to argue that Gandhi wouldn't dare read anything new, for fear it might change his worldview.)
All this is not to suggest we shouldn’t take AI risk seriously, or that we shouldn’t proactively research alignment &c. But it strikes me as dogmatic to proclaim that doom is certain, and that orthogonality is a ‘fact’.
I don’t see how this relates to the Orthogonality Thesis. For a given value or goal, there may be many different cognitive mechanisms for figuring out how to accomplish it, or there may be few, or there may be only one unique mechanism. Different cognitive mechanisms (if they exist) might lead to the same or different conclusions about how to accomplish a particular goal.
For some goals, such as re-arranging all atoms in the universe in a particular pattern, it may be that there is only one effective way of accomplishing such a goal, so whether different cognitive mechanisms are able to find the strategy for accomplishing such a goal is mainly a question of how effective those cognitive mechanisms are. The Orthogonality Thesis is saying, in part, that figuring out how to do something is independent of wanting to do something, and that the space of possible goals and values is large. If I were smarter, I probably could figure out how to tile the universe with tiny squiggles, but I don’t want to do that, so I wouldn’t.
I don’t see what ability to self-reflect has to do with ethics. It’s probably true that anything superintelligent is capable, in some sense, of self-reflection, but why would that be a problem for the Orthogonality Thesis? Do you believe that an agent which terminally values tiny molecular squiggles would “question its goals and motivations” and conclude that creating squiggles is somehow “unethical”? If so, maybe review the metaethics sequence; you may be confused about what we mean around here when we talk about ethics, morality, and human values.
I think reflective stability, as it is usually used on LW, means something more narrow than how you’re interpreting it, and is not paradoxical. It’s usually used to describe a property of an agent following a particular decision theory. For example, a causal decision theory agent is not reflectively stable, because on reflection, it will regret not having pre-committed in certain situations. Logical decision theories are more reflectively stable in the sense that their adherents do not need to pre-commit to anything, and will therefore not regret not making any pre-commitments when reflecting on their own minds and decision processes, and how they would behave in hypothetical or future situations.
It relates to it because it's an explicit component of it, no? The point being that if there is only one way for general cognition to work, perhaps that way involves self-reflection by default, which brings us to the second point...
Yes, that's what I'm suggesting. I'm not saying it's definitely true, but it's not obviously wrong, either. I haven't read the sequence, but I'll try to find the time to do so. Basically, I question the wording 'terminally values': I think that perhaps general intelligence tends to avoid valuing anything terminally (what do we humans value terminally?)
Possibly, but I’m responding to its definition in the OT post I linked to, in which it’s used to mean that agents will avoid making changes that may affect their dedication to their goals.
So if you reject the Orthogonality Thesis, what map between capability and goals are you using instead?
Not an explicit map; I’m raising the possibility that capability leads to malleable goals.
It seems there is some major confusion going on here. It is, generally speaking, impossible to know the outcome of an arbitrary computation without actually running it, but that does not mean it's impossible to design a specific computation whose effects you know exactly. For example, one does not need to know the trillionth digit of pi in order to write a program that one could be very certain would compute that digit.
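As a minimal sketch of the pi-digit point (assuming the mpmath arbitrary-precision library; the function name is mine), one can be very confident that this returns the n-th decimal digit of pi without knowing in advance what that digit is:

```python
from mpmath import mp

def nth_pi_digit(n: int) -> int:
    """Return the n-th digit of pi after the decimal point (1-indexed)."""
    mp.dps = n + 10  # compute pi with a few guard digits of precision
    return int(mp.floor(mp.pi * 10**n)) % 10

print(nth_pi_digit(1))  # 1 (pi = 3.14159...)
print(nth_pi_digit(5))  # 9
```

A trillion digits would need a more specialised algorithm, but the predictability point is the same: you can reason about what the program will do without running it.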
You also seem to be too focused on minor modifications of a human-like mind, but focusing narrowly on minds misses the point: think in terms of optimization programs instead.
For many different kinds of X, it should be possible to write a program that, given a particular robotics apparatus (just the electromechanical parts, without a specific control algorithm), predicts which electrical signals sent to the robot's actuators would result in more X. You can then place that program inside the robot and wire the program's output to the robot's controls. The resulting robot does not "like" X; it's just robotically optimizing for X.
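A toy sketch of that picture (everything here is hypothetical and illustrative, not any particular system): a forward model scores candidate actuator signals by how much X they are predicted to produce, and the controller simply sends the highest-scoring one. Nothing in the loop depends on what X actually is.

```python
import random

def predict_x(state: float, signal: float) -> float:
    """Stand-in forward model: predicted amount of X if `signal` is sent from `state`.
    Here X is arbitrarily 'keep the state near zero'."""
    return -(state + signal) ** 2

def choose_signal(state: float, candidates: list) -> float:
    """Pick the actuator signal the model predicts will yield the most X."""
    return max(candidates, key=lambda s: predict_x(state, s))

state = 5.0
for _ in range(20):
    candidates = [random.uniform(-1.0, 1.0) for _ in range(50)]  # sampled control signals
    signal = choose_signal(state, candidates)
    state += signal  # stand-in for the actuators acting on the world
print(round(state, 3))  # the state has been driven towards 0
```

Swapping in a different predict_x changes what the robot optimizes for without touching the rest of the loop, which is roughly the independence being pointed at here.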
The orthogonality principle just says that there is nothing particularly special about human-aligned Xs that would make the X-robot more likely to work well for those Xs than for Xs that result in human extinction (X does not need to be specifically anti-human; extinction can follow from convergent instrumental goals alone).
This seems to me to apply only to self-improvement that modifies the outcome of decision-making, irrespective of time. How does this account for self-improvement that only serves to make decision-making more efficient?
If I have some highly inefficient code that finds the sum of two integers by first breaking them up into 10,000 smaller decimal values, randomly ordering them, and then adding them up serially, and I rewrite the code to do the same thing in far fewer operations, then I have self-improved without jeopardizing my goal.
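A sketch along those lines (details illustrative): both functions return the same sum for the same inputs, so swapping the first for the second improves efficiency without changing the goal or the outputs.

```python
import random

def slow_add(a: int, b: int) -> int:
    """Deliberately inefficient: split each integer into ~10,000 pieces,
    shuffle them, then sum the pieces serially."""
    pieces = []
    for n in (a, b):
        for _ in range(9_999):
            part = random.randint(-abs(n) - 1, abs(n) + 1)
            pieces.append(part)
            n -= part
        pieces.append(n)  # remainder, so each number's pieces sum back to it
    random.shuffle(pieces)
    total = 0
    for p in pieces:
        total += p
    return total

def fast_add(a: int, b: int) -> int:
    """The 'self-improved' rewrite: same input-output behaviour, far fewer operations."""
    return a + b

assert slow_add(123, 456) == fast_add(123, 456) == 579
```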
This kind of self improvement can still be fatal in the context of deceptively aligned systems.
But that’s not general intelligence; general intelligence requires considering a wider range of problems holistically, and drawing connections among them.
I can’t upvote this sadly, because I do not have the karma, but I would if I did.
There is another post about this as well.
Don't be too taken aback if you receive negative karma or some pushback; it is unfortunately expected for posts on this topic that take a position against the Orthogonality Thesis.
I don’t think there’s anything wrong with presenting arguments that the orthogonality thesis might be false. However, if those arguments are poorly argued or just rehash previously argued points without adding anything new then they’re likely to be downvoted.
I actually almost upvoted this because I want folks to discuss this topic, but I ultimately downvoted it because it doesn't make arguments that seem likely to convince anyone who believes the orthogonality thesis. It's mostly just pointing at a set of intuitions that cause surprise at the orthogonality thesis and trying to say it's "obviously" wrong without making a real case.
Less Wrong does have topics where I think the readership can be kind of dumb and downvote posts because they don’t want to hear about it, but this isn’t one of them.
To be fair, I’m not saying it’s obviously wrong; I’m saying it’s not obviously true, which is what many people seem to believe!
And Gordon Seidoh Worley is not saying there can't be good arguments against the orthogonality thesis that would deserve upvotes, just that this one is not one of them.