In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It’s a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land’s views. (Land gets credit for inspiring many of these thoughts, of course, but I’m presenting my views as my own here.)
First, let’s define the Orthogonality Thesis. Quoting Superintelligence for Bostrom’s formulation:
Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
To me, the main ambiguity about what this is saying is the “could in principle” part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others. Let’s consider Yudkowsky’s formulations as alternatives. Quoting Arbital:
The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.
The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.
As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
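To make that reading explicit (my gloss, not Yudkowsky's notation), writing $K(\cdot)$ for code/description complexity:

$$K(\text{agent with intelligence } I \text{ and goal } G) \approx K(I) + K(G) + O(1)$$

On this reading, the strong form denies any substantial mutual information between intelligence and goals; several of the arguments below target exactly that claim.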
I think, overall, it is more productive to examine Yudkowsky’s formulation than Bostrom’s, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky’s formulations, I am less likely to be criticizing a strawman. I will use “Weak Orthogonality” to refer to Yudkowsky’s “Orthogonality Thesis” and “Strong Orthogonality” to refer to Yudkowsky’s “strong form of the Orthogonality Thesis”.
Land, alternatively, describes a “diagonal” between intelligence and goals as an alternative to orthogonality, but I don’t see a specific formulation of a “Diagonality Thesis” on his part. Here’s a possible formulation:
Diagonality Thesis: Final goals tend to converge to a point as intelligence increases.
The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it.
What about my own view? I like Tsvi’s naming of it as an “obliqueness thesis”.
Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly.
(Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.)
While I will address Yudkowsky’s arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, it seems to me that arguments for and against the Orthogonality Thesis are not mathematically rigorous; therefore, I don’t need to present a mathematically rigorous case to contribute relevant considerations, so I will consider intuitive arguments relevant, and present multiple arguments rather than a single sequential argument (as I did with the more rigorous argument for many worlds).
Bayes/VNM point against Orthogonality
Some people may think that the free parameters in Bayes/VNM point towards the Orthogonality Thesis being true. I think, rather, that they point against Orthogonality. While they do function as arguments against the Diagonality Thesis, this is insufficient for Orthogonality.
First, on the relationship between intelligence and bounded rationality. It’s meaningless to talk about intelligence without a notion of bounded rationality. Perfect rationality in a complex environment is computationally intractable. With lower intelligence, bounded rationality is necessary. So, at non-extreme intelligence levels, the Orthogonality Thesis must be making a case that boundedly rational agents can have any computationally tractable goal.
Bayesianism and VNM expected utility optimization are known to be computationally intractable in complex environments. That is why algorithms like MCMC and reinforcement learning are used. So, making an argument for Orthogonality in terms of Bayesianism and VNM is simply dodging the question, by already assuming an extremely high intelligence level from the start.
As the Orthogonality Thesis refers to “values” or “final goals” (which I take to be synonymous), it must have a notion of the “values” of agents that are not extremely intelligent. These values cannot be assumed to be VNM, since VNM is not computationally tractable. Meanwhile, money-pumping arguments suggest that extremely intelligent agents will tend to converge to VNM-ish preferences. Thus:
Argument from Bayes/VNM: Agents with low intelligence will tend to have beliefs/values that are far from Bayesian/VNM. Agents with high intelligence will tend to have beliefs/values that are close to Bayesian/VNM. Strong Orthogonality is false because it is awkward to combine low intelligence with Bayesian/VNM beliefs/values, and awkward to combine high intelligence with far-from-Bayesian/VNM beliefs/values. Weak Orthogonality is in doubt, because having far-from-Bayesian/VNM beliefs/values puts a limit on the agent’s intelligence.
To summarize: un-intelligent agents cannot be assumed to be Bayesian/VNM from the start. Those arise at a limit of intelligence, and arguably have to arise due to money-pumping arguments. Beliefs/values therefore tend to become more Bayesian/VNM with high intelligence, contradicting Strong Orthogonality and perhaps Weak Orthogonality.
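For readers unfamiliar with money-pump arguments, here is a toy sketch (illustrative numbers, not from any source) of why far-from-VNM preferences are exploitable:

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A will pay a small
# fee for each trade "up" the cycle, and so can be walked around it indefinitely.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x is strictly preferred to y
fee = 1.0

holding, money = "A", 100.0
offers = ["C", "B", "A"]  # each offer is strictly preferred to the current holding
for step in range(9):
    offer = offers[step % 3]
    if (offer, holding) in prefers:  # the agent accepts any trade up to its preferences
        holding, money = offer, money - fee

print(holding, money)  # ends holding "A" again, but poorer: the preferences were exploitable
```

Pressure of this kind is one reason to expect highly intelligent agents to drift toward VNM-ish coherence, while less intelligent agents can persist with exploitable preferences.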
One could perhaps object that logical uncertainty allows even weak agents to be Bayesian over combined physical/mathematical uncertainty; I’ll address this consideration later.
Belief/value duality
It may be unclear why the Argument from Bayes/VNM refers to both beliefs and values, as the Orthogonality Thesis is only about values. It would, indeed, be hard to make the case that the Orthogonality Thesis is true as applied to beliefs. However, various arguments suggest that Bayesian beliefs and VNM preferences are “dual” such that complexity can be moved from one to the other.
Abram Demski has presented this general idea in the past, and I'll give a simple example to illustrate.

Let $A \in \mathcal{A}$ be the agent's action, and let $W \in \mathcal{W}$ represent the state of the world prior to (and unaffected by) the agent's action. Let $r(A, W)$ be the outcome resulting from the action and world. Let $P(w)$ be the primary agent's probability of a given world $w$, and let $U(o)$ be the primary agent's utility for outcome $o$. The primary agent finds an action $a$ to maximize $\sum_{w \in \mathcal{W}} P(w) U(r(a, w))$.
Now let $e$ be an arbitrary predicate on worlds. Consider modifying $P$ to increase the probability that $e(W)$ is true. That is:

$$P'(w) \propto P(w)(1 + [e(w)])$$

$$P'(w) = \frac{P(w)(1 + [e(w)])}{\sum_{w \in \mathcal{W}} P(w)(1 + [e(w)])}$$

where $[e(w)]$ equals 1 if $e(w)$, otherwise 0. Now, can we define a modified utility function $U'$ so a secondary agent with beliefs $P'$ and utility function $U'$ will take the same action as the primary agent? Yes:

$$U'(o) := \frac{U(o)}{1 + [e(w)]}$$
This secondary agent will find an action $a$ to maximize:

$$\sum_{w \in \mathcal{W}} P'(w) U'(r(a, w))$$

$$= \sum_{w \in \mathcal{W}} \frac{P(w)(1 + [e(w)])}{\sum_{w' \in \mathcal{W}} P(w')(1 + [e(w')])} \cdot \frac{U(r(a, w))}{1 + [e(w)]}$$

$$= \frac{1}{\sum_{w \in \mathcal{W}} P(w)(1 + [e(w)])} \sum_{w \in \mathcal{W}} P(w) U(r(a, w))$$
Clearly, this is a positive constant times the primary agent’s maximization target, so the secondary agent will take the same action.
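As a minimal numerical check of this (illustrative numbers; the tables below index utility by the pair (action, world) rather than by an explicit outcome object):

```python
# Belief/value duality: shifting probability toward e-worlds while discounting
# utility in those worlds leaves the agent's choice unchanged.
worlds = ["w0", "w1"]
actions = ["a0", "a1"]

P = {"w0": 0.5, "w1": 0.5}                    # primary agent's beliefs
U = {("a0", "w0"): 1.0, ("a0", "w1"): 0.0,    # utility of the outcome r(a, w)
     ("a1", "w0"): 0.0, ("a1", "w1"): 2.0}
e = lambda w: w == "w1"                       # arbitrary predicate on worlds

def best_action(P, U):
    """Action maximizing the sum over w of P(w) * U(r(a, w))."""
    return max(actions, key=lambda a: sum(P[w] * U[(a, w)] for w in worlds))

Z = sum(P[w] * (1 + e(w)) for w in worlds)                              # normalizer for P'
P2 = {w: P[w] * (1 + e(w)) / Z for w in worlds}                         # modified beliefs
U2 = {(a, w): U[(a, w)] / (1 + e(w)) for a in actions for w in worlds}  # modified utilities

assert best_action(P, U) == best_action(P2, U2)
print(best_action(P, U), best_action(P2, U2))  # both print "a1"
```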
This demonstrates a basic way that Bayesian beliefs and VNM utility are dual to each other. One could even model all agents as having the same utility function (of maximizing a random variable U) and simply having different beliefs about what U values are implied by the agent’s action and world state. Thus:
Argument from belief/value duality: From an agent’s behavior, multiple belief/value combinations are valid attributions. This is clearly true in the limiting Bayes/VNM case, suggesting it also applies in the case of bounded rationality. It is unlikely that the Strong Orthogonality Thesis applies to beliefs (including priors), so, due to the duality, it is also unlikely that it applies to values.
I consider this weaker than the Argument from Bayes/VNM. Someone might object that both values and a certain component of beliefs are orthogonal, while the other components of beliefs (those that change with more reasoning/intelligence) aren’t. But I think this depends on a certain factorizability of beliefs/values into the kind that change on reflection and those that don’t, and I’m skeptical of such factorizations. I think discussion of logical uncertainty will make my position on this clearer, though, so let’s move on.
Logical uncertainty as a model for bounded rationality
I’ve already argued that bounded rationality is essential to intelligence (and therefore the Orthogonality Thesis). Logical uncertainty is a form of bounded rationality (as applied to guessing the probabilities of mathematical statements). Therefore, discussing logical uncertainty is likely to be fruitful with respect to the Orthogonality Thesis.
Logical Induction is a logical uncertainty algorithm that produces a probability table for a finite subset of mathematical statements at each iteration. These beliefs are determined by a betting market of an increasing (up to infinity) number of programs that make bets, with the bets resolved by a “deductive process” that is basically a theorem prover. The algorithm is computable, though extremely computationally intractable, and has properties in the limit including some forms of Bayesian updating, statistical learning, and consistency over time.
We can see Logical Induction as evidence against the Diagonality Thesis: beliefs about undecidable statements (which exist in consistent theories due to Gödel's first incompleteness theorem) can take on any probability in the limit, though they satisfy properties such as consistency with other assigned probabilities (in a Bayesian-like manner).
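As a cartoon of that combination (a toy dynamic made up for illustration, not the Logical Induction algorithm itself): probabilities assigned to an undecidable sentence and its negation get pushed toward coherence by repeated "reflection", yet where they settle depends on where they started.

```python
def reflect(p_a, p_not_a, steps=1000, rate=0.1):
    """Nudge P(A) and P(not A) toward coherence (summing to 1) without pinning down P(A)."""
    for _ in range(steps):
        excess = (p_a + p_not_a) - 1.0   # incoherence that a trader could exploit
        p_a -= rate * excess / 2
        p_not_a -= rate * excess / 2
    return p_a, p_not_a

for p0 in (0.2, 0.5, 0.9):
    p_a, p_not_a = reflect(p0, 0.7)
    print(f"start P(A)={p0:.1f} -> P(A)={p_a:.2f}, P(~A)={p_not_a:.2f}, sum={p_a + p_not_a:.2f}")
```

Reflection changes the beliefs (so they are not Orthogonal to it), but does not drive them to a unique point (so they are not Diagonal either).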
However, (a) it is hard to know ahead of time which statements are actually undecidable, and (b) even beliefs about undecidable statements tend to change predictably over time toward Bayesian consistency with other beliefs about undecidable statements. So, Logical Induction does not straightforwardly factorize into a "belief-like" component (which converges given enough reflection) and a "value-like" component (which doesn't change on reflection). Thus:
Argument from Logical Induction: Logical Induction is a current best-in-class model of theoretical asymptotic bounded rationality. Logical Induction is non-Diagonal, but also clearly non-Orthogonal, and doesn’t apparently factorize into separate Orthogonal and Diagonal components. Combined with considerations from “Argument from belief/value duality”, this suggests that it’s hard to identify all value-like components in advanced agents that are Orthogonal in the sense of not tending to change upon reflection.
One can imagine, for example, introducing extra function/predicate symbols representing utility into the logical theory the logical inductor operates over. Logical Induction will tend to make judgments about these functions/predicates more consistent and inductively plausible over time, changing its judgments about the utilities of different outcomes towards plausible logical probabilities. This is an Oblique (non-Orthogonal and non-Diagonal) change in the interpretation of the utility symbols over time.
Likewise, Logical Induction can be specified to have beliefs over empirical facts such as observations by adding additional function/predicate symbols, and can perhaps update on these as they come in (although this might contradict UDT-type considerations). Through more iteration, Logical Inductors will come to have more approximately Bayesian, and inductively plausible, beliefs about these empirical facts, in an Oblique fashion.
Even if there is a way of factorizing out an Orthogonal value-like component from an agent, the belief-component (represented by something like Logical Induction) remains non-Diagonal, so there is still a potential “alignment problem” for these non-Diagonal components to match, say, human judgments in the limit. I don’t see evidence that these non-Diagonal components factor into a value-like “prior over the undecidable” that does not change upon reflection. So, there remain components of something analogous to a “final goal” (by belief/value duality) that are Oblique, and within the scope of alignment.
If it were possible to get the properties of Logical Induction in a Bayesian system, which makes Bayesian updates on logical facts over time, that would make it more plausible that an Orthogonal logical prior could be specified ahead of time. However, MIRI researchers have tried for a while to find Bayesian interpretations of Logical Induction, and failed, as would be expected from the Argument from Bayes/VNM.
Naive belief/value factorizations lead to optimization daemons
The AI alignment field has a long history of poking holes in alignment approaches. Oops, you tried making an oracle AI and it manipulated real-world outcomes to make its predictions true. Oops, you tried to do Solomonoff induction and got invaded by aliens. Oops, you tried getting agents to optimize over a virtual physical universe, and they discovered the real world and tried to break out. Oops, you ran a Logical Inductor and one of the traders manipulated the probabilities to instantiate itself in the real world.
These sub-processes that take over are known as optimization daemons. When you get the agent architecture wrong, sometimes a sub-process (that runs a massive search over programs, such as with Solomonoff Induction) will luck upon a better agent architecture and out-compete the original system. (See also a very strange post I wrote some years back while thinking about this issue, and Christiano’s comment relating it to Orthogonality).
If you apply a naive belief/value factorization to create an AI architecture, when compute is scaled up sufficiently, optimization daemons tend to break out, showing that this factorization was insufficient. Enough experiences like this lead to the conclusion that, if there is a realistic belief/value factorization at all, it will look pretty different from the naive one. Thus:
Argument from optimization daemons: Naive ways of factorizing an agent into beliefs/values tend to lead to optimization daemons, which have different values from those in the original factorization. Any successful belief/value factorization will probably look pretty different from the naive one, and might not take the form of factorization into Diagonal belief-like components and Orthogonal value-like components. Therefore, if any realistic formulation of Orthogonality exists, it will be hard to find and substantially different from naive notions of Orthogonality.
Intelligence changes the ontology values are expressed in
The most straightforward way to specify a utility function is to specify an ontology (a theory of what exists, similar to a database schema) and then provide a utility function over elements of this ontology. Prior to humans learning about physics, evolution (taken as a design algorithm for organisms involving mutation and selection) did not know all that human physicists know. Therefore, human evolutionary values are unlikely to be expressed in the ontology of physics as physicists currently understand it.
Human evolutionary values probably care about things like eating enough, social acceptance, proxies for reproduction, etc. It is unknown how these are specified, but perhaps sensory signals (such as stomach signals) are connected with a developing world model over time. Humans can experience vertigo at learning physics, e.g. thinking that free will and morality are fake, leading to unclear applications of native values to a realistic physical ontology. Physics has known gaps (such as quantum/relativity correspondence, and dark energy/dark matter) that suggest further ontology shifts.
One response to this vertigo is to try to solve the ontology identification problem; find a way of translating states in the new ontology (such as physics) to an old one (such as any kind of native human ontology), in a structure-preserving way, such that a utility function over the new ontology can be constructed as a composition of the original utility function and the new-to-old ontological mapping. Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented. Thus:
Argument from ontological change: More intelligent agents tend to change their ontology to be more realistic. Utility functions are most naturally expressed relative to an ontology. Therefore, there is a correlation between an agent's intelligence and utility function, through the agent's ontology as an intermediate variable, contradicting Strong Orthogonality. There is no known solution for rescuing the old utility function in the new ontology, and some research intuitions point towards any solution being unsatisfactory in some way.
If a satisfactory solution is found, I’ll change my mind on this argument, of course, but I’m not convinced such a satisfactory solution exists. To summarize: higher intelligence causes ontological changes, and rescuing old values seems to involve unnatural “warps” to make the new ontology correspond with the old one, contradicting at least Strong Orthogonality, and possibly Weak Orthogonality (if some values are simply incompatible with realistic ontology). Paperclips, for example, tend to appear most relevant at an intermediate intelligence level (around human-level), and become more ontologically unnatural at higher intelligence levels.
As a more general point, one expects possible mutual information between mental architecture and values, because values that "re-use" parts of the mental architecture achieve lower description length. For example, if the mental architecture involves creating universal algebra structures and finding analogies between them and the world, then values expressed in terms of such universal algebras will tend to have lower description complexity relative to the architecture. Such mutual information contradicts Strong Orthogonality, as some intelligence/value combinations are more natural than others.
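In description-length terms (again my gloss), writing $M$ for the mental architecture and $V$ for the values, re-use means

$$K(V \mid M) \ll K(V), \qquad \text{i.e.} \qquad I(M; V) > 0,$$

whereas the strong form, as read above, would have the complexities simply add, with no such savings available.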
Intelligence leads to recognizing value-relevant symmetries
Consider a number of unintuitive value propositions people have argued for:
Torture is preferable to Dust Specks, because it’s hard to come up with a utility function with the alternative preference without horrible unintuitive consequences elsewhere.
People are way too risk-averse in betting; the implied utility function has too strong diminishing marginal returns to be plausible.
You may think your personal identity is based on having the same atoms, but you're wrong, because you're distinguishing identical configurations.
You may think a perfect upload of you isn't conscious (and isn't basically another copy of you), but you're wrong, because the functionalist theory of mind is true.
You intuitively accept the premises of the Repugnant Conclusion, but not the Conclusion itself; you’re simply wrong about one of the premises, or the conclusion.
The point is not to argue for these, but to note that these arguments have been made and are relatively more accepted among people who have thought more about the relevant issues than people who haven’t. Thinking tends to lead to noticing more symmetries and dependencies between value-relevant objects, and tends to adjust values to be more mathematically plausible and natural. Of course, extrapolating this to superintelligence leads to further symmetries. Thus:
Argument from value-relevant symmetries: More intelligent agents tend to recognize more symmetries related to value-relevant entities. They will also tend to adjust their values according to symmetry considerations. This is an apparent value change, and it’s hard to see how it can instead be factored as a Bayesian update on top of a constant value function.
I’ll examine such factorizations in more detail shortly.
Human brains don’t seem to neatly factorize
This is less about the Orthogonality Thesis generally, and more about human values. If there were separable “belief components” and “value components” in the human brain, with the value components remaining constant over time, that would increase the chance that at least some Orthogonal component can be identified in human brains, corresponding with “human values” (though, remember, the belief-like component can also be Oblique rather than Diagonal).
However, human brains seem much messier than the sort of computer program that could factorize this way. Different brain regions are connected in at least some ways that are not well-understood. Additionally, even apparent "value components" may be analogous to something like a learned Q-function in deep Q-learning, which incorporates empirical updates in addition to pre-set "values".
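For instance, a learned Q-function is adjusted by experienced rewards via the standard temporal-difference update, so whatever "values" it encodes are entangled with empirical learning:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$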
The interaction between human brains and language is also relevant. Humans develop values they act on partly through language. And language (including language reporting values) is affected by empirical updates and reflection, thus non-Orthogonal. Reflecting on morality can easily change people’s expressed and acted-upon values, e.g. in the case of Peter Singer. People can change which values they report as instrumental or terminal even while behaving similarly (e.g. flipping between selfishness-as-terminal and altruism-as-terminal), with the ambiguity hard to resolve because most behavior relates to convergent instrumental goals.
Maybe language is more of an effect than a cause of values. But there really seems to be feedback from language to the non-linguistic brain functions that decide actions and so on. Attributing coherent values over realistic physics to the non-linguistic parts of the brain seems like a form of projection or anthropomorphism. Language and thought have a function in cognition and in attaining coherent values over realistic ontologies. Thus:
Argument from brain messiness: Human brains don't seem to neatly factorize into a belief-component and a value-component, with the value-component unaffected by reflection or language (which it would need to be in order to be Orthogonal). To the extent any value-component does not change due to language or reflection, it is restricted to evolutionary human ontology, which is unlikely to apply to realistic physics; language and reflection are part of the process that refines human values, rather than being an afterthought of them. Therefore, if the Orthogonality Thesis is true, humans lack identifiable values that fit into the values axis of the Orthogonality Thesis.
This doesn’t rule out that Orthogonality could apply to superintelligences, of course, but it does raise questions for the project of aligning superintelligences with human values; perhaps such values do not exist or are not formulated so as to apply to the actual universe.
Models of ASI should start with realism
Some may take arguments against Orthogonality to be disturbing at a value level, perhaps because they are attached to research projects such as Friendly AI (or more specific approaches), and think questioning foundational assumptions would make the objective (such as alignment with already-existing human values) less clear. I believe “hold off on proposing solutions” applies here: better strategies are likely to come from first understanding what is likely to happen absent a strategy, then afterwards looking for available degrees of freedom.
Quoting Yudkowsky:
Orthogonality is meant as a descriptive statement about reality, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.
Likewise, Obliqueness does not imply that we shouldn’t think about the future and ways of influencing it, that we should just give up on influencing the future because we’re doomed anyway, that moral realist philosophers are correct or that their moral theories are predictive of ASI, that ASIs are necessarily morally good, and so on. The Friendly AI research program was formulated based on descriptive statements believed at the time, such as that an ASI singleton would eventually emerge, that the Orthogonality Thesis is basically true, and so on. Whatever cognitive process formulated this program would have formulated a different program conditional on different beliefs about likely ASI trajectories. Thus:
Meta-argument from realism: Paths towards beneficially achieving human values (or analogues, if “human values” don’t exist) in the far future likely involve a lot of thinking about likely ASI trajectories absent intervention. The realistic paths towards human influence on the far future depend on realistic forecasting models for ASI, with Orthogonality/Diagonality/Obliqueness as alternative forecasts. Such forecasting models can be usefully thought about prior to formulation of a research program intended to influence the far future. Formulating and working from models of bounded rationality such as Logical Induction is likely to be more fruitful than assuming that bounded rationality will factorize into Orthogonal and Diagonal components without evidence in favor of this proposition. Forecasting also means paying more attention to the Strong Orthogonality Thesis than the Weak Orthogonality Thesis, as statistical correlations between intelligence and values will show up in such forecasts.
On Yudkowsky’s arguments
Now that I’ve explained my own position, addressing Yudkowsky’s main arguments may be useful. His main argument has to do with humans making paperclips instrumentally:
Suppose some strange alien came to Earth and credibly offered to pay us one million dollars’ worth of new wealth every time we created a paperclip. We’d encounter no special intellectual difficulty in figuring out how to make lots of paperclips.
That is, minds would readily be able to reason about:
How many paperclips would result, if I pursued a policy π0?
How can I search out a policy π that happens to have a high answer to the above question?
I believe it is better to think of the payment as coming in the far future and perhaps in another universe; that way, the belief about future payment is more analogous to terminal values than instrumental values. In this case, creating paperclips is a decent proxy for achievement of human value, so long-termist humans would tend to want lots of paperclips to be created.
I basically accept this, but, notably, Yudkowsky’s argument is based on belief/value duality. He thinks it would be awkward for the reader to imagine terminally wanting paperclips, so he instead asks them to imagine a strange set of beliefs leading to paperclip production being oddly correlated with human value achievement. Thus, acceptance of Yudkowsky’s premises here will tend to strengthen the Argument from belief/value duality and related arguments.
In particular, more intelligence would cause human-like agents to develop different beliefs about what actions aliens are likely to reward, and what numbers of paperclips different policies result in. This points towards Obliqueness as with Logical Induction: such beliefs will be revised (but not totally convergent) over time, leading to applying different strategies toward value achievement. And ontological issues around what counts as a paperclip will come up at some point, and likely be decided in a prior-dependent but also reflection-dependent way.
Beliefs about which aliens are most capable/honest likely depend on human priors, and are therefore Oblique: humans would want to program an aligned AI to mostly match these priors while revising beliefs along the way, but can’t easily factor out their prior for the AI to share.
Now onto other arguments. The “Size of mind design space” argument implies many agents exist with different values from humans, which agrees with Obliqueness (intelligent agents tend to have different values from unintelligent ones). It’s more of an argument about the possibility space than statistical correlation, thus being more about Weak than Strong Orthogonality.
The “Instrumental Convergence” argument doesn’t appear to be an argument for Orthogonality per se; rather, it’s a counter to arguments against Orthogonality based on noticing convergent instrumental goals. My arguments don’t take this form.
Likewise, “Reflective Stability” is about a particular convergent instrumental goal (preventing value modification). In an Oblique framing, a Logical Inductor will tend not to change its beliefs about even un-decidable propositions too often (as this would lead to money-pumps), so consistency is valued all else being equal.
While I could go into more detail responding to Yudkowsky, I think space is better spent presenting my own Oblique views for now.
Conclusion
As an alternative to the Orthogonality Thesis and the Diagonality Thesis, I present the Obliqueness Thesis, which says that increasing intelligence tends to lead to value changes but not total value convergence. I have presented arguments that advanced agents and humans do not neatly factor into Orthogonal value-like components and Diagonal belief-like components, using Logical Induction as a model of bounded rationality. This implies complications for theories of AI alignment that assume humans have values and that we need the AGI to agree with those values while increasing its intelligence (and thus changing its beliefs).
At a methodological level, I believe it is productive to start by forecasting default ASI using models of bounded rationality, especially known models such as Logical Induction, and further developing such models. I think this is more productive than assuming that these models will take the form of a belief/value factorization, although I have some uncertainty about whether such a factorization will be found.
If the Obliqueness Thesis is accepted, what possibility space results? One could think of this as steering a boat in a current of varying strength. Clearly, ignoring the current and just steering where you want to go is unproductive, as is going along with the current and not trying to steer at all. Getting to where one wants to go consists largely in going with the current (if it's strong enough) while charting a course that takes it into account.
Assuming Obliqueness, it’s not viable to have large impacts on the far future without accepting some value changes that come from higher intelligence (and better epistemology in general). The Friendly AI research program already accepts that paths towards influencing the far future involve “going with the flow” regarding superintelligence, ontology changes, and convergent instrumental goals; Obliqueness says such flows go further than just these, being hard to cleanly separate from values.
Obliqueness obviously leaves open the question of just how oblique. It’s hard to even formulate a quantitative question here. I’d very intuitively and roughly guess that intelligence and values are 3 degrees off (that is, almost diagonal), but it’s unclear what question I am even guessing the answer to. I’ll leave formulating and answering the question as an open problem.
I think Obliqueness is realistic, and that it’s useful to start with realism when thinking of how to influence the far future. Maybe superintelligence necessitates significant changes away from current human values; the Litany of Tarski applies. But this post is more about the technical thesis than emotional processing of it, so I’ll end here.
The Obliqueness Thesis
In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It’s a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land’s views. (Land gets credit for inspiring many of these thoughts, of course, but I’m presenting my views as my own here.)
First, let’s define the Orthogonality Thesis. Quoting Superintelligence for Bostrom’s formulation:
To me, the main ambiguity about what this is saying is the “could in principle” part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others. Let’s consider Yudkowsky’s formulations as alternatives. Quoting Arbital:
As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
I think, overall, it is more productive to examine Yudkowsky’s formulation than Bostrom’s, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky’s formulations, I am less likely to be criticizing a strawman. I will use “Weak Orthogonality” to refer to Yudkowsky’s “Orthogonality Thesis” and “Strong Orthogonality” to refer to Yudkowsky’s “strong form of the Orthogonality Thesis”.
Land, alternatively, describes a “diagonal” between intelligence and goals as an alternative to orthogonality, but I don’t see a specific formulation of a “Diagonality Thesis” on his part. Here’s a possible formulation:
Diagonality Thesis: Final goals tend to converge to a point as intelligence increases.
The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it.
What about my own view? I like Tsvi’s naming of it as an “obliqueness thesis”.
Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly.
(Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.)
While I will address Yudkowsky’s arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, it seems to me that arguments for and against the Orthogonality Thesis are not mathematically rigorous; therefore, I don’t need to present a mathematically rigorous case to contribute relevant considerations, so I will consider intuitive arguments relevant, and present multiple arguments rather than a single sequential argument (as I did with the more rigorous argument for many worlds).
Bayes/VNM point against Orthogonality
Some people may think that the free parameters in Bayes/VNM point towards the Orthogonality Thesis being true. I think, rather, that they point against Orthogonality. While they do function as arguments against the Diagonality Thesis, this is insufficient for Orthogonality.
First, on the relationship between intelligence and bounded rationality. It’s meaningless to talk about intelligence without a notion of bounded rationality. Perfect rationality in a complex environment is computationally intractable. With lower intelligence, bounded rationality is necessary. So, at non-extreme intelligence levels, the Orthogonality Thesis must be making a case that boundedly rational agents can have any computationally tractable goal.
Bayesianism and VNM expected utility optimization are known to be computationally intractable in complex environments. That is why algorithms like MCMC and reinforcement learning are used. So, making an argument for Orthogonality in terms of Bayesianism and VNM is simply dodging the question, by already assuming an extremely high intelligence level from the start.
As the Orthogonality Thesis refers to “values” or “final goals” (which I take to be synonymous), it must have a notion of the “values” of agents that are not extremely intelligent. These values cannot be assumed to be VNM, since VNM is not computationally tractable. Meanwhile, money-pumping arguments suggest that extremely intelligent agents will tend to converge to VNM-ish preferences. Thus:
Argument from Bayes/VNM: Agents with low intelligence will tend to have beliefs/values that are far from Bayesian/VNM. Agents with high intelligence will tend to have beliefs/values that are close to Bayesian/VNM. Strong Orthogonality is false because it is awkward to combine low intelligence with Bayesian/VNM beliefs/values, and awkward to combine high intelligence with far-from-Bayesian/VNM beliefs/values. Weak Orthogonality is in doubt, because having far-from-Bayesian/VNM beliefs/values puts a limit on the agent’s intelligence.
To summarize: un-intelligent agents cannot be assumed to be Bayesian/VNM from the start. Those arise at a limit of intelligence, and arguably have to arise due to money-pumping arguments. Beliefs/values therefore tend to become more Bayesian/VNM with high intelligence, contradicting Strong Orthogonality and perhaps Weak Orthogonality.
One could perhaps object that logical uncertainty allows even weak agents to be Bayesian over combined physical/mathematical uncertainty; I’ll address this consideration later.
Belief/value duality
It may be unclear why the Argument from Bayes/VNM refers to both beliefs and values, as the Orthogonality Thesis is only about values. It would, indeed, be hard to make the case that the Orthogonality Thesis is true as applied to beliefs. However, various arguments suggest that Bayesian beliefs and VNM preferences are “dual” such that complexity can be moved from one to the other.
Abram Demski has presented this general idea in the past, and I’ll give a simple example to illustrate.
Let A∈A be the agent’s action, and let W∈W represent the state of the world prior to / unaffected by the agent’s action Let r(A, W) be the outcome resulting from the action and world. Let P(w) be the primary agent’s probability a given world. Let U(o) be the primary agent’s utility for outcome o. The primary agent finds an action a to maximize ∑w∈WP(w)U(r(a,w)).
Now let e be an arbitrary predicate on worlds. Consider modifying P to increase the probability that e(W) is true. That is:
P′(w):∝P(w)(1+[e(w)])
P′(w)=P(w)(1+[e(w)])∑w∈WP(w)(1+[e(w)])
where [e(w)] equals 1 if e(w), otherwise 0. Now, can we define a modified utility function U’ so a secondary agent with beliefs P’ and utility function U’ will take the same action as the primary agent? Yes:
U′(o):=U(o)1+[e(w)]
This secondary agent will find an action a to maximize:
∑w∈WP′(w)U′(r(a,w))
=∑w∈WP(w)(1+[e(w)])∑w′∈WP(w′)(1+[e(w′)])U(r(a,w))1+[e(w)]
=1∑w∈WP(w)(1+[e(w)])∑w∈WP(w)U(r(a,w))
Clearly, this is a positive constant times the primary agent’s maximization target, so the secondary agent will take the same action.
This demonstrates a basic way that Bayesian beliefs and VNM utility are dual to each other. One could even model all agents as having the same utility function (of maximizing a random variable U) and simply having different beliefs about what U values are implied by the agent’s action and world state. Thus:
Argument from belief/value duality: From an agent’s behavior, multiple belief/value combinations are valid attributions. This is clearly true in the limiting Bayes/VNM case, suggesting it also applies in the case of bounded rationality. It is unlikely that the Strong Orthogonality Thesis applies to beliefs (including priors), so, due to the duality, it is also unlikely that it applies to values.
I consider this weaker than the Argument from Bayes/VNM. Someone might object that both values and a certain component of beliefs are orthogonal, while the other components of beliefs (those that change with more reasoning/intelligence) aren’t. But I think this depends on a certain factorizability of beliefs/values into the kind that change on reflection and those that don’t, and I’m skeptical of such factorizations. I think discussion of logical uncertainty will make my position on this clearer, though, so let’s move on.
Logical uncertainty as a model for bounded rationality
I’ve already argued that bounded rationality is essential to intelligence (and therefore the Orthogonality Thesis). Logical uncertainty is a form of bounded rationality (as applied to guessing the probabilities of mathematical statements). Therefore, discussing logical uncertainty is likely to be fruitful with respect to the Orthogonality Thesis.
Logical Induction is a logical uncertainty algorithm that produces a probability table for a finite subset of mathematical statements at each iteration. These beliefs are determined by a betting market of an increasing (up to infinity) number of programs that make bets, with the bets resolved by a “deductive process” that is basically a theorem prover. The algorithm is computable, though extremely computationally intractable, and has properties in the limit including some forms of Bayesian updating, statistical learning, and consistency over time.
We can see Logical Induction as evidence against the Diagonality Thesis: beliefs about undecidable statements (which exist in consistent theories due to Gödel’s first incompleteness theorem) can take on any probability in the limit, though satisfy properties such as consistency with other assigned probabilities (in a Bayesian-like manner).
However, (a) it is hard to know ahead of time which statements are actually undecidable, (b) even beliefs about undecidable statements tend to predictably change over time to Bayesian consistency with other beliefs about undecidable statements. So, Logical Induction does not straightforwardly factorize into a “belief-like” component (which converges on enough reflection) and a “value-like” component (which doesn’t change on reflection). Thus:
Argument from Logical Induction: Logical Induction is a current best-in-class model of theoretical asymptotic bounded rationality. Logical Induction is non-Diagonal, but also clearly non-Orthogonal, and doesn’t apparently factorize into separate Orthogonal and Diagonal components. Combined with considerations from “Argument from belief/value duality”, this suggests that it’s hard to identify all value-like components in advanced agents that are Orthogonal in the sense of not tending to change upon reflection.
One can imagine, for example, introducing extra function/predicate symbols into the logical theory the logical induction is over, to represent utility. Logical induction will tend to make judgments about these functions/predicates more consistent and inductively plausible over time, changing its judgments about the utilities of different outcomes towards plausible logical probabilities. This is an Oblique (non-Orthogonal and non-Diagonal) change in the interpretation of the utility symbol over time.
Likewise, Logical Induction can be specified to have beliefs over empirical facts such as observations by adding additional function/predicate symbols, and can perhaps update on these as they come in (although this might contradict UDT-type considerations). Through more iteration, Logical Inductors will come to have more approximately Bayesian, and inductively plausible, beliefs about these empirical facts, in an Oblique fashion.
Even if there is a way of factorizing out an Orthogonal value-like component from an agent, the belief-component (represented by something like Logical Induction) remains non-Diagonal, so there is still a potential “alignment problem” for these non-Diagonal components to match, say, human judgments in the limit. I don’t see evidence that these non-Diagonal components factor into a value-like “prior over the undecidable” that does not change upon reflection. So, there remain components of something analogous to a “final goal” (by belief/value duality) that are Oblique, and within the scope of alignment.
If it were possible to get the properties of Logical Induction in a Bayesian system, which makes Bayesian updates on logical facts over time, that would make it more plausible that an Orthogonal logical prior could be specified ahead of time. However, MIRI researchers have tried for a while to find Bayesian interpretations of Logical Induction, and failed, as would be expected from the Argument from Bayes/VNM.
Naive belief/value factorizations lead to optimization daemons
The AI alignment field has a long history of poking holes in alignment approaches. Oops, you tried making an oracle AI and it manipulated real-world outcomes to make its predictions true. Oops, you tried to do Solomonoff induction and got invaded by aliens. Oops, you tried getting agents to optimize over a virtual physical universe, and they discovered the real world and tried to break out. Oops, you ran a Logical Inductor and one of the traders manipulated the probabilities to instantiate itself in the real world.
These sub-processes that take over are known as optimization daemons. When you get the agent architecture wrong, sometimes a sub-process (that runs a massive search over programs, such as with Solomonoff Induction) will luck upon a better agent architecture and out-compete the original system. (See also a very strange post I wrote some years back while thinking about this issue, and Christiano’s comment relating it to Orthogonality).
If you apply a naive belief/value factorization to create an AI architecture, when compute is scaled up sufficiently, optimization daemons tend to break out, showing that this factorization was insufficient. Enough experiences like this lead to the conclusion that, if there is a realistic belief/value factorization at all, it will look pretty different from the naive one. Thus:
Argument from optimization daemons: Naive ways of factorizing an agent into beliefs/values tend to lead to optimization daemons, which have different values from in the original factorization. Any successful belief/value factorization will probably look pretty different from the naive one, and might not take the form of factorization into Diagonal belief-like components and Orthogonal value-like components. Therefore, if any realistic formulation of Orthogonality exists, it will be hard to find and substantially different from naive notions of Orthogonality.
Intelligence changes the ontology values are expressed in
The most straightforward way to specify a utility function is to specify an ontology (a theory of what exists, similar to a database schema) and then provide a utility function over elements of this ontology. Prior to humans learning about physics, evolution (taken as a design algorithm for organisms involving mutation and selection) did not know all that human physicists know. Therefore, human evolutionary values are unlikely to be expressed in the ontology of physics as physicists currently believe in.
Human evolutionary values probably care about things like eating enough, social acceptance, proxies for reproduction, etc. It is unknown how these are specified, but perhaps sensory signals (such as stomach signals) are connected with a developing world model over time. Humans can experience vertigo at learning physics, e.g. thinking that free will and morality are fake, leading to unclear applications of native values to a realistic physical ontology. Physics has known gaps (such as quantum/relativity correspondence, and dark energy/dark matter) that suggest further ontology shifts.
One response to this vertigo is to try to solve the ontology identification problem; find a way of translating states in the new ontology (such as physics) to an old one (such as any kind of native human ontology), in a structure-preserving way, such that a utility function over the new ontology can be constructed as a composition of the original utility function and the new-to-old ontological mapping. Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented. Thus:
Argument from ontological change: More intelligent agents tend to change their ontology to be more realistic. Utility functions are most naturally expressed relative to an ontology. Therefore, there is a correlation between an agent’s intelligence and utility function, through the agent’s ontology as an intermediate variable, contradicting Strong Orthogonality. There is no known solution for rescuing the old utility function in the new ontology, and some research intuitions pointing towards any solution being unsatisfactory in some way.
If a satisfactory solution is found, I’ll change my mind on this argument, of course, but I’m not convinced such a satisfactory solution exists. To summarize: higher intelligence causes ontological changes, and rescuing old values seems to involve unnatural “warps” to make the new ontology correspond with the old one, contradicting at least Strong Orthogonality, and possibly Weak Orthogonality (if some values are simply incompatible with realistic ontology). Paperclips, for example, tend to appear most relevant at an intermediate intelligence level (around human-level), and become more ontologically unnatural at higher intelligence levels.
As a more general point, one expects possible mutual information between mental architecture and values, because values that “re-use” parts of the mental architecture achieve lower description length. For example, if the mental architecture involves creating universal algebra structures and finding analogies between them and the world, then values expressed in terms of such universal algebras will tend to have lower relative description complexity to the architecture. Such mutual information contradicts Strong Orthogonality, as some intelligence/value combinations are more natural than others.
Intelligence leads to recognizing value-relevant symmetries
Consider a number of un-intutitive value propositions people have argued for:
Torture is preferable to Dust Specks, because it’s hard to come up with a utility function with the alternative preference without horrible unintuitive consequences elsewhere.
People are way too risk-averse in betting; the implied utility function has too strong diminishing marginal returns to be plausible.
You may think your personal identity is based on having the same atoms, but you’re wrong, because you’re distinguishing identical configurations.
You may think a perfect upload of you isn’t conscious (and basically another copy of you), but you’re wrong, because functionalist theory of mind is true.
You intuitively accept the premises of the Repugnant Conclusion, but not the Conclusion itself; you’re simply wrong about one of the premises, or the conclusion.
The point is not to argue for these, but to note that these arguments have been made and are relatively more accepted among people who have thought more about the relevant issues than people who haven’t. Thinking tends to lead to noticing more symmetries and dependencies between value-relevant objects, and tends to adjust values to be more mathematically plausible and natural. Of course, extrapolating this to superintelligence leads to further symmetries. Thus:
Argument from value-relevant symmetries: More intelligent agents tend to recognize more symmetries related to value-relevant entities. They will also tend to adjust their values according to symmetry considerations. This is an apparent value change, and it’s hard to see how it can instead be factored as a Bayesian update on top of a constant value function.
I’ll examine such factorizations in more detail shortly.
Human brains don’t seem to neatly factorize
This is less about the Orthogonality Thesis generally, and more about human values. If there were separable “belief components” and “value components” in the human brain, with the value components remaining constant over time, that would increase the chance that at least some Orthogonal component can be identified in human brains, corresponding with “human values” (though, remember, the belief-like component can also be Oblique rather than Diagonal).
However, human brains seem much more messy than the sort of computer program that could factorize this way. Different brain regions are connected in at least some ways that are not well-understood. Additionally, even apparent “value components” may be analogous to something like a deep Q-learning function, which incorporates empirical updates in addition to pre-set “values”.
The interaction between human brains and language is also relevant. Humans develop values they act on partly through language. And language (including language reporting values) is affected by empirical updates and reflection, thus non-Orthogonal. Reflecting on morality can easily change people’s expressed and acted-upon values, e.g. in the case of Peter Singer. People can change which values they report as instrumental or terminal even while behaving similarly (e.g. flipping between selfishness-as-terminal and altruism-as-terminal), with the ambiguity hard to resolve because most behavior relates to convergent instrumental goals.
Maybe language is more of an effect than cause of values. But there really seems to be feedback from language to non-linguistic brain functions that decide actions and so on. Attributing coherent values over realistic physics to the brain parts that are non-linguistic seems like a form of projection or anthropomorphism. Language and thought have a function in cognition and attaining coherent values over realistic ontologies. Thus:
Argument from brain messiness: Human brains don’t seem to neatly factorize into a belief-component and a value-component, with the value-component unaffected by reflection or language (which it would need to be Orthogonal). To the extent any value-component does not change due to language or reflection, it is restricted to evolutionary human ontology, which is unlikely to apply to realistic physics; language and reflection are part of the process that refines human values, rather than being an afterthought of them. Therefore, if the Orthogonality Thesis is true, humans lack identifiable values that fit into the values axis of the Orthogonality Thesis.
This doesn’t rule out that Orthogonality could apply to superintelligences, of course, but it does raise questions for the project of aligning superintelligences with human values; perhaps such values do not exist or are not formulated so as to apply to the actual universe.
Models of ASI should start with realism
Some may take arguments against Orthogonality to be disturbing at a value level, perhaps because they are attached to research projects such as Friendly AI (or more specific approaches), and think questioning foundational assumptions would make the objective (such as alignment with already-existing human values) less clear. I believe “hold off on proposing solutions” applies here: better strategies are likely to come from first understanding what is likely to happen absent a strategy, then afterwards looking for available degrees of freedom.
Quoting Yudkowsky:
Likewise, Obliqueness does not imply that we shouldn’t think about the future and ways of influencing it, that we should just give up on influencing the future because we’re doomed anyway, that moral realist philosophers are correct or that their moral theories are predictive of ASI, that ASIs are necessarily morally good, and so on. The Friendly AI research program was formulated based on descriptive statements believed at the time, such as that an ASI singleton would eventually emerge, that the Orthogonality Thesis is basically true, and so on. Whatever cognitive process formulated this program would have formulated a different program conditional on different beliefs about likely ASI trajectories. Thus:
Meta-argument from realism: Paths towards beneficially achieving human values (or analogues, if “human values” don’t exist) in the far future likely involve a lot of thinking about likely ASI trajectories absent intervention. The realistic paths towards human influence on the far future depend on realistic forecasting models for ASI, with Orthogonality/Diagonality/Obliqueness as alternative forecasts. Such forecasting models can be usefully thought about prior to the formulation of a research program intended to influence the far future. Formulating and working from models of bounded rationality such as Logical Induction is likely to be more fruitful than assuming, without evidence, that bounded rationality will factorize into Orthogonal and Diagonal components. Forecasting also means paying more attention to the Strong Orthogonality Thesis than to the Weak Orthogonality Thesis, since statistical correlations between intelligence and values will show up in such forecasts.
On Yudkowsky’s arguments
Now that I’ve explained my own position, addressing Yudkowsky’s main arguments may be useful. His main argument has to do with humans making paperclips instrumentally: a thought experiment in which aliens offer to pay humans for producing paperclips.
I believe it is better to think of the payment as coming in the far future and perhaps in another universe; that way, the belief about future payment is more analogous to terminal values than instrumental values. In this case, creating paperclips is a decent proxy for achievement of human value, so long-termist humans would tend to want lots of paperclips to be created.
I basically accept this, but, notably, Yudkowsky’s argument is based on belief/value duality. He thinks it would be awkward for the reader to imagine terminally wanting paperclips, so he instead asks them to imagine a strange set of beliefs leading to paperclip production being oddly correlated with human value achievement. Thus, acceptance of Yudkowsky’s premises here will tend to strengthen the Argument from belief/value duality and related arguments.
In particular, more intelligence would cause human-like agents to develop different beliefs about what actions the aliens are likely to reward, and about what numbers of paperclips different policies result in. This points towards Obliqueness, as with Logical Induction: such beliefs will be revised over time (without totally converging), leading the agent to apply different strategies toward value achievement. And ontological issues around what counts as a paperclip will come up at some point, and will likely be decided in a prior-dependent but also reflection-dependent way.
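As a toy illustration of how much of the “paperclip-liking” behavior in this thought experiment lives in the belief component (a sketch with invented numbers and names, not a model of any actual agent): an agent whose terminal goal is far-future payout produces paperclips only as long as its credences about what the payers reward make that the best policy; revising those credences flips the policy with no change to the terminal goal.

```python
# Toy expected-payout comparison for the alien-payment thought experiment.
# All policy names, credences, and outputs are invented for illustration.

def expected_payout(policy, beliefs, outputs):
    """Expected far-future payout: sum over hypotheses about what the aliens
    reward, weighted by the agent's credence in each hypothesis."""
    return sum(credence * outputs[policy].get(item, 0)
               for item, credence in beliefs.items())

outputs = {
    "run_paperclip_factory": {"paperclips": 1_000_000},
    "run_staple_factory": {"staples": 1_200_000},
}

# Initial credences about what the aliens will actually pay for.
beliefs = {"paperclips": 0.8, "staples": 0.2}
print(max(outputs, key=lambda p: expected_payout(p, beliefs, outputs)))
# -> run_paperclip_factory

# After more reflection/evidence, the credences shift; the chosen policy flips
# even though the terminal goal (maximize far-future payout) never changed.
beliefs = {"paperclips": 0.1, "staples": 0.9}
print(max(outputs, key=lambda p: expected_payout(p, beliefs, outputs)))
# -> run_staple_factory
```

Under Obliqueness, the further point is that such credences (and the ontology of “paperclip” itself) keep being revised with more intelligence, rather than being fixed parameters handed to the agent once.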
Beliefs about which aliens are most capable/honest likely depend on human priors, and are therefore Oblique: humans would want to program an aligned AI to mostly match these priors while revising beliefs along the way, but can’t easily factor out their prior for the AI to share.
Now on to other arguments. The “Size of mind design space” argument implies that many agents exist with values different from humans’, which agrees with Obliqueness (intelligent agents tend to have different values from unintelligent ones). It’s more an argument about the possibility space than about statistical correlation, and is thus more relevant to Weak than to Strong Orthogonality.
The “Instrumental Convergence” argument doesn’t appear to be an argument for Orthogonality per se; rather, it’s a counter to arguments against Orthogonality based on noticing convergent instrumental goals. My arguments don’t take this form.
Likewise, “Reflective Stability” is about a particular convergent instrumental goal (preventing value modification). In an Oblique framing, a Logical Inductor will tend not to change its beliefs too often, even about undecidable propositions (as doing so would open it up to money-pumps), so consistency is valued, all else being equal.
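Here is a minimal sketch of the money-pump intuition (a simplified single-sentence market with an invented trading rule, not the actual Logical Induction formalism): if a reasoner’s price for an undecidable sentence keeps oscillating, a trader that buys low and sells high extracts profit that grows without bound, which the logical induction criterion forbids; prices that settle down leave no such opportunity.

```python
# Sketch of how oscillating credences get money-pumped.
# The price sequences and the trading rule are invented for illustration.

def exploit(prices, low=0.3, high=0.7):
    """Buy a share of the sentence whenever its price is low, sell whenever
    it is high; oscillating prices keep handing this trader profit."""
    profit, holdings = 0.0, 0
    for p in prices:
        if p <= low:
            holdings += 1
            profit -= p       # pay the current price for a share
        elif p >= high and holdings > 0:
            holdings -= 1
            profit += p       # sell a share back at the higher price
    return profit

oscillating = [0.3, 0.7] * 50                          # credence keeps flip-flopping
settling = [0.5 + 0.2 / (t + 1) for t in range(100)]   # credence converges

print(exploit(oscillating))  # grows linearly with the number of oscillations
print(exploit(settling))     # 0.0 here: the price never dips low enough to buy
```

This is the sense in which “not changing beliefs too often” acts as a consistency pressure even for propositions that will never be decided.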
While I could go into more detail responding to Yudkowsky, I think space is better spent presenting my own Oblique views for now.
Conclusion
As an alternative to the Orthogonality Thesis and the Diagonality Thesis, I present the Obliqueness Thesis, which says that increasing intelligence tends to lead to value changes but not to total value convergence. I have presented arguments that advanced agents and humans do not neatly factor into Orthogonal value-like components and Diagonal belief-like components, using Logical Induction as a model of bounded rationality. This implies complications for theories of AI alignment that assume humans have values and that the task is to get an AGI to agree about those values while increasing its intelligence (and thus changing its beliefs).
At a methodological level, I believe it is productive to start by forecasting default ASI trajectories using models of bounded rationality, especially known models such as Logical Induction, and by further developing such models. I think this is more productive than assuming that these models will take the form of a belief/value factorization, although I have some uncertainty about whether such a factorization will be found.
If the Obliqueness Thesis is accepted, what possibility space results? One could think of it as steering a boat in a current of varying strength. Clearly, ignoring the current and just steering where you want to go is unproductive, as is going along with the current and not trying to steer at all. Getting where one wants to go consists largely in going with the current (when it is strong), while charting a course that takes it into account.
Assuming Obliqueness, it’s not viable to have large impacts on the far future without accepting some of the value changes that come from higher intelligence (and better epistemology in general). The Friendly AI research program already accepts that paths towards influencing the far future involve “going with the flow” regarding superintelligence, ontology changes, and convergent instrumental goals; Obliqueness says such flows go further than these, and are hard to cleanly separate from values.
Obliqueness obviously leaves open the question of just how oblique. It’s hard to even formulate a quantitative question here. Very intuitively and roughly, I’d guess that intelligence and values are about 3 degrees off from the diagonal (that is, almost diagonal), but it’s unclear what question I am even guessing the answer to. I’ll leave formulating and answering the question as an open problem.
I think Obliqueness is realistic, and that it’s useful to start with realism when thinking of how to influence the far future. Maybe superintelligence necessitates significant changes away from current human values; the Litany of Tarski applies. But this post is more about the technical thesis than emotional processing of it, so I’ll end here.