@adamShimi’s comment already listed what I think is the most important point: that you’re already implicitly assuming an aligned AI, one that wants to want what we humans would have told it to want if we knew how, and if we knew more precisely what we wanted it to want. You’re treating an AI’s goals as somehow separate from the code it executes. An AI’s goals aren’t what a human writes in a design document or verbally asks for; they’re what is written in its code and implicit in its wiring. The same holds for humans: our goals, in terms of what we will actually do, aren’t the instructions other humans give us, they’re implicit in the structure of our (self-rewiring) brains.
You’re also making an extraordinarily broad, strong, and precise claim about the content of the set of all possible minds. A priori, any such claim has at least billions of orders of magnitude more ways to be false than true. That’s the prior.
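To put rough numbers on that counting intuition (a toy formalization of my own, with the size of the mind-design space picked purely for illustration, not anything from the post or from Bostrom):

```latex
% Toy counting sketch (my own illustration). Suppose there are M possible mind
% designs, and the claim is "every design converges to the same goals."
% Counting which sets of designs could be counterexamples:
%   configurations where the universal claim holds: 1        (no counterexamples)
%   configurations where it fails:                  2^M - 1  (any nonempty set)
\[
\frac{\#\{\text{ways to be false}\}}{\#\{\text{ways to be true}\}}
  \;=\; \frac{2^{M}-1}{1} \;\approx\; 10^{\,0.301\,M},
\]
% so even a very modest M = 10^6 gives odds on the order of 10^300000 to 1
% against the claim before any evidence comes in.
```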
My layman’s understanding is that superintelligence + self-modification can automatically grant you 1) increasing instrumental capabilities, and 2) the ability to rapidly close the gap between wanting and wanting to want. (I would add that self-modification within a single running piece of hardware or software isn’t strictly necessary for this; it’s enough for an AI to be able to create its own successor and then shut itself down.)
Beyond that, this argument doesn’t hold. You point to human introspection as an example of what you think an AGI would automatically be inclined to want, because the humans who made it want it to want those things, or would if they better understood the implications of their own object- and meta-level wants. Actually your claim is stronger than that, because it requires that all possible mind designs achieve this kind of goal convergence fast enough to get there before causing massive or unrecoverable harm to humans. Even within the space of human minds, even for decisions and choices where our brains can easily self-modify in this way, and even for tasks well within our range of intellectual and emotional understanding, humans very often fail at this, sometimes spectacularly, whether we’re aware of the gap or not.
From another angle: how smart does an AI need to be to self-modify or create an as-smart or smarter successor? Clearly, less smart than its human creators had to be to create it, or the process could never have gotten started. And yet humans have been debating the same basic moral and political questions since at least the dawn of writing, including the same broad categories of plausible answers, without achieving convergence in what to want to want (which, again, is all that’s needed for an AI that can modify its goal structure to want whatever it wants to want). What I’m pointing to is that your argument in this post, I think, includes an implicit claim that logical necessity guarantees humankind as we currently exist will converge on the objectively correct moral philosophy before we destroy ourselves. I… don’t think that is a plausible claim, given how many times we’ve come close to destroying ourselves in the recent past, and how quickly we’re developing new and more powerful ways to do so through the actions of smaller and smaller groups of people.
You might also want to look at my argument in the top-level comment here, which engages more directly with Bostrom’s arguments for the orthogonality thesis. In brief, Bostrom says that all intelligence levels are compatible with all goals. I think this is false: some intelligence levels are incompatible with some goals. AI safety is still just as much of a risk either way, since many intelligence levels are compatible with many problematic goals. However, I don’t think Bostrom argues successfully for the orthogonality thesis, and I tried in the OP to illustrate a level of intelligence that is not compatible with any goal.
I don’t think anyone literally believes that “all intelligence levels are compatible with all goals”. For example, an intelligence that is too dumb to understand the concept of “algebraic geometry” cannot have a goal that can only be stated in terms of algebraic geometry. I’m pretty sure Bostrom put in a caveat along those lines...
Note: even so, this objection would imply an increasing range of possible goals as intelligence rises, not convergence.
I freely grant that this maximally strengthened version of the orthogonality thesis is false, even if only for the reasons @Steven Byrnes mentioned below. No entity can have a goal that requires more bits to specify than are used in the specification of the entity’s mind (though this implies a widening circle of goals with increasing intelligence, rather than convergence).
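As a rough sketch of why that constraint widens rather than narrows the space of goals (the bit counts here are mine and purely illustrative):

```latex
% If a mind's specification uses at most n bits, the goals it can host are (at
% most) those specifiable in <= n bits. Counting the distinct bit-strings of
% length 1 through n that could encode a goal:
\[
G(n) \;=\; \sum_{k=1}^{n} 2^{k} \;=\; 2^{\,n+1} - 2,
\]
% which grows strictly (indeed exponentially) with n. The constraint prunes
% goals that are too complex for a given mind; it never pushes different minds
% toward the same goal.
```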
I think it might be worth taking a moment more to ask what you mean by the word “intelligence.” How does a mind become more intelligent? Bostrom proposed three main classes.
There is speed superintelligence, which you could mimic by replacing the neurons of a human brain with components that run millions of times faster but with the same initial connectome. It is at the very least non-obvious that a million-fold-faster thinking Hitler, Gandhi, Einstein, a-random-peasant-farmer-from-the-early-bronze-age, and a-random-hunter-gatherer-from-ice-age-Siberia would end up with compatible goal structures as a result of their boosted thinking.
There is collective superintelligence, where individually smart entities work together to form a much smarter whole. At least so far in history, while the behavior of collectives is often hard to predict, their goals have generally been simpler than those of their constituent human minds. I don’t think that’s necessarily a prerequisite for nonhuman collectives, but something has to keep the component goals aligned with each other well enough that the system as a whole retains coherence. Presumably that something is itself a subset of the overall system, which seems to imply that a collective superintelligence’s goals must be comprehensible to and decided by a smaller collective, one that by your argument would seem to be less constrained by the forces pushing superintelligences towards convergence. Maybe this implies a simplification of goals as the system gets smarter? But that competes against the system gradually improving each of its subsystems, and even if it didn’t, it would be a simplification of the subsystems’ goals, and it is again unclear that one very specific goal type is something every possible collective superintelligence would converge on.
Then there’s quality superintelligence, which he admits is a murky category, but which includes: larger working and total memory, better speed of internal communication, more total computational elements, lower computational error rate, better or more senses/sensors, and more efficient algorithms (for example, having multiple powerful ANI subsystems it can call upon). That’s a lot of possible degrees of freedom in system design. Even in the absence of the orthogonality thesis, it is at best very unclear that all superintelligences would tend towards the specific kind of goals you’re highlighting.
In that last sense, you’re making the kind of mistake EY was pointing to in this part of the quantum physics sequence, where you’ve ignored an overwhelming prior against a nice-sounding hypothesis based on essentially zero bits of data. I am very confident that MIRI and the FHI would be thrilled to find strong reasons to think alignment won’t be such a hard problem after all, should you or any of them ever find such reasons.
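For a sense of scale, reusing the toy numbers from my counting sketch above (so treat them as illustrative assumptions only):

```latex
% Bayes in log-odds form: posterior log-odds = prior log-odds + evidence (bits).
% With prior odds of about 1 : 2^M against a claim quantified over all M
% possible mind designs, reaching even odds requires on the order of
\[
\log_2\!\bigl(2^{M}\bigr) \;=\; M \ \text{bits of evidence},
\]
% i.e. about a million bits for the toy M = 10^6, which "essentially zero bits
% of data" cannot begin to supply.
```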