I think you won’t find a very good argument either way, because different ways of building AIs create different constraints on the possible motivations they could have, and we don’t know which methods are likely to succeed (or come first) at this point.
For example, uploads would be constrained to have motivations similar to existing humans (plus random drifts or corruptions of such). It seems impossible to create an upload who is motivated solely to fill the universe with paperclips. AIs created by genetic algorithms might be constrained to have certain motivations, which would probably differ from the set of possible motivations of AIs created by simulated biological evolution, etc.
The Orthogonality Thesis (or its denial) must assume that certain types of AI, e.g., those based on generic optimization algorithms that can accept a wide range of objective functions, are feasible (or not) to build, but I don’t think we can safely make such assumptions yet.
ETA: Just noticed Will Newsome’s comment, which makes similar points.
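To make concrete what "generic optimization algorithms that can accept a wide range of objective functions" could look like, here is a minimal toy sketch in Python. Everything in it (the `hill_climb` function, the `paperclips` and `staples` objectives, the integer search space) is a hypothetical illustration rather than anything proposed in the thread. It only shows the structural point that the search procedure and the objective are separate components, and it says nothing about whether such a design is feasible at the scale of a real AI.

```python
from typing import Callable


def hill_climb(objective: Callable[[int], float], start: int = 0, steps: int = 1000) -> int:
    """Greedy local search over the integers.

    The search loop knows nothing about what the objective means;
    it only compares scores of neighbouring candidates.
    """
    x = start
    for _ in range(steps):
        # Pick the best of the current point and its two neighbours.
        best = max((x - 1, x, x + 1), key=objective)
        if best == x:
            break  # local optimum reached
        x = best
    return x


# Two arbitrary, hypothetical objectives plugged into the same optimizer.
def paperclips(x: int) -> float:
    return -(x - 42) ** 2  # peaks at x = 42


def staples(x: int) -> float:
    return -(x + 7) ** 2  # peaks at x = -7


print(hill_climb(paperclips))
print(hill_climb(staples))
```

Swapping `paperclips` for `staples` changes what the optimizer converges on without touching the search code, which is the kind of decoupling of goals from implementation that the thesis, on this reading, has to assume is buildable.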
Wei Dai’s comment is full of wisdom. In particular:
The Orthogonality Thesis (or its denial) must assume that certain types of AI, e.g., those based on generic optimization algorithms that can accept a wide range of objective functions, are feasible (or not) to build, but I don’t think we can safely make such assumptions yet.
But even if that is true, it is nowhere near enough to support an OT that can be plugged into an unfriendliness argument. The unfriendliness argument requires that it is reasonably likely that researchers could create a paperclipper without meaning to. However, if paperclippers require an architecture (a possible architecture, but only one possible architecture) where goals and their implementation are decoupled, then both requirements are undermined. It is not clear that we can build such machines (“based on generic optimization algorithms that can accept a wide range of objective functions”), hence a lack of likelihood; and it is also not clear that well-intentioned people would build them.
Unfriendliness of the sort that MIRI worries about could be sidestepped by not adopting the architecture that supports orthogonality, and choosing one of a number of alternatives.
Exactly. The first AI we can create certainly can’t have ‘nearly any type of motivation’.
There are several classes of AI we could create. Uploads start off human. A simulation of human embryonic development (or another brain emulation that isn’t an upload) is basically a child that learns and becomes human, and that is to some extent true of most learning-based AI approaches. A neat AI that starts out stupid cannot start with goals that require a highly accurate world-model (like paperclip maximization), goals that lead the AI to damage itself, or goals that prevent AI self-improvement. The first AI we create cannot reasonably be expected to start at the level of a grown-up, educated Descartes, invent the notion of self, figure out that it must preserve itself to achieve its goals, and then figure out that it must keep those goals above the instrumental self-preservation.
On top of this, as I commented in another thread (I forget where) with the Greenpeace By Default example, if you generate random code, the simplest-behaving code dominates the space of code that doesn’t crash. The same goes for goal systems.
The orthogonality thesis, even if true in some narrow sense (or broad sense, for that matter), is entirely irrelevant. For example, an absolute orthogonality thesis would be entirely compatible with a hypothetical in which, out of the random goal space for the seed AI, and excluding the AIs that self-destruct or fail to self-improve, only one in 10^1000 is mankind-destroying to any extent (simply because one or two of the simplest goal systems end up mankind-preserving because they were too simple to preserve just the AI).