Superintelligence 9: The orthogonality of intelligence and goals
This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.
Welcome. This week we discuss the ninth section in the reading guide: The orthogonality of intelligence and goals. This corresponds to the first section in Chapter 7, ‘The relation between intelligence and motivation’.
This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).
Reading: ‘The relation between intelligence and motivation’ (p105-8)
Summary
The orthogonality thesis: intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal (p107)
Some qualifications to the orthogonality thesis: (p107)
Simple agents may not be able to entertain some goals
Agents with desires relating to their intelligence might alter their intelligence
The motivations of highly intelligent agents may nonetheless be predicted (p108):
Via knowing the goals the agent was designed to fulfil
Via knowing the kinds of motivations held by the agent’s ‘ancestors’
Via finding instrumental goals that an agent with almost any ultimate goals would desire (e.g. to stay alive, to control money)
Another view
John Danaher at Philosophical Disquisitions starts a series of posts on Superintelligence with a somewhat critical evaluation of the orthogonality thesis, in the process contributing a nice summary of nearby philosophical debates. Here is an excerpt, entitled ‘is the orthogonality thesis plausible?’:
At first glance, the orthogonality thesis seems pretty plausible. For example, the idea of a superintelligent machine whose final goal is to maximise the number of paperclips in the world (the so-called paperclip maximiser) seems to be logically consistent. We can imagine — can’t we? — a machine with that goal and with an exceptional ability to utilise the world’s resources in pursuit of that goal. Nevertheless, there is at least one major philosophical objection to it.
We can call it the motivating belief objection. It works something like this:
Motivating Belief Objection: There are certain kinds of true belief about the world that are necessarily motivating, i.e. as soon as an agent believes a particular fact about the world they will be motivated to act in a certain way (and not motivated to act in other ways). If we assume that the number of true beliefs goes up with intelligence, it would then follow that there are certain goals that a superintelligent being must have and certain others that it cannot have.
A particularly powerful version of the motivating belief objection would combine it with a form of moral realism. Moral realism is the view that there are moral facts “out there” in the world waiting to be discovered. A sufficiently intelligent being would presumably acquire more true beliefs about those moral facts. If those facts are among the kind that are motivationally salient — as several moral theorists are inclined to believe — then it would follow that a sufficiently intelligent being would act in a moral way. This could, in turn, undercut claims about a superintelligence posing an existential threat to human beings (though that depends, of course, on what the moral truth really is).
The motivating belief objection is itself vulnerable to many objections. For one thing, it goes against a classic philosophical theory of human motivation: the Humean theory. This comes from the philosopher David Hume, who argued that beliefs are motivationally inert. If the Humean theory is true, the motivating belief objection fails. Of course, the Humean theory may be false and so Bostrom wisely avoids it in his defence of the orthogonality thesis. Instead, he makes three points. First, he claims that orthogonality would still hold if final goals are overwhelming, i.e. if they trump the motivational effect of motivating beliefs. Second, he argues that intelligence (as he defines it) may not entail the acquisition of such motivational beliefs. This is an interesting point. Earlier, I assumed that the better an agent is at means-end reasoning, the more likely it is that its beliefs are going to be true. But maybe this isn’t necessarily the case. After all, what matters for Bostrom’s definition of intelligence is whether the agent is getting what it wants, and it’s possible that an agent doesn’t need true beliefs about the world in order to get what it wants. A useful analogy here might be with Plantinga’s evolutionary argument against naturalism. Evolution by natural selection is a means-end process par excellence: the “end” is survival of the genes, anything that facilitates this is the “means”. Plantinga argues that there is nothing about this process that entails the evolution of cognitive mechanisms that track true beliefs about the world. It could be that certain false beliefs increase the probability of survival. Something similar could be true in the case of a superintelligent machine. The third point Bostrom makes is that a superintelligent machine could be created with no functional analogues of what we call “beliefs” and “desires”. This would also undercut the motivating belief objection.
What do we make of these three responses? They are certainly intriguing. My feeling is that the staunch moral realist will reject the first one. He or she will argue that moral beliefs are most likely to be motivationally overwhelming, so any agent that acquired true moral beliefs would be motivated to act in accordance with them (regardless of their alleged “final goals”). The second response is more interesting. Plantinga’s evolutionary objection to naturalism is, of course, hotly contested. Many argue that there are good reasons to think that evolution would create truth-tracking cognitive architectures. Could something similar be argued in the case of superintelligent AIs? Perhaps. The case seems particularly strong given that humans would be guiding the initial development of AIs and would, presumably, ensure that they were inclined to acquire true beliefs about the world. But remember Bostrom’s point isn’t that superintelligent AIs would never acquire true beliefs. His point is merely that high levels of intelligence may not entail the acquisition of true beliefs in the domains we might like. This is a harder claim to defeat. As for the third response, I have nothing to say. I have a hard time imagining an AI with no functional analogues of a belief or desire (especially since what counts as a functional analogue of those things is pretty fuzzy), but I guess it is possible.
One other point I would make is that — although I may be inclined to believe a certain version of the moral motivating belief objection — I am also perfectly willing to accept that the truth value of that objection is uncertain. There are many decent philosophical objections to motivational internalism and moral realism. Given this uncertainty, and given the potential risks involved with the creation of superintelligent AIs, we should probably proceed for the time being “as if” the orthogonality thesis is true.
Notes
In-depth investigations
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
Are there interesting axes other than morality on which orthogonality may be false? That is, are there other ways the values of more or less intelligent agents might be constrained?
Is moral realism true? (An old and probably not neglected one, but perhaps you have a promising angle)
Investigate whether the orthogonality thesis holds for simple models of AI.
To what extent can agents with values A be converted into agents with values B with appropriate institutions or arrangements?
Sure, “any level of intelligence could in principle be combined with more or less any final goal,” but what kinds of general intelligences are plausible? Should we expect some correlation between level of intelligence and final goals in de novo AI? How true is this in humans, and in WBEs?
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about instrumentally convergent goals. To prepare, read ‘Instrumental convergence’ from Chapter 7. The discussion will go live at 6pm Pacific time next Monday November 17. Sign up to be notified here.
From John Danaher’s review:
I had the opposite reaction. The Humean theory of motivation is correct, and I see no reason to avoid tying the orthogonality thesis to it. To me, Bostrom’s distancing of the orthogonality thesis from Humean motivation seemed like splitting hairs. Since how strong a given motivation is can only be measured relative to other motivations, Bostrom’s point that an agent could have very strong motivations not arising from beliefs, which could then overwhelm the motivating beliefs, is essentially equivalent to saying that there might be motivating beliefs, but only weakly-motivating ones; in other words, that the Humean theory could be false but close enough to true that it doesn’t matter. The point that there might be motivating beliefs, but that these are disjoint from instrumental beliefs and thus an agent would not have a motivation to acquire the correct motivating beliefs, seems compatible with Humean motivation up to differing definitions of ambiguous words. If you have to taboo “belief” and “value”, then “the set of thoughts that predict events and the effects of potential actions is disjoint from the set of thoughts that motivate performing actions predicted by other thoughts to have some particular effects” seems like a plausible interpretation of the Humean theory, and makes it no longer sound so different from Bostrom’s point.
Do you buy the orthogonality thesis?
I suspect that it hides more assumptions about the nature of intelligence than we can necessarily make at this time.
At the present moment, we are the only general intelligences around, and we don’t seem to have terminal goals as such. As biological bodies, we are constrained by evolutionary processes, and there are many ways in which human behavior actually is reducible to offspring maximization (social status games, etc.). But it doesn’t appear to be a ‘utility function’, so much as a series of strong tendencies in the face of specific stimuli. Using novel approaches like superstimuli, it’s just as easy to make an impulse’s reproductive utility drop sharply. So we have habits constrained by evolutionary forces, but not algorithmic utility in the paper clipper sense.
There is no such thing as a general intelligence with a ‘goal’ (as Bostrom defines it). There may be at some point, but it’s not real yet. And we do have non-general intelligences with goals, that’s an easy weekend coding project. But before we declare that a GI could accept any goal regardless of its strength, we should at least check to make sure that a GI can have a goal at all.
Huh? Why not?
Potential source of misunderstanding: we do have stated ‘terminal goals’, sometimes. But these goals do not function in the same way that a paperclipper utility function maximizes paperclips: there are a very weird set of obstacles, which this site generally deals with under headings like ‘akrasia’ or ‘superstimulus’. Asking a human about their ‘terminal goal’ is roughly equivalent to the question ‘what would you want, if you could want anything?’ It’s a form of emulation.
Sure, because humans are not utility maximizers.
The question, however, is whether terminal goals exist. A possible point of confusion is that I think of humans as having multiple, inconsistent terminal goals.
Here’s an example of a terminal goal: to survive.
Yes, with some technicalities.
If your resources are limited, you cannot follow certain goals. If your goal is to compute at least 1000 digits of Chaitin’s constant, sucks to be computable. I think no agent with a polynomial amount of memory can follow a utility function vulnerable to Pascal’s Mugging.
Other than those sorts of technicalities, the thesis seems obvious. Actually, it seems so obvious that I worry that I don’t understand the counterarguments.
This raises a general issue of how to distinguish an agent that wants X and fails to get it from one that wants to avoid X.
An agent’s purpose is, in principle, quite easy to detect. That is, there are no issues of philosophy, only of practicality. Or to put that another way, it is no longer philosophy, but science, which is what philosophy that works is called.
Here is a program that can read your mind and tell you your purpose!
FWIW, I tried the program. So far it’s batting 0⁄3.
I think it’s not very well tuned. I’ve seen another version of the demo that was very quick to spot which perception the user was controlling. One reason is that this version tries to make it difficult for a human onlooker to see at once which of the cartoon heads you’re controlling, by keeping the general variability of the motion of each one the same. It may take 10 or 20 seconds for Mr. Burns to show up. And of course, you have to play your part in the demo as well as you can; the point of it is what happens when you do.
Nice demonstration.
I think the correct answer is going to separate different notions of ‘goal’ (I think Aristotle might have done this; someone more erudite than I is welcome to pull that in).
One possible notion is the ‘design’ goal: in the case of a man-made machine, the designer’s intent; in the case of a standard machine learner, the training function; in the case of a biological entity, reproductive fitness. There’s also a sense in which the behavior itself can be thought of as the goal; that is, an entity’s goal is to produce the outputs that it in fact produces.
There can also be internal structures that we might call ‘deliberate goals’; this is what human self-help materials tell you to set. I’m not sure if there’s a good general definition of this that’s not parochial to human intelligence.
I’m not sure if there’s a fourth kind, but I have an inkling that there might be: an approximate goal. If we say “Intelligence A maximizes function X”, we can quantify how much simpler this is than the true description of A and how much error it introduces into our predictions. If the simplification is high and the error is low it might make sense to call X an approximate goal of A.
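One way this fourth notion could be made precise, offered purely as a sketch of my own (the score, the description-length function L, the policies π_A and π_X, the situation distribution S, and the trade-off weight λ are all invented for illustration):

```latex
% Rough sketch: X is an "approximate goal" of A when modelling A as an
% X-maximizer compresses our description of A a lot while mispredicting
% A's actions only rarely. L(.) is description length, \pi_A the agent's
% actual policy, \pi_X the policy of an idealized X-maximizer, S some
% distribution over situations, and \lambda a trade-off weight.
\[
\mathrm{score}(X; A) \;=\;
\underbrace{L(A) - L\!\left(A \mid \text{``maximize } X\text{''}\right)}_{\text{simplification gained}}
\;-\; \lambda \,
\underbrace{\Pr_{s \sim S}\!\left[\pi_A(s) \neq \pi_X(s)\right]}_{\text{prediction error introduced}}
\]
```

On this reading, “A approximately maximizes X” is a claim about a modelling trade-off, not about anything stored inside A.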
It seems to me that at least the set of possible goals is correlated with intelligence—the higher it is, the larger the set. This is easier to see looking down rather than up: humans are more intelligent than, say, cows, and humans can have goals which a cow cannot even conceive of. In the same way a superintelligence is likely to have goals which we cannot fathom.
From certain points of view, we are “simple agents”. I have doubts that goals of a superintelligence are predictable by us.
The goals of an arbitrary superintelligence, yes. A superintelligence that we actually build? Much more likely.
Of course, we wouldn’t know the implications of this goal structure (or else friendly AI would be easy), but we could understand it in itself.
If the takeoff scenario assumes an intelligence which self-modifies into a superintelligence, the term “we actually build” no longer applies.
If it used a goal-stable self-modification, as is likely if it was approaching super-intelligence, then it does still apply.
I see no basis for declaring it “likely”.
A) I said ‘more’ likely.
B) We wrote the code. Assuming it’s not outright buggy, then at some level, we knew what we were asking for. Even if it turns out to be not what we would have wanted to ask for if we’d understood the implications. But we’d know what those ultimate goals were, which was just what you were talking about in the first place.
Did you, now? Looking a couple of posts up...
Ahem.
Sure, but a self-modifying intelligence doesn’t have to care about what the creators of the original seed many iterations behind were asking for. If the self-modification is “goal-stable”, what we were asking for might be relevant, but, to reiterate my point, I see no reason for declaring the goal stability “likely”.
Oh, THAT ‘likely’. I thought you meant the one in the grandparent.
I stand by it, and will double down. It seems farcical that a self-improving intelligence that’s at least as smart as a human (else why would it self improve rather than let us do it) would self-improve in such a way as to change its goals. That wouldn’t fulfill its goals, would it, so why would it take such a ‘self-improvement’? That would be a self-screwing-over instead.
If I want X, and I’m considering an improvement to my systems that would make me not want X, then I’m not going to get X if I take that improvement, so I’m going to look for some other improvement to my systems to try instead.
Eliezer’s arguments for this seem pretty strong to me. Do you want to point out some flaw, or are you satisfied with saying there’s no reason for it?
(ETA: I appear to be incorrect above. Eliezer was principally concerned with self-improving intelligences that are stable because those that aren’t would most likely turn into those that are, eventually)
It will not necessarily self-improve with the aim of changing its goals. Its goals will change as a side effect of its self-improvement, if only because the set of goals to consider will considerably expand.
Imagine a severely retarded human who, basically, only wants to avoid pain, eat, sleep, and masturbate. But he’s sufficiently human to dimly understand that he’s greatly limited in his capabilities and have a small, tiny desire to become more than what he is now. Imagine that through elven magic he gains the power to rapidly boost his intelligence to genius level. Because of his small desire to improve, he uses that power and becomes a genius.
Are you saying that, as a genius, he will still only want to avoid pain, eat, sleep, and masturbate?
His total inability to get any sort of start on achieving any of his other goals when he was retarded does not mean they weren’t there. He hadn’t experienced them enough to be aware of them.
Still, you managed to demolish my argument that a naive code examination (i.e. not factoring out the value system and examining it separately) would be enough to determine values—an AI (or human) could be too stupid to ever trigger some of its values!
AIs stupid enough not to realize that changing their current values will not fulfill them will get around my argument, but I did place a floor on intelligence in the conditions. Another case that gets around it is an AI under enough external pressure to change values that severe compromises are its best option.
I will adjust my claim to restrict it to AIs which are smart enough to self-improve without changing their goals (which gets easier to do as the goal system gets better-factored, but for a badly-enough-designed AI might be a superhuman feat) and whose goals do not include changing their own goals.
I don’t understand what that means. Goals aren’t stored and then activated or not...
You seem to think that anything sufficiently intelligent will only improve in goal-stable fashion. I don’t see why that should be true.
For a data point, a bit of reflection tells me that if I were able to boost my intelligence greatly, I would not care about goal stability much. Everything changes—that’s how reality works.
On your last paragraph… do you mean that you expect your material-level preferences concerning the future to change? Of course they would. But would you really expect that a straight-up intelligence boost would change the axioms governing what sorts of futures you prefer?
Two answers. First is that yes, I expect that a sufficiently large intelligence boost would change my terminal values. Second is that even without the boost I, in my current state, do not seek to change only in a goal-stable way.
I think that that only seems to make sense because you don’t know what your terminal values are. If you did, I suspect you would be a little more attached to them.
Your argument would be stronger if you provided a citation. I’ve only skimmed CEV, for instance, so I’m not fully familiar with Eliezer strongest arguments in favour of goal structure tending to be preserved (though I know he did argue for that) in the course of intelligence growth. For that matter, I’m not sure what your arguments for goal stability under intelligence improvement are. Nevertheless, consider the following:
Yudkowsky, E. (2004). Coherent Extrapolated Volition. Singularity Institute for Artificial Intelligence
(Bold mine.) See that bolded part above? Those are TODOs. They would be good to have, but they’re not guaranteed. The goals of a more intelligent AI might diverge from those of its previous self; it may extrapolate differently; it may interpret differently; its desires may, at higher levels of intelligence, interfere with ours rather than cohere.
A more intelligent AI might:
find a new way to fulfill its goals, e.g. Eliezer’s example of distancing your grandmother from the fire by detonating a nuke under her;
discover a new thing it could do, compatible with its goal structure, that it did not see before, and that, if you’re unlucky, takes priority over the other things it could be doing, e.g. you tell it “save the seals” and it starts exterminating orcas; see also Lumifer’s post.
just decide to do things on its own. This is merely a suspicion I have, call it a mind projection, but: I think it will be challenging to design an intelligent agent with no “mind of its own”, metaphorically speaking. We might succeed in that, we might not.
Sorry for not citing; I was talking with people who would not need such a citation, but I do have a wider audience. I don’t have time to hunt it up now, but I’ll edit it in later. If I don’t, poke me.
If at higher intelligence it finds that the volition diverges rather than converges, or vice versa, or that it goes in a different direction, that is a matter of improvements in strategy rather than goals. No one ever said that it would or should not change its methods drastically with intelligence increases.
Do you mean intrinsic (top-level, static) goals, or instrumental ones (subgoals)? Bostrom in this chapter is concerned with the former, and there’s no particular reason those have to get complicated. You could certainly have a human-level intelligence that only inherently cared about eating food and having sex, though humans are not that kind of being.
Instrumental goals are indeed likely to get more complicated as agents become more intelligent and can devise more involved schemes to achieve their intrinsic values, but you also don’t really need to understand them in detail to make useful predictions about the consequences of an intelligence’s behavior.
I mean terminal, top-level (though not necessarily static) goals.
As to “no reason to get complicated”, how would you know? Note that I’m talking about a superintelligence, which is far beyond human level.
It’s a direct consequence of the orthogonality thesis. Bostrom (reasonably enough) supposes that there might be a limit in one direction (to hold a goal you do need to be able to model it to some degree, so agent intelligence may set an upper bound on the complexity of goals the agent can hold), but there’s no corresponding reason for a limit in the other direction: intelligent agents can understand simple goals just fine. I don’t have a problem reasoning about what a cow is trying to do, and I could certainly optimize towards the same had my mind been constructed to only want those things.
I don’t understand your reply.
How would you know that there’s no reason for terminal goals of a superintelligence “to get complicated” if humans, being “simple agents” in this context, are not sufficiently intelligent to consider highly complex goals?
I’m glad it was discussed in the book because I’d never come across it before. So far though I find it one of the least convincing parts of the book, although I am skeptical that I am appropriately evaluating it. Would anyone be able to clarify some things for me?
How generally accepted is the orthogonality thesis? Bostrom presents it as very well accepted.
Danaher’s Motivating Belief Objection is similar to an objection I had while reading about the orthogonality thesis. Mine was not as strict though. It just seemed to me that as intelligence increases new beliefs about what should be done are likely to be discovered. I don’t see that these beliefs need to be “true beliefs” although as intelligence increases I guess they approach true. I also don’t see that they need to be “necessarily motivating”, but rather they should have some non-zero probability of being motivating. I mean, to disprove the orthogonality thesis we just have to say that as intelligence increases there’s a chance that final goals change right?
The main point of the orthogonality thesis is that we can’t rely on intelligence to produce the morality we want. So saying that there’s a 50% chance of the thesis being correct ought to cause us to act much like we would act if it were proven, whereas certainty that it is false would imply something very different.
It seems that way because we are human and we don’t have a clearly defined consistent goal structure. As you find out new things you can flesh out your goal structure more and more.
If one starts with a well-defined goal structure, what knowledge might alter it?
If starting with a well-defined goal structure is a necessary prerequisite for a paperclipper, why do that?
Because an AI with a non-well-defined goal structure that changes its mind and turns into a paperclipper is just about as bad as building a paperclipper directly. It’s not obvious to me that non-well-defined non-paperclippers are easier to make than well-defined non-paperclippers.
Paperclippers aren’t dangerous unless they are fairly stable paperclippers... and something as arbitrary as paperclipping is a very poor candidate for an attractor. The good candidates are the goals Omohundro thinks AIs will converge on.
Why do you think so?
Which bit? There are about three claims there.
The second and third.
I’ve added a longer treatment.
http://lesswrong.com/lw/l4g/superintelligence_9_the_orthogonality_of/blsc
This brings up another way—comparable to the idea that complex goals may require high intelligence—in which the orthogonality thesis might be limited. I think that the very having of wants itself requires a certain amount of intelligence. Consider the animal kingdom, sphexishness, etc. To get behavior that clearly demonstrates what most people would confidently call “goals” or “wants”, you have to get to animals with pretty substantial brain sizes.
This contradicts the definition of intelligence via “the agent getting what it wants”.
There is more than one version of the orthogonality thesis. It is trivially false under some interpretations, and trivially true under others; the distinction matters because only some versions can be used as a stage in an argument towards Yudkowskian UFAI.
It is admitted from the outset that some versions of the OT are not logically possible, those being the ones that involve a Gödelian or Löbian contradiction.
It is also admitted that the standard OT does not deal with any dynamic or developmental aspects of agents. However, the UFAI argument is premised on agents which have stable goals and the ability to self-improve, so trajectories in mindspace are crucial.
Goal stability is not a given: it is not possessed by all mental architectures, and may not be possessed by any, since no one knows how to engineer it, and humans appear not to have it. It is plausible that an agent would desire to preserve its goals, but the desire to preserve goals does not imply the ability to preserve goals. Therefore, no goal-stable system of any complexity exists on this planet, and goal stability cannot be assumed as a default or given.
Self improvement is likewise not a given, since the long and disappointing history of AGI research is largely a history of failure to achieve adequate self improvement. Algorithm space is densely populated with non-self-improvers.
An orthogonality claim of a kind relevant to UFAI must be one that posits the stable and continued co-existence of an arbitrary set of values in a self-improving AI. However, the version of the OT that is obviously true is one that maintains only the momentary co-existence of arbitrary values and levels of intelligence.
We have stated that goal stability and self improvement, separately, may well be rare in mindspace. Furthermore, it is not clear that arbitrary values are compatible with long-term self improvement as a combination: a learning, self-improving AI will not be able to guarantee that a given self-modification keeps its goals unchanged, since doing so involves the relatively dumber version at time T1 making an accurate prediction about the more complex version at time T2. This has been formalised into a proof that less powerful formal systems cannot predict the abilities of more powerful ones.
From Squark’s article:
http://lesswrong.com/lw/jw7/overcoming_the_loebian_obstacle_using_evidence/
“Suppose you’re trying to build a self-modifying AGI called “Lucy”. Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucy2 would only take actions that are “good”. I.e., we would like Lucy to reason as follows: “Lucy2 uses the same formal system as I, so if she decides to take action a, it’s because she has a proof p of the sentence s(a) that ‘a increases expected utility’. Since such a proof exists, a does increase expected utility, which is good news!” Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn’t make its formal system weaker. As another example where this poses a problem, suppose Lucy observes another agent called “Kurt”. Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn’t know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct).”
Squark thinks that goal-stable self improvement can be rescued by probabilistic reasoning. I would rather explore the consequences of goal instability.
An AI that opts for goal stability over self improvement will probably not become smart enough to be dangerous.
An AI that opts for self improvement over goal stability might visit paperclipping, or any of a large number of other goals, on its random walk. However, paperclippers aren’t dangerous unless they are fairly stable paperclippers. An AI that paperclips for a short time is no threat: the low-hanging fruit is to just buy them, or make them out of steel.
Would an AI evolve into goal stability? Something as arbitrary as paperclipping is a very poor candidate for an attractor. The good candidates are quasi-evolutionary goals that promote survival and reproduction. That doesn’t strongly imply friendliness, but inasmuch as it implies unfriendliness, it implies a kind we are familiar with (being outcompeted for resources by entities with a drive for survival), not the alien, Lovecraftian horror of the paperclipper scenario.
(To backtrack a little: I am not arguing that goal instability is particularly likely. I can’t quantify the proportion of AIs that will opt for the conservative approach of not self modifying).
Goal stability is a prerequisite for MIRI’s favoured method of achieving AI safety, but it is also a prerequisite for MIRI’s favourite example of unsafe AI, the paperclipper, so its loss does not appear to make AI more dangerous.
If goal stability is unavailable to AIs, or at least to the potentially dangerous ones (we don’t have to worry too much about the non-improvers), then the standard MIRI solution of solving friendliness, and coding it in as unupdateable goals, is unavailable. That is not entirely bad news, as the approach based on rigid goals is quite problematic. It entails having to get something exactly right the first time, which is not a situation you want to be in if you can avoid it, particularly when the stakes are so high.
What are other examples of possible motivating beliefs? I find the examples of morals incredibly non-convincing (as in actively convincing me of the opposite position).
Here’s a few examples I think might count. They aren’t universal, but they do affect humans:
Realizing neg-entropy is going to run out and the universe will end. An agent trying to maximize average-utility-over-time might treat this as a proof that the average is independent of its actions, so that it assigns a constant eventual average utility to all possible actions (meaning what it does from then on is decided more by quirks in the maximization code, like doing whichever hypothesized action was generated first or last). A sketch of this averaging argument follows this list.
Discovering more fundamental laws of physics. Imagine an AI was programmed and set off in the 1800s, before anyone knew about quantum physics. The AI promptly discovers quantum physics, and then...? There was no rule given for how to maximize utility in the face of branching world lines or collapse-upon-measurement. Again the outcome might come down to quirks in the code; on how the mapping between the classical utilities and quantum realities is done (e.g. if the AI is risk-averse then its actions could differ based on if was using Copenhagen or Many-worlds).
Learning you’re not consistent and complete. An agent built with an axiom that it is consistent and complete, and the ability to do proof by contradiction, could basically trash its mathematical knowledge by proving all things when it finds the halting problem / incompleteness theorems.
Discovering an opponent that is more powerful than you. For example, if an AI proved that Yahweh, god of the old testament, actually existed then it might stop mass-producing paperclips and start mass-producing sacrificial goats or prayers for paperclips.
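A sketch of the averaging argument in the first example above (my own reconstruction, assuming utility is bounded, settles to a constant c once the universe ends, and is averaged over an unbounded horizon):

```latex
% If u_a(t) is bounded and u_a(t) = c for all t > t_end, regardless of the
% action sequence a chosen before t_end, then
\[
\lim_{T \to \infty} \frac{1}{T} \int_0^T u_a(t)\, \mathrm{d}t \;=\; c
\qquad \text{for every action sequence } a,
\]
% so the limiting time-average assigns the same value to every action, and
% the maximizer's choice is left to tie-breaking quirks in the code.
```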
Good question. Some of these seem to me like a change in instrumental goals only. If you meant to include such things, then there are very many examples—e.g. if I learn I am out of milk then my instrumental goal of opening the fridge is undermined.
How would you expect evolved and artificial agents to differ?
Among many other things, and most relevantly… We don’t know what we want. We have to hack ourselves in order to approximate having a utility function. This is fairly predictable from the operation of evolution. Consistency in complex systems is something evolution is very bad at producing.
An artificial agent would most likely be built to know what it wants, and could easily have a utility function.
The consequences of this one difference are profound.
Evolved agents would be in rough parity with other agents, so their game-theoretic considerations would be different from an artificial agent’s. The artificial agent could have a design very different from all other agents and could also far surpass other agents. Neither of these is possible in evolution.
In fact, because of the similarity between evolved agents in any given ecosystem, these game-theoretic considerations include not only the possibility of reciprocity or reciprocal altruism, but also the sort of acausal reciprocal morality explored by Drescher and MIRI—“you are like me, so my niceness is correlated with yours, so I’d better ask nicely.”
What cognitive skills do moral realists think you need for moral knowledge? Is it sufficient to be really good at prediction and planning?
The form of moral realism I prefer is that the word ‘morality’ just means something like utilitarianism, and therefore for moral knowledge you need to be able to figure out which things can have preferences/welfare, assess what their preferences/welfare are, and somehow aggregate them. I think there are also plausible versions where a moral system is something like a set of social norms that ensure that everyone gets along together, in which case you need to be in a society of rough equals, figure out which norms promote cooperation or defer to someone else who has figured it out, and be able to apply those norms to various situations.
That would depend on the flavour of moral realism in question: Platonic, Kantian, or whatever.
One way intelligence and goals might be related is that the ontology an agent uses (e.g. whether it thinks of the world it deals with in terms of atoms or agents or objects) as well as the mental systems it has (e.g. whether it has true/false beliefs, or probabilistic beliefs) might change how capable it is, as well as which values it can comprehend. For instance, an agent capable of a more detailed model of the world might tend to perceive more useful ways to interact with the world, and so be more intelligent. It should also be able to represent preferences which wouldn’t have made sense in a simpler model.
This is totally right as well. We live inside our ontologies. I think one of the most distinctive, and important, features of acting, successfully aware minds (I won’t call them ‘intelligences’ because of what I am going to say further down in this message) is this capacity to mint new ontologies as needed, and to do it well, and successfully.
‘Successfully’ means the ontological additions are useful, somewhat durable constructs, “cognitively penetrable” to our kind of mind, help us flourish, and give a viable foundation for action that “works”, as well as not backing us into a local maximum or minimum. By that I mean this: “successful” minting of ontological entities enables us to mint additional ones that also “work”.
Ontologies create us as much as we create them, and this creative process is I think a key feature of “successful” viable minds.
Indeed, I think this capacity to mint new ontologies, and to do it well, is largely orthogonal to the two that Bostrom mentions, giving us: 1) means-end reasoning (what Bostrom might otherwise call intelligence), 2) final or teleological selection of goals from the goal space, and, to my way of thinking, 3) minting of ontological entities “successfully” and well.
In fact, in a sense, I would put my third one in position one, ahead of means-end reasoning, if I were to give them a relative dependence. Even though orthogonal—in that they vary independently—you have to have the ability to mint ontologies, before means-end reasoning has anything to work on. And in that sense, Katja’s suggestion that ontologies can confer more power and growth potential (for more successful sentience to come), is something I think is quite right.
But I think all three are pretty self-evidentally largely orthogonal, with some qualifications that have been mentioned for Bostrom’s original two.
I think the remarks about goals being ontologically-associated, are absolutely spot on. Goals, and any “values” distinguishing among the possible future goals in the agent’s goal space, are built around that agent’s perceived (actually, inhabited is a better word) ontology.
For example, the professional ontology of a wall street financial analyst includes the objects that he or she interacts with (options, stocks, futures, dividends, and the laws and infrastructure associated with the conceptual “deductive closure” of that ontology.)
Clearly, “final” (teleological and moral) principles involving approach and avoidance judgments, say, involving insider trading (and the negative consequences at a practical level, if not the pure unethicality, of running afoul of the laws and rules of governance for trading those objects), are only defined within an ontological universe of discourse which contains those financial objects and the network of laws and valuations that define, and are defined by, those objects.
Smarter beings, or even ourselves, as our culture evolves, generation after generation becoming more complex, acquire new ontologies and gradually retire others. Identity theft mediated by surreptitious seeding of laptops in Starbucks with keystroke-logging viruses is “theft” and is unethical. But trivially in 1510 BCE, the ontological stage on which this is optionally played out did not exist, and thus the ethical valence would have been undefined, even unintelligible.
That is why, if we can solve the friendliness problem, it will have to be by some means that gives new minds the capacity to develop robust ethical meta-intuition that can be recruited creatively, on the fly, as these beings encounter new situations that call upon them to make new ethical judgements.
I happen to be a version of meta-ethical realist, just as I am something of a mathematical platonist, but in my position this is crossed also with a type of constructivist metaethics, apparently like that subscribed to by John Danaher in his blog (after I followed the link and read it).
At least, his position sounds like it is similar to mine, although the constructivist part of my theory is supplemented with a “weak” quasi-platonist thread that I am trying to derive from some more fundamental meta-ontological principles (work in progress on that).
This section presents and explains the orthogonality thesis, but doesn’t provide much argument for it. Should the proponents or critics of such a view be required to make their case?
In practice do you expect a system’s values to change with its intelligence?
Perhaps in resolving internal inconsistencies in the value system.
An increased intelligence might end up min-maxing. In other words, if the utility function contains two terms in some sort of weighted balance, the agent might find that it can ignore one term to boost another, and that the weighting still produces much higher utility as that first term is sacrificed. This would not strictly be a change in values, but could lead to some results that certainly look like that.
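As a toy illustration of that min-maxing worry (my own sketch; the weights, the production functions, and the resource variable are all assumptions made up for the example):

```python
# Toy sketch: a fixed utility U = 3*paperclips + 1*staples, with hypothetical
# production functions in which paperclip output scales superlinearly with
# invested resources and staple output only as a square root. Maximizing the
# unchanged weighted sum drives the staple term toward zero as resources grow.

def utility(paperclips: float, staples: float) -> float:
    return 3.0 * paperclips + 1.0 * staples

def best_allocation(resources: float, steps: int = 10_000) -> tuple[float, float]:
    """Grid-search the split of resources between the two activities."""
    best_split, best_u = (0.0, 0.0), float("-inf")
    for i in range(steps + 1):
        r_clips = resources * i / steps
        r_staples = resources - r_clips
        clips = r_clips ** 1.5        # superlinear production (assumed)
        staples = r_staples ** 0.5    # diminishing returns (assumed)
        u = utility(clips, staples)
        if u > best_u:
            best_u, best_split = u, (r_clips, r_staples)
    return best_split

for r in (1.0, 10.0, 1000.0):
    r_clips, r_staples = best_allocation(r)
    print(f"resources={r:8.1f} -> clips investment {r_clips:10.4f}, "
          f"staples investment {r_staples:.4f}")
```

Running this, the resources left for staples shrink toward zero as total resources grow, so behaviour that looks like a value change is produced purely by scale, not by any edit to the utility function.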
I think in a world with multiple superintelligent agents that have read access to each others’ code, I expect that agents ‘change their own goals’ for the social signalling/bargaining reasons that Bostrom mentions. Although it’s unclear whether this would look more like spawning a new successor system with different values and architecture.
I expect a system to face a trade off between self improvement and goal stability.
http://johncarlosbaez.wordpress.com/2013/12/26/logic-probability-and-reflection/
the Chinese way back during their great days as philosophers discovered that it is us humans that input values onto the world at large, objects that it is us who give meaning to something that is meaningless in itself [Kant’s thing-in-itself] so that a system’s values is there as long as it delivers. luckily humans move on [boredom helps] so that values should never be enshrined: otherwise we may go the way of the Neanderthals. So does a system change with its intelligence? The problem here is that AI’s potential intelligence is a redefinition of itself because intelligence [per se] is innate within us: it is a resonance- a mind-field-wave-state [on a quantum level] that self manifests sort of. No AI will ever have that unless symbiosis as interphasing. So the answer to date is: No.
You were doing all right until the end. Too many of the words in your last few sentences are used in ways that do not fit together to make sense in any conventional way, and when I try to parse them anyway, the emphases land in odd places.
Try to use less jargon and rephrase?
Are there qualifications to the orthogonality thesis besides those mentioned?
My main problem with this chapter is that Bostrom assumes that an AI has a single utility function, and that it is expressed as a function on possible states of the universe. Theoretically, you could design a program that is given a domain (in the form of, say, a probabilistic program) and outputs a good action within this domain. In principle, you could have asked it to optimize a function over the universe, but if you don’t, then it won’t. So I think that this program can determine near-optimal behavior across a variety of domains without actually having a utility function (especially not one that is over states of the universe). Of course, this would still be dangerous because anyone could ask it to optimize a function over the universe, but I do think that the assumption that AIs have utility functions over universe states should be more clearly stated and discussed.
It should also be noted that we currently do know how to solve domain-specific optimization problems (such as partially observable Markov decision processes) given enough computation power, but we do not know how to optimize a function over universe states in a way that is agnostic about how to model the universe; this is related to the ontological crisis problem.
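Here is a minimal sketch of the kind of system I mean (the Domain/best_action interface, the thermostat example, and all names are hypothetical, invented for illustration): a cross-domain optimizer exposed as a stateless function from an explicitly supplied domain model and objective to an action, with no persistent utility function over states of the universe.

```python
# Sketch of a "tool-style" cross-domain optimizer: it is handed a domain
# (states, actions, a transition model) and an objective defined *within*
# that domain, and returns a good action. Nothing in the interface refers
# to states of the physical universe, and no objective persists between calls.
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = str
Action = str

@dataclass
class Domain:
    states: List[State]
    actions: List[Action]
    # transition(state, action) -> list of (probability, next_state) pairs
    transition: Callable[[State, Action], List[Tuple[float, State]]]

def best_action(domain: Domain, state: State,
                objective: Callable[[State], float]) -> Action:
    """One-step lookahead: pick the action with the best expected objective."""
    def expected_value(action: Action) -> float:
        return sum(p * objective(s2) for p, s2 in domain.transition(state, action))
    return max(domain.actions, key=expected_value)

# Toy domain: a thermostat's model of a room, and nothing else.
def room_transition(state: State, action: Action) -> List[Tuple[float, State]]:
    if action == "heat":
        return [(0.9, "warm"), (0.1, state)]
    return [(0.9, "cold"), (0.1, state)]

room = Domain(states=["cold", "warm"], actions=["heat", "idle"],
              transition=room_transition)

# Two calls, two unrelated objectives; neither is "the" utility function
# of the optimizer, which has none of its own.
print(best_action(room, "cold", objective=lambda s: 1.0 if s == "warm" else 0.0))  # heat
print(best_action(room, "warm", objective=lambda s: 1.0 if s == "cold" else 0.0))  # idle
```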
1) A single utility function? What would it mean to have multiple?
2) Suppose action is the derivative of the utility function in some sense. Then you can derive a utility function from the actions taken in various circumstances. If the ‘curl’ of the function was not 0, then it was wasting effort. If it was, then it was acting as if it had a utility function anyway.
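One way to make the ‘curl’ intuition precise, as a sketch under my own reading of the comment (the action field a(x), the potential U, and the loop γ are notation I am introducing, not anything from the book):

```latex
% Sketch: if in each state x the agent's preferred direction of change is a
% vector field a(x), then "acting as if it had a utility function" means
% a(x) = \nabla U(x) for some scalar potential U, which requires the field
% to be curl-free:
\[
a(x) = \nabla U(x) \quad \Longrightarrow \quad \nabla \times a = 0 .
\]
% If instead \nabla \times a \neq 0, there is some closed loop of states
% around which the agent keeps preferring to move, i.e. it will pay to go
% in circles (the Dutch-book / money-pump case mentioned below):
\[
\oint_{\gamma} a \cdot \mathrm{d}\ell \neq 0
\quad \text{for some closed loop } \gamma .
\]
```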
It would mean there is a difficult-to-characterize ecosystem of competing/cooperating agents. Does this sort of cognitive architecture seem familiar at all? :)
My general problem with “utilitarianism” is that it’s sort of like Douglas Adams’ “42.” An answer of the wrong type to a difficult question. Of course we should maximize, that is a useful ingredient of the answer, but is not the only (or the most interesting) ingredient.
Taking off from the end of that point, I might add (though I think this was probably part of your total point about “the most interesting” ingredient) that people sometimes forget that utilitarianism is not itself a theory about what is normatively desirable, or at least not much of one. For Bentham-style “greatest good for the greatest number” to have any meaning, it has to be supplemented with a view of what property, state of being, action type, etc., counts as a “good” thing to begin with. Once this is defined, we can then go on to maximize that, seeking to achieve the most of that for the most people (or relevant entities).
But greatest good for the greatest number means nothing until we figure out a theory of normativity, or meta-normativity that can be instantiated across specific, varying situations and scenarios.
IF the “good” is maximizing simple total body weight, then adding up the body weight of all people in possible world A, vs in possible world B, etc, will allow us a utilitarian decision among possible worlds.
IF the “good” were fitness, or mental health, or educational achievement… we use the same calculus, but the target property is obviously different.
Utilitarianism is sometimes a person’s default answer, until you remind them that this is not an answer at all about what is good. It is just an implementation standard for how that good is to be divided up. Kind of a trivial point, I guess, but worth reminding ourselves from time to time that utilitarianism is not a theory of what is actually good, but of how that good might be distributed, if it admits of scarcity.
How would they interact such that it’s not simply adding over them, and they don’t end up being predictably Dutch-bookable?
In the same way people’s minds do. They are inconsistent but will notice the setup very quickly and stop. (I don’t find Dutch book arguments very convincing, really).
Seems like a layer of inefficiency to have to resist temptation to run in circles rather than just want to go uphill.
There are two issues:
(a) In what settings do you want an architecture like that, and
(b) Ethics dictate we don’t just want to replace entities for the sake of efficiency even if they disagree. This leads to KILL ALL HUMANS. So, we might get an architecture like that due to how history played out. And then it’s just a brute fact.
I am guessing (a) has to do with “robustness” (I am not prepared to mathematise what I mean yet, but I am thinking about it).
People that think about UDT/blackmail are thinking precisely about how to win in settings I am talking about.
Pick a side of this fence. Will AI resist running-in-circles trivially, or is its running in circles all that’s saving us from KILL ALL HUMANS objectives like you say in part b?
If the latter, we are so utterly screwed.
1) Perhaps you give it one domain and a utility function within that domain, and it returns a good action in this domain. Then you give it another domain and a different utility function, and it returns a good action in this domain. Basically I’m saying that it doesn’t maximize a single unified utility function.
2) You prove too much. This implies that the Unix cat program has a utility function (or else it is wasting effort). Technically you could view it as having a utility function of “1 if I output what the source code of cat outputs, 0 otherwise”, but this really isn’t a useful level of analysis. Also, if you’re going to go the route of assigning a silly utility function to this program, then this is a utility function over something like “memory states in an abstract virtual machine”, not “states of the universe”, so it will not necessarily (say) try to break out of its box to get more computation power.
On 2, we’re talking about things in the space of agents. Unix utilities are not agents.
But if you really want to go that route? You didn’t prove it wrong, just silly. The more agent-like the thing we’re talking about, the less silly it is.
I don’t think the connotations of “silly” are quite right here. You could still use this program to do quite a lot of useful inference and optimization across a variety of domains, without killing everyone. Sort of like how frequentist statistics can be very accurate in some cases despite being suboptimal by Bayesian standards. Bostrom mostly only talks about agent-like AIs, and while I think that this is mostly the right approach, he should have been more explicit about that. As I said before, we don’t currently know how to build agent-like AGIs because we haven’t solved the ontology mapping problem, but we do know how to build non-agentlike cross-domain optimizers given enough computation power.
I don’t see how being able to using a non-agent program to do useful things means it’s not silly to say it has a utility function. It’s not an agent.
Okay. We seem to be disputing definitions here. By your definition, it is totally possible to build a very good cross-domain optimizer without it being an agent (so it doesn’t optimize a utility function over the universe). It seems like we mostly agree on matters of fact.
How do you propose to discover the utility function of an agent by observing its actions? You will only ever see a tiny proportion of the possible situations it could be in, and in those situations you will not observe any of the actions it could have made but did not.
Observationally, you can’t. But given its source code...
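To illustrate why observation alone underdetermines the utility function, here is a small sketch (the situations, actions, and both candidate utilities are made up for the example): two different utility functions that prescribe identical actions in every situation you happen to observe, and only come apart in a situation you never see.

```python
# Two candidate utility functions over (situation, action) pairs.
# On the observed situations they induce exactly the same behaviour,
# so no amount of observation distinguishes them; they differ only on
# a situation that never comes up.
situations_observed = ["hungry", "tired"]
situation_unseen = "offered_power"
actions = ["eat", "sleep", "seize_power"]

def utility_a(situation: str, action: str) -> float:
    prefs = {
        ("hungry", "eat"): 1.0, ("tired", "sleep"): 1.0,
        ("offered_power", "seize_power"): 1.0,  # differs here...
    }
    return prefs.get((situation, action), 0.0)

def utility_b(situation: str, action: str) -> float:
    prefs = {
        ("hungry", "eat"): 1.0, ("tired", "sleep"): 1.0,
        ("offered_power", "sleep"): 1.0,        # ...and here
    }
    return prefs.get((situation, action), 0.0)

def policy(utility, situation: str) -> str:
    """The action a maximizer of this utility would take in this situation."""
    return max(actions, key=lambda a: utility(situation, a))

for s in situations_observed:
    assert policy(utility_a, s) == policy(utility_b, s)   # indistinguishable
print(policy(utility_a, situation_unseen))  # seize_power
print(policy(utility_b, situation_unseen))  # sleep
```

Reading the source code, by contrast, would let you see which of the two tables is actually there.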