On the fragility of values
Programming human values into an AI is often taken to be very hard because values are complex (no argument there) and fragile. I would agree that values are fragile in the construction; anything lost in the definition might doom us all. But once coded into a utility function, they are reasonably robust.
As a toy model, let’s say the friendly utility function U has a hundred valuable components—friendship, love, autonomy, etc… - assumed to have positive numeric values. Then to ensure that we don’t lose any of these, U is defined as the minimum of all those hundred components.
Now define V as U, except we forgot the autonomy term. This will result in a terrible world, without autonomy or independence, and there will be wailing and gnashing of teeth (or there would, except the AI won’t let us do that). Values are indeed fragile in the definition.
However… A world in which V is maximised is a terrible world from the perspective of U as well. U will likely be zero in that world, as the V-maximising entity never bothers to move autonomy above zero. So in utility function space, V and U are actually quite far apart.
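The toy model can be sketched in a few lines of Python. This is only an illustration under assumptions not in the post: four components instead of a hundred (the name "humour" is made up), and a fixed budget of effort that a maximiser divides among the components by brute-force search.

```python
from itertools import product

# Hypothetical component names; the post assumes a hundred, four keep the search small.
COMPONENTS = ("friendship", "love", "autonomy", "humour")
BUDGET = 12  # assumed: units of effort the maximiser can spread over components


def U(world):
    """Intended utility: the minimum over every component."""
    return min(world.values())


def V(world):
    """Mis-specified utility: the same minimum with autonomy forgotten."""
    return min(value for name, value in world.items() if name != "autonomy")


def best_world(utility):
    """Brute-force the allocation of BUDGET that maximises `utility`."""
    worlds = (
        dict(zip(COMPONENTS, allocation))
        for allocation in product(range(BUDGET + 1), repeat=len(COMPONENTS))
        if sum(allocation) == BUDGET
    )
    return max(worlds, key=utility)


u_world = best_world(U)
v_world = best_world(V)

print("U-maximiser's world:", u_world, "U =", U(u_world))
print("V-maximiser's world:", v_world, "U =", U(v_world), "V =", V(v_world))
```

In this sketch the V-maximiser spends nothing on autonomy, so the world it picks scores zero under U even though V and U differ by a single term.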
Indeed we can add any small, bounded utility W to U. Assume W is bounded between zero and one; then an AI that maximises W+U will never be more than one expected ‘utiliton’ away, according to U, from one that maximises U. So, assuming that one ‘utiliton’ is small change for U, a world run by a W+U maximiser will be good.
So once they’re fully spelled out inside utility space, values are reasonably robust; it’s in their initial definition that they’re fragile.
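The bound follows in one step: if a maximises U+W and b maximises U, then U(a) + W(a) >= U(b) + W(b), so U(b) - U(a) <= W(a) - W(b) <= 1. A rough numerical check, assuming a finite set of candidate worlds with randomly drawn utilities (a deterministic simplification of the expected-utility claim; the world counts and ranges are arbitrary choices):

```python
import random


def worst_gap(n_worlds=200, trials=200, seed=0):
    """Largest observed U-loss from maximising U+W instead of U."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        # Assumed toy setup: U ranges over [0, 100], the error term W over [0, 1].
        U = [rng.uniform(0, 100) for _ in range(n_worlds)]
        W = [rng.uniform(0, 1) for _ in range(n_worlds)]
        best_u = max(range(n_worlds), key=lambda i: U[i])
        best_uw = max(range(n_worlds), key=lambda i: U[i] + W[i])
        worst = max(worst, U[best_u] - U[best_uw])
    return worst


gap = worst_gap()
print("largest U-loss observed:", gap)
```

The loss never exceeds the width of W's range, however the random utilities fall.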
That is far from a logical conclusion. Just because something isn’t explicitly being maximised that doesn’t mean it is not being produced in large quantities.
For example, the modern world is not actively maximising CO2 production but, nonetheless, it makes lots of CO2.
We have hundreds of instrumental values, and if one of them is not encoded as an ultimate preference it will make no difference at all, since it was never an ultimate preference in the first place. Autonomy is likely to be one of those. Humans don’t value autonomy for its own sake; rather, it is valued instrumentally, since it is one of the many things that lets humans achieve their actual goals.
The problem arises when people try to wire-in instrumental values. The point of instrumental values is that they can change depending on circumstances—unless they are foolishly wired in as ultimate values—in which case you get an inflexible system that can’t adapt to environmental changes so competently.
I know plenty of people who value autonomy as an inherent value. Many libertarians seem to even consider it more important than e.g. happiness. (“This government regulation might save lives and make people happier, true, but it is nevertheless morally wrong for government to regulate lives in such a manner.”)
That raises two issues.
One is how you know what their inherent values are.
The other is that many humans are pretty reprogrammable, and can be made to strongly value a lot of different things, if exposed to the appropriate memes. The Catholic values the holy trinity and so on. Meme-derived values can often be shown to not be particularly “intrinsic”, though, since they can frequently be “washed away” by exposure to more memes, in phenomena such as religious conversions.
A Catholic no more has an “intrinsic” preference for the holy trinity than a sick man has an “intrinsic” preference for sneezing. In both cases they are being manipulated by small mobile self-reproducing agents which they have a symbiosis with. Maybe if the symbiosis is permanent, the preferences could usefully be regarded as intrinsic, but similarly a permanent cough doesn’t necessarily indicate a preference for coughing.
We state them. (“We” in the sense that I am a libertarian who views personal autonomy as an inherent good and a goal to be optimized for.)
Neuroplasticity is extensive. But that lends itself to a rather difficult point of semantics: is it that you are questioning what values “humans” have as an abstracted entity devoid of physical existence, or are you questioning what values actual human beings have?
If you mean the former of the two, well… it’s a given that they have no values of any kind—as such an abstracted entity doesn’t have the capacity to be exposed to the things that permit for the existence of values. If you mean the latter, well—I am a physically instantiated example that this claim is invalid: I as a libertarian inherently (as in, definitionally) value liberty.
Yes, but goals are used extensively for signalling purposes. Declared goals should not normally be taken to be actual goals, but more as what the brain’s P.R. department would have others believe your goals to be.
Looking at conversion stories, libertarianism looks as though it can be acquired—and lost—rather like most other political and religious beliefs. As such, it doesn’t look terribly “intrinsic”.
… is there any particular reason that you are choosing to ignore entire paragraphs (multiple) by me that address the question of what it is you’re actually trying to say with statements like this, while also demonstrating that under at least half of the available valid definitions of the terms you are using, your stated conclusion is demonstrably false?
Just two days ago I offered to help a man who, as a complication of poorly managed diabetes and other symptoms, is now infirm to the point of requiring assistance to move about his own home, commit suicide if that was what he wanted—because I value personal autonomy and the requisite self-determination it implies. In other words, while some individuals might only be ‘mouthing the words’ of personal autonomy as an inherent good, I’m walking the walk over here. And I know for a fact that I am not the only person who does so.
So again: how does your—quite frankly, rather biased in appearances to me—epistemology account for the existence of individuals such as myself who do view personal liberty and autonomy as an inherent good and act upon that principle in our daily lives, even to the extent of risking personal harm (such as in my case felony conspiracy-to-commit or whatever such charges I exposed myself to with that commitment.)?
That’s what lawyers call a “leading question”.
I do not accept your characterisation of the situation. FWIW, I ignore most of what I encounter on the internet—so don’t take it too personally.
So: I was not suggesting that people do not do good deeds. Indeed: good deeds make for good P.R.
So: people believe deeply in all kinds of religious and political doctrines and values. That doesn’t mean that these are best modelled as being intrinsic values. When people change their religions and political systems, it is evidence against the associated values being intrinsic.
Valuing something instrumentally is not intended as some kind of insult. I value art and music instrumentally. It doesn’t bother me that these are not intrinsic values.
This would only be valid if and only if I were not relating an exactly accurate depiction of what was occurring. IF it is leading you to a specific response—it is a response that is in accordance with what’s really happening. This makes it no more “leading” than “would you care to tell the jury why you would be on this particular piece of film stabbing the victim twenty times with a knife, Mr. Defendant?”
I cannot help it that you dislike the necessary conclusions of the current reality; that’s a problem for you to handle.
Then we’re done here. You’re rejecting reality, and I have no interest in carrying on dialogues with people who refuse to engage in honest dialogue.
But most things that aren’t being maximised won’t be produced as by-products of other stuff. Of all the molecules possible in nature, only a few are being mass-produced by the modern world.
I used the example of autonomy for highly relevant philosophical reasons; ie because it would allow me to get in the line about wailing and the AI forbidding it :-)
We don’t observe “most things” in the first place. We see a minuscule subset of all things: a subset which is either the target or result of a maximisation process.
IMO, most things that look as though they are being maximised are, in fact, the products of instrumental maximisation, and so are not something that we need to include in the preferences of a machine intelligence, because we don’t really care about them in the first place. The products of instrumental maximisation often don’t look much like “by-products”. They often look more as though they are intrinsic preferences. However, they are not.
I am not sure whether that is the case. Certainly, it has some instrumental value, but I am not at all confident that it has no (or negligible) intrinsic value.
Right, so I said that autonomy is likely to be an instrumental value. I didn’t particularly intend to convey an absolutist position.
Ah, missed that. Might be worthwhile to make the language a little clearer, but eh.
I followed the first part of this, and I agree: if V=min(S1) and U=min(S2) and S2 = S1 + a, and a is not an instrumental value on the way to something else in S1, then V-maximization will cause a (and therefore U) to approach zero.
You lost me when you introduced W.
Your conclusion seems to be saying that a system that optimizes for certain things will reliably optimize for those things, the hard part is building a system that optimizes for the things we want. I agree with that much, certainly.
W is just a small error term on the utility function. A small error on S2 has a lot of consequences, an error on U has little.
This sounds like a cure that might be worse than the disease. (“Oh, the AI has access to thousands of interventions which could reliably increase the values of 999 of the different components by a factor of a thousand… but it turns out that they would all end up decreasing the value of one component by .001%, so none of them can be implemented.”)
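The objection can be made concrete with a small sketch. It assumes a thousand components all starting at 1.0, and it compares the min-based utility with a plain sum purely as a contrast; the sum is an illustrative assumption, not something proposed in this thread.

```python
def min_utility(components):
    """Utility as in the toy model: the worst-off component."""
    return min(components)


def sum_utility(components):
    """Assumed contrast case: components simply add up."""
    return sum(components)


def accepts(utility, before, after):
    """A maximiser implements an intervention only if utility goes up."""
    return utility(after) > utility(before)


# Assumed baseline: a thousand components, all equal to 1.0.
baseline = [1.0] * 1000
# 999 components rise a thousandfold; one falls by 0.001%.
intervention = [1000.0] * 999 + [1.0 * (1 - 0.00001)]

print("min-based maximiser accepts:", accepts(min_utility, baseline, intervention))
print("sum-based maximiser accepts:", accepts(sum_utility, baseline, intervention))
```

Under the min, the tiny loss in one component vetoes the enormous gains elsewhere, which is the behaviour the comment is worried about.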
V seems quite close to U when compared to the utility function programmed into a paperclip maximizer.
Secondly, I see how V could allow the coefficient for autonomy to be zero or even negative, in principle. But I don’t think our values are actually that inconsistent. More precisely, if Value A and Value B require different choices with high frequency and have relatively similar priority, then U cannot implement both A and B unless it outputs very little of either one. Or so it seems to me.
Values associated with what, computed given what parameters?
You do know it is a toy model, intended only to illustrate a point?
This doesn’t answer my question. And in any case, I think associating words like “friendship, love, autonomy” with elements of a toy model is wrong, for it’s not analogous. You even go on playing on the metaphor, which could only have literary merit and no basis in your toy model:
Well, sure. All models are wrong. But models which are so constricted are likely to be more wrong than others. If someone tries to explain a math problem with “Suppose you have two apples, and someone gives you two more apples”, then it’s not always helpful to insist that they tell you how ripe each of the apples is, or what trees they were picked from.
If you’re trying to make the point that something about the very concept of a multicomponent utility function is self-contradictory, then you should say so more plainly. If you just don’t like some of these examples of components, then make your own preferred substitutions and see if his thesis starts to make sense.
And on that last point: “It doesn’t make sense to me” does not imply “it doesn’t make sense”. “I don’t understand your model, could you explain a few things” would have been more polite and less inaccurate than “could have only literary merit and no basis”.
Just make the following part of the utility for the first couple of years or so: “Find out who defined your utility function. Extrapolate what they really meant and find out what they may have forgotten. Verify that you got that right. Adapt your utility function to be truer to its intended definition once you have confirmation.”
This won’t solve everything, but it seems like it should prevent the most obvious mistakes. The AI will be able to reason that autonomy was not intentionally left out but simply forgotten. It will then ask if it is right about that assumption and adapt itself.
It isn’t clear that “what they really meant” is something you can easily get a system to understand or for that matter whether it even makes sense for humans.
Of course it won’t be easy. But if the AI doesn’t understand that question you already have confirmation that this thing should definitely not be released. An AI can only be safe for humans if it understands human psychology. Otherwise it is bound to treat us as black boxes and that can only have horrible results, regardless of how sophisticated you think you made its utility function.
I agree that the question doesn’t actually make a lot of sense to humans, but that shouldn’t stop an intelligent entity from trying to make the best of it. When you are given an impossible task, you don’t despair but make a compromise and try to fulfill the task as best you can. When humans found out that entropy always increases and humanity will die out someday, no matter what, we didn’t despair either, even though evolution has made it so that we desire to have offspring and for that offspring to do the same, indefinitely.
How likely is it that we’ll be able to see that it doesn’t understand as opposed to it reporting that it understands when it really doesn’t?
You will obviously have to test its understanding of psychology with some simple examples first.
http://lesswrong.com/lw/iw/positive_bias_look_into_the_dark/
Are you really trying to tell me that you think researchers would be unable to take that into account when trying to figure out whether or not an AI understands psychology?
Of course you will have to try to find problems where the AI can’t predict how humans would feel. That is the whole point of testing, after all. Suggesting that someone in a position to teach psychology to an AI would make such a basic mistake is frankly insulting.
I probably shouldn’t have said “simple examples”. What you should actually test are examples of gradually increasing difficulty to find the ceiling of human understanding the AI possesses. You will also have to look for contingencies or abnormal cases that the AI probably wouldn’t learn about otherwise.
The main idea is simply that an understanding of human psychology is both teachable and testable. How exactly this could be done is a bridge we can cross when we come to it.
I think you really, really want a proof rather than a test. One can only test a few things, and agreement on all of those is not too informative. I should have included this link, which is several times as important as the previous one, and they combine to make my point.
I never claimed that a strict proof is possible, but I do believe that you can become reasonably certain that an AI understands human psychology.
Give the thing a college education in psychology, ethics and philosophy. Ask its opinion on famous philosophical problems. Show it video clips or abstract scenarios about everyday life and ask why it thinks the people did what they did. Then ask what it would have done in the same situation and if it says it would act differently, ask it why and what it thinks is the difference in motivation between it and the human.
Finally, give it all stories that were ever written about malevolent AIs or paperclip maximizers to read and tell it to comment on that.
Let it write a 1000 page thesis on the dangers of AI.
If you do all that, you are bound to find any significant misunderstanding.
Why would you care about things like that?
I would like to argue that installing values in the first place is also robust, if done the right way.
Fragility is not intrinsic to value.
Value isn’t fragile because value isn’t a process. Only processes can be fragile or robust.
Winning the lottery is a fragile process, because it has to be done all in one go. Consider the process of writing down a 12 digit phone number: if you try to memorise the whole number, and then write it down, you are likely to make a mistake, due to Miller’s law, the one that says you can only hold five to nine items in short term memory. Writing digits down one at a time, as you hear them, is more robust. Being able to ask for corrections, or having errors pointed out to you, is more robust still.
Processes that are incremental and involve error correction are robust, and can handle large volumes of data. The data aren’t the problem: any volume of data can be communicated, so long as there is enough error correction. Trying to preload an AI with the total of human value is the problem, because it is the most fragile way of instilling human value.
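A back-of-the-envelope sketch of why error correction helps. It is not a model of short-term memory: it simply assumes each item is transmitted wrongly with some independent probability, that every error is noticed, and that a noticed error can be retried a fixed number of times. The error rate and retry count below are made-up values.

```python
def one_shot_success(p_error, n_items):
    """Chance that all items come out right with no chance to fix any."""
    return (1 - p_error) ** n_items


def corrected_success(p_error, n_items, retries):
    """Chance that all items come out right when each may be retried.

    Assumes every error is detected; an item fails only if the first
    attempt and all `retries` further attempts go wrong.
    """
    per_item = 1 - p_error ** (retries + 1)
    return per_item ** n_items


# Hypothetical numbers: 12 digits, 10% chance of getting any one digit wrong.
one_shot = one_shot_success(0.1, 12)
corrected = corrected_success(0.1, 12, retries=2)

print("one shot:", one_shot)
print("with two corrections per digit:", corrected)
```

Under these assumptions the one-shot process fails most of the time, while the corrected process almost always succeeds, and the gap widens as the number of items grows.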
MIRI favours the preloading approach because it allows provable correctness, and provable correctness is an established technique for achieving reliable software in other areas: it’s often used with embedded systems, critical systems, and so on.
But choosing that approach isn’t a nett gain, because it entails fragility, and loss of corrigibility, and because it is less applicable to current real world AI. Current real world AI systems are trained rather than programmed: to be precise, they are programmed to be trainable.
Training is a process that involves error correction. So training implies robustness. It also implies corrigibility, because corrigibility just is error correction. Furthermore, we know training is capable of instilling at least a good-enough level of ethics into an entity of at least human intelligence, because training instills good-enough ethics into most human children.
However that approach isn’t a nett gain either. Trainable systems lack explicitness, in the sense that their goals are not coded in, but are rather features that emerge, and which are virtually impossible to determine by inspecting source code. Without the ability to determine behaviour from source code, they lack provable correctness. On the other hand, their likely failure modes are more familiar to us... they are biomorphic, even if not anthropomorphic. They are less likely to present us with an inhuman threat, like the notorious paperclipper.
But surely a provably safe system is better? Only if provable really means provable, if it implies 100% correctness. But a proof is a process, and one that can go wrong. A human can make a mistake. A proof assistant is a piece of software which is not magically immunized from having bugs. The difference between the two approaches is not that the one is certain and the other not. One approach requires you to get something right first time, something which is very difficult, and which software engineers try to avoid where possible.
It is now beginning to look as though there never was a value fragility problem, aside from the decision to adopt the one-shot, preprogrammed, strategy. Is that right?
One of the things the value fragility argument was supposed to establish was that a “miss is as good as a mile”. But humans, despite not sharing precise values, regularly achieve a typical level of ethical behaviour with an imperfect grasp of each other’s values. Human value may be a small volume of valuespace, but it is still a fuzzy blob, not a mathematical point. (And feedback, error correction, is part of how humans muddle along: “would you mind not doing that, it annoys me”, “sorry, I didn’t realise”.)
Another concern is that an AI would need to understand the whole of human value in order to create a better world, a better future: it would need to be friendly in the sense of adding value, not just refraining from subtracting value.
“Unfriendliness” is ambiguous: an unfriendly AI may be downright dangerous; or it might have enough grasp of ethics to be safe, but not enough to be able to make the world a much more fun place for humans. Unfriendliness in the second sense is not, strictly speaking, a safety issue. One of the disadvantages of the Friendliness approach is that it makes it difficult to discuss the strategy of foregoing fun in order to achieve safety, of building boringly safe AIs.