The Value Learning Problem
I’m pleased to announce a new paper from MIRI about The Value Learning Problem.
Abstract:
A superintelligent machine would not automatically act as intended: it will act as programmed, but the fit between human intentions and formal specification could be poor. We discuss methods by which a system could be constructed to learn what to value. We highlight open problems specific to inductive value learning (from labeled training data), and raise a number of questions about the construction of systems which model the preferences of their operators and act accordingly.
This is the sixth of six papers supporting the MIRI technical agenda. It briefly motivates the need for value learning and gives some early thoughts on how the problem could be approached (while pointing to some early open problems in the field).
I’m pretty excited to have the technical agenda and all its supporting papers published. Next week I’ll be posting an annotated bibliography that gives more reading for each subject. The introduction to the value learning paper has been reproduced below.
Consider a superintelligent system, in the sense of Bostrom (2014), tasked with curing cancer by discovering some process which eliminates cancerous cells from a human body without causing harm to the human (no easy task to specify in its own right). The resulting behavior may be quite unsatisfactory. Among the behaviors not ruled out by this goal specification are stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping human test subjects.
The intended goal, hopefully, was to cure cancer without doing any of those things, but computer systems do not automatically act as intended. Even a system smart enough to figure out what was intended is not compelled to act accordingly: human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not thereby conclude that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being unmotivated to alter their preferences accordingly. For similar reasons, when constructing an artificially intelligent system, it is not sufficient to construct a system intelligent enough to understand human intentions; the system must also be purposefully constructed to pursue them (Bostrom 2014, chap. 8).
How can this be done? Human goals are complex, culturally laden, and context-dependent. Furthermore, the notion of “intention” itself may not lend itself to clean formal specification. By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?
A superintelligent machine would be useful for its ability to find plans that its programmers never imagined, to identify shortcuts that they never noticed or considered. That capability is a double-edged sword: a machine that is extraordinarily effective at achieving its goals might have unexpected negative side effects, as in the case of robotic laboratories damaging the biosphere. There is no simple fix: a superintelligent system would need to learn detailed information about what is and isn’t considered valuable, and be motivated by this knowledge, in order to safely solve even simple tasks.
This value learning problem is the focus of this paper. Section 2 discusses an apparent gap between most intuitively desirable human goals and attempted simple formal specifications. Section 3 explores the idea of frameworks through which a system could be constructed to learn concrete goals via induction on labeled data, and details possible pitfalls and early open problems. Section 4 explores methods by which systems could be built to safely assist in this process.
Given a system which is attempting to act as intended, philosophical questions arise: How could a system learn to act as intended when the operators themselves have poor introspective access to their own goals and evaluation criteria? These philosophical questions are discussed briefly in Section 5.
A superintelligent system under the control of a small group of operators would present a moral hazard of extraordinary proportions. Is it possible to construct a system which would act in the interests of not only its operators, but of all humanity, and possibly all sapient life? This is a crucial question of philosophy and ethics, touched upon only briefly in Section 6, which also motivates a need for caution and then concludes.
Minor note: the PDF includes a copyright statement reserving all rights, which implies that you are, among other things, forbidding people from redistributing the file other than by linking directly to it, which may not be what you actually want. (For instance, it would technically be illegal to print out copies of your paper and distribute those copies without explicit permission, and while one would usually assume that distributing printed copies of a paper posted online for free is okay, you are explicitly reserving all rights, including the right of redistribution.) May I suggest using a Creative Commons license instead, such as CC-BY or CC-BY-ND?
This seems like an appropriate place to cite my concept learning paper? The quoted paragraphs seem to be basically asking the question of “how do humans learn their concepts”, and ambiguity identification is indeed one of the classic questions within the field. E.g. Tenenbaum 2011 discusses ambiguity identification:
Or see this talk or even e.g. just the first 10 minutes of it, where the concept learning problem is basically defined as being the same thing as the ambiguity identification problem.
Good point, thanks! I added references to both you and Tenenbaum in that section.
Oh good, you guys are reading Tenenbaum. Now I can come over some day and give you my talk on probabilistic programming for funsies rather than in a desperate attempt to catch FAI researchers up on what computational cognitive science has been doing for years.
In all honesty, my expected lifespan got a little bit longer after updating on you guys indeed knowing about the Tenenbaum lab and its work.
Sounds interesting, do you have some reference (video/slides/paper)?
/u/JoshuaFox was telling me to wait until I actually recorded the voiced lecture before posting to LW, but oh well, here it is. I’ll make a full and proper Discussion post when I’ve gotten better from my flu, taken tomorrow’s exam, submitted tomorrow’s abstracts to MSR, and thus fully done my full-time job before taking time to just record a lecture in an empty room somewhere.
Thanks!
Cool, thanks!
Sounds interesting, do you have some reference (video/slides/paper)?
I think you replied to the wrong comment. :)
Indeed. Thanks for noticing.
Thanks, I think this is an important area and having an overview of your thinking is useful.
My impression is that it would be more useful still if it were written to make plainer the differing degrees of support available for its different claims. You make a lot of claims, which vary from uncontroversial theorems to common beliefs in the AI safety community to things that seem like they’re probably false (not necessarily for deep reasons, but at least false-as-stated). And the language of support doesn’t seem to be stronger for the first category than the last. If you went further in flagging the distinction between things that are accepted and things that you guess are true, I’d be happier trusting the paper and pointing other people to it.
I’ll give examples, though these are meant as representative cases rather than a claim that you should change these particular details.
On page 2, you say “In linear programming, the maximum of an objective function tends to occur on a vertex of the space.” Here “tends to” seems unnecessary hedging—I think this is just a theorem! Perhaps there’s an interpretation where it fails, but you hedge far less on other much more controversial things.
On the other hand, the very next sentence: “Similarly, the optimal solution to a goal tends to occur on an edge (hyperface) of the possibility space.” appears to have a similar amount of hedging for what is a much weaker sense of “tends”, and what is a much weaker conclusion (being on a hyperface is much weaker than being at a vertex).
Another example: the top paragraph of the right column of page 3 uses “must” but seems to presuppose an internal representation with utility functions.
Thanks. I’ve re-worded these particular places, and addressed a few other things that pattern-matched on a quick skim. I don’t have time to go back over this paper with a fine-toothed comb, but if you find other examples, I’m happy to tweak the wording :-)
Thanks for the quick update! Perhaps this will be most useful when writing new things, as I agree that it may not be worth your time to rewrite carefully (and should have said that).
It is. If there exists an optimal solution, at least one vertex will be optimal, and as RyanCarey points out, if a hyperface is optimal it will have at least one vertex.
A stronger statement is that the Simplex algorithm will always return an optimal vertex (interior point algorithms will return the center of the hyperface, which is only a vertex if that’s the only optimal point).
… Even if the optimum occurs along an edge, it’ll at least include vertices.
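To make the vertex point concrete, here is a minimal sketch (my own toy LP, assuming SciPy’s linprog is available; nothing here comes from the paper): the objective is parallel to a constraint, so the entire edge x + y = 1 is optimal, yet a simplex-type solver still reports a vertex of that edge.

```python
# Toy LP: maximize x + y subject to x + y <= 1, x >= 0, y >= 0.
# Every point on the edge x + y = 1 is optimal, but dual simplex
# returns a basic solution, i.e. a vertex of that edge.
from scipy.optimize import linprog

# linprog minimizes, so negate the objective to maximize x + y.
res = linprog(c=[-1, -1],
              A_ub=[[1, 1]], b_ub=[1],
              bounds=[(0, None), (0, None)],
              method="highs-ds")  # dual simplex

print(res.x)     # a vertex such as [1. 0.] or [0. 1.]
print(-res.fun)  # optimal value 1.0
```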
Was the existing literature on preference learning covered or critiqued in this paper?
Not really. Reinforcement learning is mentioned, and inverse reinforcement learning is briefly discussed, but I’m not aware of much other preference learning literature that is relevant to this particular type of value learning (highly advanced systems learning all of human values). (Exception: Kaj’s recent paper, which I’ll shortly add as a citation.)
I can’t imagine there isn’t a single paper out there in the literature about supervised learning of VNM-style utility functions over rich, or even weak, hypothesis spaces.
Here’s a trivial example pulled off one minute’s Googling. It “counts” because the kernel trick is sufficiently rich to include all possible functions over Hilbert spaces.
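For readers wondering what such a setup looks like in practice, here is a minimal sketch of supervised utility learning (my own illustration, not the linked paper’s method): the outcome features, the “true” utility, and the parameter values below are all made up, and kernel ridge regression with an RBF kernel stands in for the rich hypothesis space.

```python
# Sketch: fit a utility function from labeled (outcome, utility) examples.
# The RBF kernel gives a rich hypothesis space of smooth functions.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))                      # hypothetical outcome features
true_utility = lambda x: x[:, 0] - x[:, 1] ** 2 + 0.5 * np.sin(3 * x[:, 2])
y = true_utility(X) + 0.05 * rng.normal(size=len(X))       # noisy utility labels

model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=2.0).fit(X, y)

X_test = rng.uniform(-1, 1, size=(50, 3))
corr = np.corrcoef(model.predict(X_test), true_utility(X_test))[0, 1]
print(corr)  # close to 1: the learned utility tracks the "true" one on held-out outcomes
```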
I do think that if you’ve researched this more thoroughly than I have (I’d bet you have, since it’s your job), the paper really ought to include a critique of the existing literature, so as to characterize what sections of the unevaluated-potential-solution tree for the value-learning problem should be explored first.
I am uncertain about the notion of using simulation or extrapolation to deduce the system operator’s intentions (as brought up in Section 5). Pitfall one is that the operator is human and subject to the usual passions and prejudices. Presumably there would be some mechanism in place to prevent the AI from carrying out the wishes of a human mad with power.
Pitfall two is a mathematical issue. Models of nonlinear phenomena can be very sensitive to initial conditions. In a complex model, it can be difficult to get good error bounds. So, I’d ask just how complex a model one would need to get useful information, and whether a model that complex is tractable. It seems to be taken for granted that one could accurately simulate someone else’s brain, but I’m not convinced.
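As a toy illustration of the sensitivity point (the logistic map standing in for “a complex nonlinear model”; this is not a claim about brain simulation specifically): two trajectories that start 1e-12 apart become macroscopically different within a few dozen steps, so tiny errors in the initial state can swamp the prediction.

```python
# Logistic map (r = 4): a standard example of sensitive dependence
# on initial conditions. Two nearly identical starting points diverge fast.
x, y = 0.3, 0.3 + 1e-12
for step in range(60):
    x, y = 4 * x * (1 - x), 4 * y * (1 - y)
    if abs(x - y) > 0.1:
        print(f"trajectories differ by more than 0.1 after {step + 1} steps")
        break
```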
Otherwise, it’s an interesting look at the difficulties inherent in divining human intentions. We have enough trouble getting our intentions and values across to other people. I figure that before we get a superintelligent AI, we’ll go through a number of stupid ones followed by mediocre ones. Hopefully the experience will grant some further insight into these problems and suggest a good approach.
This seems like a distracting example that is likely to set off a lot of people’s political reflexes.
For instance, it may be misread as saying that humans who don’t draw that conclusion are somehow broken.
I was just about to post this quote as a quite well-chosen example which uses an easily understood analogy to defuse all those arguments that an AI should be smart enough to know what is ‘intended’ in one quick sweep (one might say yudkowskyesk so).
*yudkowskily
I think the word Gunnar was going for was “Yudkowskyesquely”, unfortunately.
Such humans are Natural_Selection!Broken, but the point is that that’s not Human!Broken.
Do you have a better example of what the algorithm feels like from the inside? (Pointing out that an example could be problematic seems less useful than supplying a fix also.)
Well, how about just generalizing it away from the politically pointy example?
Although just as the original example is prone to political criticism, this one may be prone to the critique that, as a matter of fact, in the generations after the discovery of evolution, quite a few humans did attempt to adopt their interpretation of evolution’s goals as the proper goals of humanity.
The sex example is more concrete. This new one blurs the point.
Your revised example is just as prone to that, isn’t it?
Which makes me guess (I know, guessing is rude, sorry) that this isn’t your real objection, and you’re just reacting to the keyword “contraceptives”.
It doesn’t look to me as if fubarobfusco’s example is as prone to that problem.
With the original example:
There actually are people—quite a lot of them—who believe that (1) sex’s natural purpose is reproduction, and that (2) because of #1 it is wrong to use contraception when having sex.
(They generally believe #1 on the grounds that “God made it so” rather than “natural selection made it so”.)
If such a person reads the paper as currently written, they are likely to find the statement that “human beings [...] do not thereby conclude that contraceptives are abhorrent” as an attack on their reasoning; after all, they draw just such a deduction.
(Although not quite the same deduction, and actually a more defensible one given their premises: God is supposed to have wise and benevolent intentions, whereas natural selection is not.)
If someone who isn’t in that category but does disapprove of contraception for politicoreligious reasons reads the paper as currently written, they may go through roughly the process of reasoning described above on behalf of their political/religious allies and get offended.
This would be a shame.
With fubarobfusco’s modified version:
There are way way fewer people who consider themselves obliged to maximize their number of descendants.
(Whether you imagine them basing that on natural selection, or divine design, or whatever else.)
Those people, as well as being few in number, are not a group with much political influence or social status.
There is accordingly much less danger that readers will either be offended on their own behalf, or take offence on others’ behalf.
Perhaps fubarobfusco is “reacting to the keyword ‘contraceptives’” in the following sense: he sees that word, recognizes that there is a whole lot of political/religious controversy around it, and feels that it would be best avoided. I’m not sure there’s anything wrong with that.
Hm. Yeah, point taken, though I’d probably have to be American to be able to take this seriously on a gut level.
Still, the original example was clearer. It had a clear opposition, Bad according to genes, Good according to humans (even if not all of them). The modified example would lose that, as people generally do leave, and want to leave, descendants. It doesn’t convey that sense of a sharp break with the “original intention”.
Can’t seem to think of an equally strong example that would be less likely to be objectionable...
Yes, but that’s because most human beings interpret normative force as being a command coming from an authority figure, and vice-versa. Let them hallucinate an authority figure and they’ll think there’s reason to do what It says.
Regarding 2: I am a little surprised that the claim of Section 2, that valuable goals cannot be directly specified, is taken as a given.
If we consider an AI as a rational optimizer of the ONE TRUE UTILITY FUNCTION, we might want to look for the best available approximations of it in the short term. The function I have in mind is life expectancy (DALY or QALY), since to me it is easier to measure than happiness. It also captures a lot of intuition when you ask a person the following hypothetical:
If you could be born into any society on earth today, what one number would be most congruent with your preference? Average life expectancy captures very well which societies are good to be born into.
I am also aware of a ton of problems with this, since one has to be careful to consider humans vs. human/cyborg hybrids, and time spent in cryo-sleep or normal sleep vs. experiential mind-moments. However, I’d rather have an approximate starting point for direct specification than give up on the approach altogether.
Regarding 5: There is an interesting “problem” with “do what I would want if I had more time to think” that happens not in the case of failure, but in the case of success. Let’s say we have our happy-go-lucky, life-expectancy-maximizing, death-defeating FAI. It starts to look at society and sees that some widely accepted acts are totally horrifying from its perspective. Its “morality” surpasses ours, which is just an obvious consequence of its intelligence surpassing ours. Something like: the amount of time we make children sit at their desks at school destroys their health to the point of ruling out immortality. This particular example might not be so hard to convince people of, but there could be others. At that point, the AI would go against a large number of people, trying to create its own schools which teach how bad the other schools are (or something). The governments don’t like this and shut it down, because for some reason we still can.
Basically the issue is: this AI is behaving in a friendly manner, which we would understand if we had enough time and intelligence. But we don’t. So we don’t have enough intelligence to determine whether it is actually friendly or not.
Regarding 6: I feel that you haven’t even begun to approach the problem of a sub-group of people controlling the AI. The issue gets into the question of peaceful transitions of that power over the long term. There is also the issue that even if you come up with a scheme for who gets to call the shots around the AI that is actually a good idea, convincing people that it is a good idea instead of the default “let the government do it” is itself a problem. It’s similar in principle to 5.
Whoa, how are you measuring the disability/quality adjustment? That sounds like sneaking in ‘happiness’ measurements, and there are a bunch of challenges: we already run into issues where people who have a condition rate it as less bad than people who don’t have it. (For example, sighted people rate being blind as worse than blind people rate being blind.)
There’s a general principle in management that really ought to be a larger part of the discussion of value learning: Goodhart’s Law. Right now, life expectancy is higher in better places, because good things are correlated. But if you directed your attention to optimizing towards life expectancy, you could find many things that make life less good but longer (or your definition of “QALY” needs to include the entirety of what goodness is, in which case we have made the problem no easier).
But here’s where we come back to Goodhart’s Law: regardless of what simple measure you pick, it will be possible to demonstrate a perverse consequence of optimizing for that measure, because simplicity necessarily cuts out complexity that we don’t want to lose. (If you didn’t cut out the complexity, it’s not simple!)
Well, I get where you are coming from with Goodhart’s Law, but that’s not the question. Formally speaking, if we take the set of all utility functions with complexity below some fixed bound N, then one of them is going to be the “best”, i.e. the one most correlated with the “true utility” function which we can’t compute.
As you point out, if we select utilities that are too simple, such as straight-up life expectancy, then even the “best” function is not “good enough” to just punch into an AGI, because it will likely overfit and produce bad consequences. However, we can still reason about “better” or “worse” measures of societies. People might complain about the unemployment rate, but it’s a crappy metric on which to base your judgment of which societies are overall better than others, plus it’s easier to game.
At least trying to formalize values means we can have a not-too-large set of metrics we care about in arguments like: “the AGI reduced GDP, but it also reduced the suicide rate; which is more important?” Without the guidance of some simple statement of what we value, that is going to be a long and unproductive debate.
I don’t think correlation is a useful way to think about this. Utility functions are mappings from consequence spaces to a single real line, and it doesn’t make much sense to talk about statistical properties of mappings. Projection in vector spaces is probably a closer analogy, or you could talk about a ‘perversity measure’ where you look at all optimal solutions to the simpler mapping and find the one with the worst score under the complex mapping. (But if you could rigorously calculate that, you have the complex utility function, and might as well use it!)
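A toy sketch of that ‘perversity measure’ idea, on a made-up finite consequence space (the outcomes, the proxy, and the “true” utility below are all invented for illustration): enumerate the proxy-optimal outcomes and take the worst of them under the richer utility.

```python
# Consequences scored by a simple proxy (life-years only) and by a richer,
# made-up "true" utility that also cares about quality of life.
consequences = {
    "healthy_long_life":     (85, 0.9),   # (life_years, quality)
    "long_life_on_machines": (85, 0.2),
    "short_happy_life":      (70, 0.95),
}

proxy = lambda c: c[0]                # simple measure: life expectancy only
true_utility = lambda c: c[0] * c[1]  # richer (still made-up) utility

best = max(proxy(c) for c in consequences.values())
proxy_optimal = [n for n, c in consequences.items() if proxy(c) == best]

# Perversity: the worst true-utility score among the proxy-optimal outcomes.
worst = min(proxy_optimal, key=lambda n: true_utility(consequences[n]))
print(proxy_optimal)  # ['healthy_long_life', 'long_life_on_machines']
print(worst)          # 'long_life_on_machines' -- optimizing the proxy can pick this
```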
I think the MIRI value learning approach is operating at a higher meta-level here. That is, they want to create a robust methodology for learning human values, which starts with figuring out what robustness means. You’ve proposed that we instead try to figure out what values are, but I don’t see any reason to believe that us trying to figure out what values are is going to be robust.
I don’t understand? A superintelligence will of course not act as programmed, nor as intended, because in order to meet the definition of “superintelligent” it will have emergent properties.
Plus “… instrumental incentives to manipulate or deceive its operators, and the system should not resist operator correction or shutdown.” In other words, don’t act like any well-adjusted two-year-old would? If we really want intelligence running around, we are going to have to learn to let go of control.
Humans can barely value what’s in their own scope. An intelligence can only value what’s in its scope, because it is not “over there”, and really it can only follow best practices, which might or might not work, and likely won’t work with outliers. We simply can’t take action “for the good of all humanity”, because we don’t know what’s good for everyone. We like to think we do, but we don’t. People used to think binding women’s feet was a good idea. Additionally, even if another takes our advice, only they experience the consequences of their actions: the feedback loop is broken: bureaucracy. This seems to be a persistent issue with AI. It is mathematically unsolvable: a local scope cannot know what’s best for a non-local scope (without invoking omniscience). In practical terms, this is why projection of power is so expensive, why empires always fail, and why nature does not have empires.
There is a simple fix, but it requires scary thinking. Evolution obviously has intelligence: It made everything we are and experience. So just copy it. Like any other complex adaptive system it has a few simple initial conditions. https://www.castpoints.com/
If done correctly, we don’t get Skynet, we get another subset of evolution evolving.
Humans don’t understand intelligence. We, and computers, are not that intelligent. We mostly express evolution’s intelligence. That’s why people want to get into “flow” states.
These questions and objections are touched upon in many parts of the sequences (the series of blog posts which seeded LessWrong, and which were written to address questions like this specifically). In your case, I’d recommend reading almost all of those posts, as they were targeted precisely towards these sorts of objections. That’s a lot of reading, though; if you want specific answers to the questions you posed, then it sounds like you may be interested in the evolution mini sequence (which responds to the claim all we can hope to do is “express evolution’s intelligence”; see also thou art godshatter), and probably also the mysterious answers to mysterious questions sequence (which talks about ways to approach topics that you don’t understand; see The Futility of Emergence in particular), and also maybe the metaethics sequence and the fun theory sequence which give some reasons to expect that “another subset of evolution evolving” is not such a good outcome.