Natural Value Learning
The main idea of this post is to draw a distinction between natural and unnatural value learning. The secondary point is that we should be suspicious of unnatural schemes for value learning, though not that we should reject them outright. (I am not fully satisfied with the term, so please do suggest a different one if you have it.)
Epistemic status: I’m not confident that actual value learning schemes will have to be natural, given all the constraints. I’m mainly confident that people should have something like this concept in mind, though I don’t give any arguments for that here.
Natural value learning
By a “value learning process” I mean a process by which machines come to learn and value what humans consider good and bad. I call a value learning process natural to the extent that the role humans play in it is basically similar to the role they play in socializing other humans (mostly children, but also asocial adults) to learn and value what is good and bad. To give a more detailed picture of the distinction I have in mind, here are some illustrations of what I’m pointing at, each row giving a property I associate with natural and unnatural value learning respectively:
| Natural alignment | Unnatural alignment |
| --- | --- |
| Humans play the same role, and do the same kinds of things, within the process of machine value learning as they do when teaching values to children. | Humans in some significant way play a different role, or have to behave differently within the process of machine value learning than they do when teaching values to children. |
| The machine value learning process is adapted to humans. | Humans have to adapt to the machine value learning process. |
| Humans who aren’t habituated to the technical problems of AI or machine value learning would still consider the process as it unfolds to be natural. They can intuitively think of the process in analogy to the process as it has played out in their experience with humans. | Humans who aren’t habituated to the technical problems of AI or machine value learning would perceive the machine value learning process to be unnatural, alien or “computer-like”. If they naively used their intuitions and habits from teaching values to children, they would be confused in important ways. |
Concrete examples
To give a concrete idea of what I would consider natural vs unnatural value learning setups, here are some scenarios in order of ascending naturality:
Disclaimer: I am not in any way proposing any of these scenarios as realistic, good, or something to aim for (in fact I tried to write them somewhat comically so as not to raise that question). They are purely intended to clarify the idea of this post, nothing more.
Not very natural. A superintelligent system is built that has somehow been correctly endowed with the goal of learning and optimizing human values. It somehow has extremely efficient predictive algorithms that are descendants of current ML algorithms. It scans the internet, builds a model of the distribution of human minds, and predicts what humanity would want if it were smarter and so forth. It successfully figures out what is good for humanity, reveals itself, and implements what humanity truly wants.
Somewhat more but still not very natural. A large AI research lab trains a bunch of agents in a simulation, as part of a project branded as developing “aligned general intelligence”. As part of this, the agents have to learn what humans want by performing various tasks in simulated and auto-generated situations that require human feedback to get right. A lot of data is required, so many thousands of humans are hired to sit in front of computer screens, look at the behaviour of the agents, and fill in scores or maybe English sentences to evaluate that behaviour. In order to be time-efficient, they evaluate many such examples in quick succession. Specialized AI systems are used to generate adversarial examples, which end up being weird situations in which the agents make alien decisions the humans wouldn’t have expected. Interpretability tools are used to inspect the cognition of the agents and provide the human evaluators with descriptions of what they were thinking; the evaluators have to score the correctness of the reasoning patterns that led to the agents’ actions based on those descriptions. Somehow, the agents end up internalizing human values while being vastly faster and more capable on real-world tasks.
More natural. A series of general-purpose robots and software assistants are developed that help humans around the home and the office, and that start out (somehow) with natural language abilities, knowledge of intuitive physics, and so forth. They are marketed as “learning the way a human child learns”. At first, these assistants are considered dumb regarding their understanding of human norms/values, but they are very conservative, so they don’t do much harm. Ordinary people use these robots/agents at first for very narrow tasks, but through massively distributed feedback given via the corrections of ordinary human consumers, the agents begin to gain and internalize an intuitive understanding of, at first, quite basic everyday human norms. For example, the robots learn not to clean the plate while the human is still eating from it, because humans in their normal daily lives react negatively when the robots do so. Similarly, they learn not to interrupt an emotional conversation between humans, and so forth. Over time, humans trust these agents with more and more independent decision-making, and thereby the agents receive more general feedback. They eventually (somehow) actually generalize human values broadly, to the point where people trust the moral understanding of these agents as much as or more than they would that of a human.
Disclaimer: I already said this above, but I feel the need to say it again: I don’t consider this last scenario in particular to be realistic or to solve the core AI safety problem. The scenarios are merely meant to illustrate the concept of natural value learning.
Why does this distinction matter?
It seems to me that naturality tracks something about the expected reliability of value learning. Broadly I think we should be more wary of proposals to the extent that they are unnatural.
I won’t try to argue for this point, but broadly it rests on the view that human values are fragile and that we don’t really understand mechanistically how they are represented or how they are learned. E.g. there are some ideas around values, meta-values, and different levels of explicitness, but they seem to me quite far from the kind of solid understanding that would give me confidence in a process that is significantly unnatural.
You seem to be taking for granted: (1) that children learn values at all, (2) that they learn values from their parents. Is that a fair characterization?
For my part, I think (1) is complicated (because I think very important parts of “values” are hardwired by the genes) and that (2) is mostly false (because I suspect children predominantly learn culture from their peers and from older children, and learn it much less from the previous generation).
Not really a fair characterization, I think. (2) mostly seems orthogonal to me (though I probably disagree with your claim, i.e. I think the most important things are passed on from previous generations: children learn that theft is bad, racism is bad, etc., and all of those things are passed on by either parents or other adults. I don’t care much about the distinction between parents vs other adults/society in this case. I know about the research suggesting parenting has little influence, and would prefer not to go into it). (1) seems more relevant. In fact, maybe the main reason to think this post is irrelevant is that the inductive biases of AI systems will be too different from those of humans (although note that genes still allow for a lot of variability in ethics and so on). But I still think it might be a good idea to keep in mind that “information in the brain about values has a higher risk of not getting communicated into the training signal if the method of eliciting that information is not adapted to the way humans normally express it”, if indeed that is true.
If a kid’s parents and teachers and other authority figures tell them that stealing is bad, while everyone in the kid’s peer group (and the next few grades up) steals all the time, never gets in trouble, and talks endlessly about how awesome it is, I think there’s a very good chance that the kid will wind up feeling that stealing is great (you just have to make sure the adults don’t find out).
I speak from personal experience! As a kid, I used the original Napster to illegally download music. My parents categorized illegal music downloads as a type of theft, and therefore terribly unethical. So I did it without telling them. :-P
As a more mundane example, I recall that my parents and everyone in their generation thought that clothes should fit on your body, while my friends in middle school thought that clothes should be much much too large for your body. You can guess what size clothing I desperately wanted to wear.
(I think there’s some variation from kid to kid. Certainly some kids at some ages look up to their parents and feel motivated to be like them.)
In my mind, “different inductive bias” is less important here than “different reward functions”. (Details.) For example, high-functioning psychopaths are perfectly capable of understanding and imitating the cultural norms that they grew up in. They just don’t want to.
I agree that cultures exist and are not identical.
I tend to think that learning and following the norms of a particular culture (further discussion) isn’t too hard a problem for an AGI which is motivated to do so, and hence I think of that motivation as the hard part. By contrast, I think I’m much more open-minded than you to the idea that there might be lots of ways to do the actual cultural learning. For example, the “natural” way for humans to learn Bedouin culture is to grow up as a Bedouin. But I think it’s fair to say that humans can also learn Bedouin culture quite well by growing up in a different culture and then moving into a Bedouin culture as an adult. And I think humans can even (to a lesser-but-still-significant extent) learn Bedouin culture by reading about it and watching YouTube videos etc.
“I tend to think that learning and following the norms of a particular culture (further discussion) isn’t too hard a problem for an AGI which is motivated to do so.” If the AGI is motivated to do so, then the value learning problem is already solved and nothing else matters (in particular, my post becomes irrelevant), because it can then learn the further details in whichever way it wants. We would somehow already have managed to create an agent with an internal objective that points to Bedouin culture (human values), which is the whole problem.
I could say more about the rest of your comment, but first I’m checking: does the above change your model of my model significantly?
Also, regarding “I think I’m much more open-minded than you to …”: to be clear, I’m not at all convinced about this; I’m open to this distinction not mattering at all. I hope I didn’t come across as not open-minded about it.
There’s sort of a use/mention distinction (see the toy sketch after the two bullets) between:
An AGI with the motivation “I want to follow London cultural norms (whatever those are)”, versus
An AGI with the motivation “I want to follow the following 500 rules (avoid public nudity, speak English, don’t lick strangers, …), which by the way comprise London cultural norms as I understand them”
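As a purely schematic toy sketch (my own illustration, not anything from the discussion above; `learn_london_norms` and both agent classes are made up for this example), the two motivations could be represented roughly like this: one keeps a pointer to “whatever the norms turn out to be”, the other freezes a particular learned list as the goal itself.

```python
# Toy sketch of the use/mention distinction above (illustrative only).

from typing import Callable, List


def learn_london_norms() -> List[str]:
    # stand-in for whatever process discovers the actual norms
    return ["avoid public nudity", "speak English", "don't lick strangers"]


class PointerMotivatedAgent:
    """Bullet 1: 'I want to follow London cultural norms (whatever those are).'
    The goal refers to the norms indirectly; learning them is instrumental."""

    def __init__(self, discover_norms: Callable[[], List[str]]) -> None:
        self.discover_norms = discover_norms

    def act(self) -> List[str]:
        # re-derives the norms as its understanding improves
        return self.discover_norms()


class HardcodedMotivatedAgent:
    """Bullet 2: 'I want to follow the following rules, which (I believe)
    comprise London cultural norms.' The learned list *is* the goal."""

    def __init__(self, rules: List[str]) -> None:
        self.rules = list(rules)  # frozen at whatever was learned

    def act(self) -> List[str]:
        return self.rules


if __name__ == "__main__":
    print(PointerMotivatedAgent(learn_london_norms).act())
    print(HardcodedMotivatedAgent(learn_london_norms()).act())
```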
Normally I think of “value learning” (or in this case, “norm learning”) as related to the second bullet point, i.e., the AI watches one or more people and learns their actual preferences and desires. I also had the impression that your OP was along the lines of the second (not the first) bullet point.
If that’s right, and if we figure out how to make an agent with the first-bullet-point motivation, then I wouldn’t say that “the value learning problem is already solved”; instead, I would say that we have made great progress towards safe & beneficial AGI in a way that does not involve “solving value learning”. Rather, the agent will hopefully go ahead and solve value learning all by itself.
(I’m not confident that my definitions here are standard or correct, and I’m certainly oversimplifying in various ways.)
Upvoted for an interesting direction of exploration, but I’m not sure I agree with (or understand, perhaps) the underlying assumption that “natural-feeling” is more likely to be safe or good. This seems a little different from the common naturalistic fallacy (what’s natural is always good, what’s artificial is always bad). It’s more a glossing over of the underlying problem that we have no Safe Natural Intelligence: people are highly variable, and many, many of them are terrifying and horrible.
The thing underlying the intuition is more something like this: we have a method of feedback that humans understand, that works fairly well, and that is adapted to the way values are stored in human brains. If we try to have humans give feedback in ways that are not adapted to that, I expect information to be lost. The fact that it “feels natural” is a proxy for “the method of feedback to machines is adapted to the way humans normally give feedback to other humans”, without which I am at least concerned about information loss (not claiming it’s inevitable). I don’t inherently care about the “feeling” of naturalness.
Regarding there being no Safe Natural Intelligence: I agree that there is no such thing, but I don’t think this is a strong argument against the idea. It doesn’t suddenly make me feel comfortable about “unnatural” (I need a better term) methods for humans to provide feedback to AI agents. The fact that there are bad people doesn’t negate the idea that the only source of information about what is good seems to be stored in brains, and that we need to extract this information in a way that is adapted to how those brains normally express it.
Maybe I should have called it “human-adapted methods of human feedback” or something.
I think it’s a pretty strong argument. There are no humans I’d trust with the massively expanded capabilities that AI will bring, so I have to believe that the training methods for humans are insufficient.
We WANT divergence from “business as usual” human beliefs and actions, and one of the ways to get there is by different specifications and training mechanisms. The hard part is that we don’t yet know how to specify precisely how we want it to differ.
I dunno, I’m not at all sure what “naturalness” is supposed to be doing here below the appearance level—how are the algorithms different?
I haven’t specified anything about the algorithms, but maybe they will somehow have to be different. The point is that the format of the human feedback is different. Really, this post is about the format in which humans provide feedback rather than about the structure of the AI systems (i.e. a difference in the method of generating the training signal rather than a difference in the learning algorithm).
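To make that last distinction concrete, here is a minimal toy sketch (entirely my own illustration; names like `explicit_scoring`, `everyday_reactions` and `ToyPolicy` are made up for this example, not a proposal). The learning update is held fixed, and only the function that turns human feedback into a training signal changes:

```python
# Toy sketch: the same learning rule driven by two different sources of
# training signal; only the feedback format differs (illustrative only).

import random
from typing import Callable, List

Trajectory = List[str]                      # toy stand-in for agent behaviour
FeedbackFn = Callable[[Trajectory], float]  # maps behaviour to a scalar signal


def explicit_scoring(trajectory: Trajectory) -> float:
    """'Unnatural' format: a hired evaluator assigns an explicit numeric score
    to a batch of behaviour, adapting to the machine's interface."""
    # stand-in for a human typing a score into a rating tool
    return random.uniform(0.0, 1.0)


def everyday_reactions(trajectory: Trajectory) -> float:
    """'Natural' format: the signal is inferred from ordinary reactions people
    already produce when socialising other humans (corrections, frowns, praise)."""
    reactions = ["frown" if "clear plate mid-meal" in step else "neutral"
                 for step in trajectory]
    return 1.0 - reactions.count("frown") / max(len(reactions), 1)


class ToyPolicy:
    def __init__(self) -> None:
        self.value_estimate = 0.0

    def update(self, trajectory: Trajectory, reward: float) -> None:
        # the learning rule is identical regardless of where the reward came from
        self.value_estimate += 0.1 * (reward - self.value_estimate)


def train_step(policy: ToyPolicy, trajectory: Trajectory, feedback: FeedbackFn) -> None:
    policy.update(trajectory, feedback(trajectory))


if __name__ == "__main__":
    policy = ToyPolicy()
    behaviour = ["greet human", "clear plate mid-meal"]
    train_step(policy, behaviour, explicit_scoring)    # unnatural signal source
    train_step(policy, behaviour, everyday_reactions)  # natural signal source
    print(policy.value_estimate)
```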