Natural Value Learning

The main idea of this post is to draw a distinction between natural and unnatural value learning. The secondary point is that we should be suspicious of unnatural schemes for value learning, though not that we should reject them outright. (I am not fully satisfied with the term, so please do suggest a different one if you have it.)

Epistemic status: I’m not confident that actual value learning schemes will have to be natural, given all the constraints. I’m mainly confident that people should have something like this concept in mind, though I don’t give any arguments for this here.


Natural value learning

By a “value learning process” I mean a process by which machines come to learn and value what humans consider good and bad. I call a value learning process natural to the extent that the role humans play in this process is basically similar to the role they play in the process of socializing other humans (mostly children, also asocial adults) to learn and value what is good and bad. To give a more detailed picture of the distinction I have in mind, here are some illustrations of what I’m pointing at, each giving a property I associate with natural and unnatural value learning respectively:

Natural alignment: Humans play the same role, and do the same kinds of things, within the process of machine value learning as they do when teaching values to children.
Unnatural alignment: Humans in some significant way play a different role, or have to behave differently, within the process of machine value learning than they do when teaching values to children.

Natural alignment: The machine value learning process is adapted to humans.
Unnatural alignment: Humans have to adapt to the machine value learning process.

Natural alignment: Humans who aren’t habituated to the technical problems of AI or machine value learning would still consider the process as it unfolds to be natural. They can intuitively think of the process in analogy to the process as it has played out in their experience with humans.
Unnatural alignment: Humans who aren’t habituated to the technical problems of AI or machine value learning would perceive the machine value learning process to be unnatural, alien or “computer-like”. If they naively used their intuitions and habits from teaching values to children, they would be confused in important ways.

Concrete examples

To give a concrete idea of what I would consider natural vs. unnatural value learning setups, here are some scenarios in order of ascending naturality:

Disclaimer: I am not in any way proposing any of these scenarios as realistic or good or something to aim for (in fact I tried to write them somewhat comically so as not to raise this question). They are purely intended to clarify the idea given in this post, nothing more.

Not very natural. A superintelligent system is built that has somehow been correctly endowed with the goal of learning and optimizing human values. It somehow has extremely efficient predictive algorithms that are descendants of current ML algorithms. It scans the internet, builds a model of the distribution of human minds, and predicts what humanity would want if it were smarter and so forth. It successfully figures out what is good for humanity, reveals itself, and implements what humanity truly wants.

Somewhat more but still not very natural. A large AI research lab trains a bunch of agents in a simulation as a project branded as developing “aligned general intelligence”. As a part of this, the agents have to learn what humans want by performing various tasks in simulated and auto-generated situations that require human feedback to get right. A lot of data is required, so many thousands of humans are hired to sit in front of computer screens, look at the behaviour of the agents, and fill in scores or maybe English sentences to evaluate that behaviour. In order to be time-efficient, they evaluate a lot of such examples in succession. Specialized AI systems are used to generate adversarial examples, which end up being weird situations where the agents make alien decisions the humans wouldn’t have expected. Interpretability tools are used to inspect the cognition of these agents and provide the human evaluators with descriptions of what the agents were thinking. Based on those descriptions, the human evaluators have to score the correctness of the reasoning patterns that led to the agents’ actions. Somehow, the agents end up internalizing human values but are vastly faster and more capable on real-world tasks.

More natural. A series of general-purpose robots and software assistants are developed that help humans around the home and the office, and start out (somehow) with natural language abilities, knowledge of intuitive physics, and so forth. They are marketed as “learning the way a human child learns”. At first, these assistants are considered dumb regarding their understanding of human norms/values, but they are very conservative, so they don’t do much harm. Ordinary people use these robots/agents at first for very narrow tasks, but through massively distributed feedback given via the corrections of ordinary human consumers, the agents begin to acquire and internalize an intuitive understanding of, at first, quite basic everyday human norms. For example, the robots learn not to clean the plate while the human is still eating from it, because humans in their normal daily lives react negatively when the robots do so. Similarly, they learn not to interrupt an emotional conversation between humans, and so forth. Over time, humans trust these agents with more and more independent decision-making, and thereby the agents receive more general feedback. Eventually they somehow come to generalize human values broadly, to the point where people trust the moral understanding of these agents as much as or more than they would that of a human.

Disclaimer: I already said this before but I feel the need to say it again: I don’t consider this last scenario in particular to be realistic or to solve the core AI safety problem. The scenarios are merely meant to illustrate the concept of natural value learning.


Why does this distinction matter?

It seems to me that naturality tracks something about the expected reliability of value learning. Broadly I think we should be more wary of proposals to the extent that they are unnatural.

I won’t try to argue for this point, but broadly it rests on the view that human values are fragile and that we don’t really understand mechanistically how they are represented or how they are learned. E.g. there are some ideas around values, meta-values, and different levels of explicitness, but they seem to me to be quite far from the kind of solid understanding that would give me confidence in a process that is significantly unnatural.