I liked the author’s summary, so I’ve reproduced it with minor stylistic changes:
A low-level model of some humans has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. The embedding, however, is nontrivial. Thus, predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model.
However, this also applies if we replace the phrase “human values” with “trees”. Yet we have a whole class of neural networks in which a simple embedding lights up in response to trees. This is because trees are a natural abstraction, and we should expect to see real systems trained for predictive power use natural abstractions internally.
Human values are a little different from trees: they’re a property of an abstract object (humans) rather than an abstract object themselves. Nonetheless, the author still expects that a broad class of systems trained for predictive power will end up with simple embeddings of human values (~70% chance).
Since an unsupervised learner has a simple embedding of human values, a supervised/reinforcement learner can easily score well on values-proxy-tasks by directly using that model of human values. In other words, the system uses an actual model of human values as a proxy for our proxy of human values (~10-20% chance). This is what is meant by _alignment by default_.
When this works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).
Planned opinion:
I broadly agree with the perspective in this post: in particular, I think we really should have more optimism because of the tendency of neural nets to learn “natural abstractions”. There is structure and regularity in the world and neural nets often capture it (despite being able to memorize random noise); if we train neural nets on a bunch of human-relevant data it really should learn a lot about humans, including what we care about.
However, I am less optimistic than the author about the specific path presented here (and he only assigns 10% chance to it). In particular, while I do think human values are a “real” thing that a neural net will pick up on, I don’t think that they are well-defined enough to align an AI system arbitrarily far into the future: our values do not say what to do in all possible situations; to see this we need only to look at the vast disagreements among moral philosophers (who often focus on esoteric situations). If an AI system were to internalize and optimize our current system of values, as the world changed the AI system would probably become less and less aligned with humans. We could instead talk about an AI system that has internalized both current human values and the process by which they are constructed, but that feels much less like a natural abstraction to me.
I _am_ optimistic about a very similar path, in which instead of training the system to pursue (a proxy for) human values, we train the system to pursue some “meta” specification like “be helpful to the user / humanity” or “do what we want on reflection”. It seems to me that “being helpful” is also a natural abstraction, and it seems more likely that an AI system pursuing this specification would continue to be beneficial as the world (and human values) changed drastically.
Planned summary for the Alignment Newsletter:
Planned opinion:
LGTM