We sometimes phrase AI alignment as the problem of aligning the behavior or values of AI with what humanity wants, or with humanity’s values or intent, but this leaves open the questions of just what it means for an AI to be “aligned” and just what we mean by “wants,” “values,” or “intent.” So when we say we want to build aligned AI, what precisely do we mean to accomplish, beyond vaguely building an AI that does-what-I-mean-not-what-I-say?
[Question] What precisely do we mean by AI alignment?
Where did the rest of this article go? There’s just a paragraph at the start, on both LW2/GW.
It seems that there are two questions here: what “humanity’s goals” means, and what “alignment with those goals” means. An example of an answer to the former is Yudkowsky’s Coherent Extrapolated Volition (in a nutshell, what we’d do if we knew more and thought faster).
Edit: Alternatively, in place of “humanity’s goals”, this might be asking what “goals” itself means.
Edit: This might be too simple (to be original and thus useful), but can’t you just define “alignment” to be the degree to which the utility functions match?
Perhaps this just shifts the problem to “utility function”—it’s not as if humans have an accessible and well-defined utility function in practice.
Would we want to build an AI with a similarly ill-defined utility function, or should we make it more well-defined at the expense of encoding human values worse? Is it practically possible to build an AI whose values perfectly match our current understanding of our values, or will any attempted slightly-incoherent goal system differ enough from our own that it’s better to just build a coherent system?
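To make the utility-function proposal above slightly more concrete, here is a toy sketch (my own illustration, not an established definition) of one thing “the degree to which the utility functions match” could mean: compare the preference orderings the two utility functions induce over a finite set of outcomes, since a utility function is only defined up to positive affine transformation and its raw values can’t be compared directly. The function name, the outcomes, and the numbers are all made up for illustration.

```python
from itertools import combinations

def preference_agreement(u_human, u_ai, outcomes):
    """Fraction of outcome pairs on which the two utility functions
    agree about which outcome is at least as good as the other."""
    pairs = list(combinations(outcomes, 2))
    agreed = sum(
        1 for a, b in pairs
        if (u_human(a) >= u_human(b)) == (u_ai(a) >= u_ai(b))
    )
    return agreed / len(pairs)

# Hypothetical example: three outcomes and two hand-written utility functions.
outcomes = ["status quo", "cure disease", "convert planet to paperclips"]
u_human = {"status quo": 0.0, "cure disease": 1.0, "convert planet to paperclips": -10.0}.get
u_ai    = {"status quo": 0.0, "cure disease": 0.5, "convert planet to paperclips": 2.0}.get

print(preference_agreement(u_human, u_ai, outcomes))  # ~0.33: they agree on only 1 of the 3 pairs
```

Of course this only pushes the difficulty back to the worry in the reply above: it presumes we can write down something like u_human at all, and it says nothing about outcomes we haven’t enumerated.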
In a sentence: If it’s aligned, and things go wrong, maybe you can still turn it off.
I am quite confused on whether this is meant as a joke or not.
I would see that as the definition of control as opposed to alignment.
I’m sure someone else is able to write a more thoughtful/definitive answer, but I’ll try here to point to two key perspectives on the problem that are typically discussed under this name.
The first perspective is what Rohin Shah has called the motivation-competence split of AGI. One person who’s written about this perspective very clearly is Paul Christiano. I believe the general idea is to build a system that is trying to help you, and never to run a computation that is acting adversarially, in any situation. Correspondingly, Paul Christiano’s research often takes the frame of a problem like the Steering Problem, linked below.
Here’s some more writing on this perspective:
Clarifying “AI Alignment” by Paul Christiano
The Steering Problem by Paul Christiano
The second perspective is what Rohin Shah has called the definition-optimization split of AGI. One person who’s written about this perspective very clearly is Nate Soares. A running example is the problem of building an AI that makes as much diamond as possible, given an enormous amount of computational power. There are many AI systems you could build today that would help with this problem, and given that much compute you could likely put them to some use toward the goal of making as much diamond as possible. But there is no single program that will continue to usefully create as much diamond as possible as you give it increasing computational power; at some point it will do something weird and unhelpful (cf. Bostrom’s “Perverse Instantiations”, and Paul Christiano’s “What does the universal prior actually look like?”).
Again, Nate:
The problem of aligning an AI is to create it such that, if the AI you created were to become far more intelligent than any system that has ever existed (including humans), it would continue to do the useful thing you asked it to do, and not something else.
Here’s some more writing on this perspective:
The Rocket Alignment Problem by Eliezer Yudkowsky.
MIRI’s Approach by Nate Soares.
Methodology of unbounded analysis (unfinished) by Eliezer Yudkowsky.
---
Overall, I think neither of these two perspectives is cleanly formalised or well-specified, and that’s a key part of the problem of making sure AGI goes well: being able to state clearly exactly what we’re confused about regarding how to build an AGI is half the battle.
Personally, when I hear ‘AI alignment’ at a party/event/blog, I expect a discussion of AGI design with the following assumption:
Precisely what we’re confused about, and which research will resolve our confusion, is an open question. The word ‘alignment’ captures the spirit of certain key ideas about what problems need solving, but is not a finished problem statement.
Added: Nate Soares has also written more directly on the definition of alignment, and it is not his opinion that the problem is well-specified at present.
I like to consider humanity-AI alignment in light of brain-brain alignment. If the purpose of alignment is self-preservation at the simple scale and fulfilment of individual desires at the complex scale, then brain-brain alignment hasn’t fared well. While we as a species are still around, our track record is severely blemished.
Another scale of alignment to consider is the alignment of a single brain with itself. The brain given to us by natural selection is not perfect, despite being in near-instantaneous communication with itself (as opposed to the limited communication bandwidth between humans). Being a human, you should be familiar with the struggle of aligning the numerous working parts of your brain on a moment-by-moment basis. While we as a species are still around, the rate at which individual humans fail at self-preservation and the attainment of their desires is far from low (suicide, self-sabotage, etc.).
In light of this, I find the idea of designing an intelligent agent that does-what-I-mean-not-what-I-say very strange. Where the goal is self-preservation and attainment of desire for both parties, nothing suggests to me that one human can, firstly, decide very well what they mean, or, secondly, express what they have decided they mean, through verbal or written communication, well enough to align even a fellow human with a high success rate.
I am not suggesting that aligning a generally intelligent agent is impossible, just that at a brief glance it would appear more difficult than aligning two human brains or a single brain with itself. I am also not suggesting that this applies to agents that cannot set their own intention or are designed to have their intention modified by human input. I really have no intuition at all about agents that range between AlphaGo Zero and whatever comes just before humans in their capacity to generalise.
From this philosophical glance, aligning one generally intelligent artificial entity with all of humanity’s values and desires seems very unlikely. True alignment could only come from an intelligent entity with bandwidth and architecture greater than that of the human brain, and even that would still be an alignment with itself.
For me this intuition leads to the conclusion that the crux of the alignment problem is the poor architecture of the human brain and our bandwidth constraints, for even at the easiest point of alignment (single-brain alignment) we see consistent failure. It would seem to me that alignment with artificial entities that come anywhere close to the generalisation capacity of humans should be postponed until we can transition ourselves to a highly manipulable non-biological medium (with greater architecture and bandwidth than the human brain).