I think type-1 research is most useful, type-3 is second best, and type-2 is least useful. Here’s why I think so.
All of your arguments seem to be about importance. Tractability and neglectedness matter too. (Unless your argument is that type 2 research will be of literally zero use for aligning superintelligent AI.) In particular, you seem dismissive of:
It’s easier to learn all the prerequisites for type-2 research and to actually do it.
Even if you are perfectly altruistic and not limited by money, other people, reputation, etc., this is a legitimate reason to focus on type 2 research from a perspective of minimizing x-risk.
At some point, humanity will create a superintelligent AI, unless we go extinct before. When that happens, we won’t be making important decisions anymore. Instead, the AI will.
The AI will be making important decisions long before it becomes near-omnipotent, as you put it. In particular, it should be doing all the work of aligning future AI systems well before it is near-omnipotent.
(However, you still need a pretty strong guarantee about that AI system, such that when it aligns a future AI system, that future AI system remains aligned with us, so I think overall the intuition is right.)
Human-level AI might be alignable using hacky, empirical testing, engineering, and “good enough” alignment.
All alignment is “good enough” alignment, there is no such thing as “perfect” alignment except in idealized theory. All you get is more or less confidence in the AI system you’re building. (I say this not to be pedantic, because I legitimately don’t know what your threshold is for dismissing “hacky” alignment, or what you mean when you say it won’t work on superintelligent AI systems, or what would count as “not hacky”. I may or may not agree with you depending on the answer.)
Superintelligent AI is an extremely powerful optimization process. Hence, if it’s unaligned even a little, it’ll be catastrophic.
I agree with the intuition overall, but it isn’t a particularly strong intuition. For example, any other human is at least a little unaligned with me, but there are at least some humans where I’d feel okay making them God (in the sense that in expectation I think the world would be better moment-to-moment by my values than it is today moment-to-moment; this could be an existential catastrophe because we don’t control the future, but it isn’t extinction).
I don’t see why it’ll be easier to work on the alignment of superintelligent AI in the future rather than now, so we’d better start now.
We’ll have a better idea of how superintelligent AI will be built in the future.
There are too many confusing things about superintelligent AI alignment [...] Hence, deconfusion is very important.
I agree with the intuition in general.
Note though it’s quite possible that some things we’re confused about are also simply irrelevant to the thing we care about. (I would claim this of embedded agency with not much confidence.)
All alignment is “good enough” alignment, there is no such thing as “perfect” alignment except in idealized theory.
I strongly disagree with this. It may be true in some technical sense—e.g. we can’t be 100% certain there’s not a bug in our code—but I do think there exists a sharp, qualitative distinction between systems which are optimizing-for-the-thing-we-call-human-values and systems which aren’t doing that. Most likely underlying generator of disagreement: I think there’s a natural, precise notion of what we mean when we point to “human values”, in much the same way that there’s a natural, precise notion of what we mean when we point to a flower. There’s still multiple steps between pointing to flowers and pointing to human values, but one feature I expect to carry over is that it’s not an underspecified or fully-subjective notion—there is a well-defined sense in which the physical system of molecules comprising a human brain “wants things”, and a well-defined notion of what that system wants.
I broadly agree with this perspective (and in fact it’s one of my reasons for optimism about AI alignment).
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence. (I might be strawmanning this argument, I don’t really understand it.)
You’re talking about the internal structure of the AI system (is the AI system actually in fact optimizing for “human values”, or something else), where I do expect a sharper, qualitative distinction. I’m claiming that our ability to get on the right side of that distinction is relatively smooth across the methods that we could use.
Part of my optimism about AI alignment (relative to LW) comes from thinking that since there (probably) is a relatively sharp qualitative divide between “aligned computation” and “unaligned computation”, the “engineering approach” has more of a shot at working. (This isn’t a big factor in my optimism though.)
I almost ended up writing a whole post more or less psychologizing this point recently.
Quotes from the probably-never-to-be-published post, which I might as well fillet out to present here:
Last year I was thinking about how humans refer to things. For example, when I say “human values,” it seems like I am pointing to something (some thing), as surely as if I was using my finger to point at some material object. And so if we want an AI to learn about human values, it sure would be nice if it could follow that pointer out to the thing-being-pointed-to.
At the time, it wasn’t at all obvious to me that I had already stepped off the path, but I had. Rather than trying to understand this thing humans do—refer to things—in terms of the map-making problem humans actually face [From earlier: The physical world is really complicated. Humans get some information about the world via the senses, and then we model it so that we can make sense of our senses, predict the world, and make plans. This can be a really useful starting point for explanations of confusing phenomena.], I had framed the problem with an analogy to physical objects. As if the analogy was clean, and as if objects were natural (dare I say directly-perceived) building blocks of the world.
It’s a very tricky mistake to avoid, this thing of thinking that reality will respect your labels. I wanted to understand the “human values” label, and so I mistakenly tried to look for the process by which we associate that label with some natural object, or even natural pattern, out in the world that corresponds to “human values.” But reality doesn’t have objects for things just because we have labels for them. This is the fallacy of essentialism—the notion that if we have a word like “roundness,” then there must be some thing out in the world that is roundness. The roundness-essence, if you will.
EDIT: To forestall the obvious objection to the last sentence that roundness is a pattern, and surely with a little elbow grease you could write down something about spherical symmetry that is equivalent to roundness-essence, the most relevant point to human values is that even if we have a label for a pattern, that pattern still doesn’t have to exist. The label-making process of the human brain does not first require comprehension of some referent of the label.
Rather than finding a theory in which we can find a precise notion of human values, we need a theory in which we can do okay despite not having a precise notion of human values (yes, I agree that sounds paradoxical). And by the naturalization thesis, this sort of reasoning plausibly also applies to an aligned AI.
This isn’t “rah rah type 2 research, boo type 1 research.” What I mean is that I think the indeterminacy of human values connects the two together, like the critical point of water allows for a continuous transition between liquid and gas.
Counterargument: suppose a group of humans split off from the rest of humanity long enough ago that they have no significant shared cultural background. They develop language independently. Assuming they live in an area with trees, do they still develop a word for “tree”, recognize individual trees as objects, and generally have a notion of tree which matches our notion? I think the answer is pretty clearly “yes”—in part because the number of examples a baby needs to learn what a word means is not nearly large enough to narrow down the massive object space unless they already have some latent classification for those objects.
It’s true that the label-making process of the human brain does not require a referent in order to generate a word, but most words have them anyway, including (but not limited to) any word whose meaning can be reasonably-reliably communicated to someone who’s never heard it before using fewer than a million examples.
One human can have a word for a pattern which doesn’t exist. Two humans can use that word. But if you put the two humans in separate, identical rooms and ask them both to point to the <word>, and they consistently point to the same thing, then that’s pretty clear evidence that the pattern exists in the world. “Human values” are a bit too abstract for that exact test, but I think we have more than enough analogous evidence to conclude that they do exist.
Okay, let’s go with “tree.” Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that’s planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?
How likely do you think it is that there’s some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?
Trees obviously exist. And I agree with you that a clever clusterer will probably find some cluster that more or less overlaps with “tree” (though who knows, there’s probably a culture out there that has a word for woody-stemmed plants but not for trees specifically, or no word for trees but words for each of the three different kinds of trees in their environment specifically).
But an AI that’s trying to find the “one true definition of trees” will quickly run into problems. There is no thing, nothing with the properties intuitive to an object or substance, that defines trees. And if you make an AI that goes out and looks at the world and comes up with its own clusterings and then tries to learn what “tree” means from relatively few examples, this is precisely a ‘good-enough’ hack of the type 2 variety.
Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that’s planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?
How likely do you think it is that there’s some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?
Wrong questions. A cluster does not need to have sharp classification boundaries in order for the cluster itself to be precisely defined, and it’s the precise definition of the cluster itself that matters.
An even-more-simplified example: suppose we have a cluster in some dataset which we model as normal with mean 3.55 and variance 2.08. There may be points on the edge of the cluster which are ambiguously/uncertainly classified, and that’s fine. The precision of the cluster itself is not about sharp classification, it’s about precise estimation of the parameters (i.e. mean 3.55 and variance 2.08, plus however we’re quantifying normality). If our algorithm is “working correctly”, then there is an actual pattern out in the world corresponding to our cluster, and that pattern is the thing we want to point to—not any particular point within the pattern.
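To make the distinction concrete, here is a minimal sketch of the cluster example. The mean 3.55 and variance 2.08 come from the comment above; the second cluster, the sample size, and the equal-prior membership rule are assumptions added purely for illustration. It estimates the cluster's parameters from samples and then asks how confidently two individual points can be assigned to it.

```python
import math
import random
import statistics


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def membership_probability(x, cluster, other):
    """Probability that x came from `cluster` rather than `other`, assuming equal priors.

    Each cluster is a (mean, variance) pair.
    """
    p_cluster = normal_pdf(x, *cluster)
    p_other = normal_pdf(x, *other)
    return p_cluster / (p_cluster + p_other)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_VARIANCE = 3.55, 2.08  # the cluster from the comment
OTHER_CLUSTER = (9.0, 2.08)            # hypothetical neighbor, not from the comment

# Estimate the cluster's parameters from many observations.
samples = [rng.gauss(TRUE_MEAN, math.sqrt(TRUE_VARIANCE)) for _ in range(20_000)]
estimated_cluster = (statistics.fmean(samples), statistics.variance(samples))

# One point at the cluster's center, one halfway to the neighboring cluster.
central_point = estimated_cluster[0]
boundary_point = (estimated_cluster[0] + OTHER_CLUSTER[0]) / 2

print("estimated (mean, variance):", estimated_cluster)
print("membership at the center:",
      membership_probability(central_point, estimated_cluster, OTHER_CLUSTER))
print("membership at the boundary:",
      membership_probability(boundary_point, estimated_cluster, OTHER_CLUSTER))
```

Under these assumptions the parameter estimates land close to the true values, while the point halfway between the clusters stays roughly a coin flip. The cluster can be pinned down tightly even though some individual points cannot be classified.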
Back to trees. The one true definition of trees does not unambiguously classify all objects as tree or not-tree; that is not the sense in which it is precisely defined. Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
On to human values. (I’ll just talk about one human at the moment, because cross-human disagreements are orthogonal to the point here.) The answer to “what does this human want?” does not always need to be unambiguous—indeed it should not always be unambiguous, because that is not the actual nature of human values. Rather, I have some precisely-defined generative model for observations-involving-my-values. Assuming my algorithm is “working correctly”, there is an actual pattern out in the world corresponding to that cluster, and that pattern is the thing we want to point to. That’s not just “good enough”; pointing to that pattern (assuming it exists) is perfect alignment. That’s what “mission accomplished” looks like. It’s the thing we’re modelling when we model our own desires.
Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
This is a modeling assumption about humans that doesn’t have to be true. You could just as well say that in the two different worlds, I’m actually referring to two related but distinct concepts. (Or you could model me as picking things to say about trees in a way that doesn’t talk about the properties of some “concept of trees” at all.)
The root problem is that “pointing to a real pattern” is not something humans can do in a vacuum. “I’m a great communicator, but people just don’t understand me,” as the joke goes. As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this: at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime, where different ontologies can easily lead to different categorizations.
This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions. For instance, if I can learn general facts about trees by examining a small number of trees, then I know that “tree” corresponds to a real pattern out in the world. This extends to logs: to the extent that a log is a tree, I can learn general facts about trees by examining logs (and vice versa), and verify what I’ve learned by looking at more trees/logs.
Pointing to a real pattern is indeed not something humans can do in a vacuum. Fortunately we do not live in a vacuum; we live in a universe with lots of real patterns in it. Different algorithms will indeed result in somewhat different classifications/patterns learned at any given time, but we can still expect a fairly large class of algorithms to converge to the same classifications/patterns over time, precisely because they are learning from the same universe. A perfectly-aligned AI will not have a perfect model of human values at any given time, but it can update in the right direction—in some sense it’s the update-procedure which is “aligned” with the true pattern, not the model itself which is “aligned”.
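As a toy illustration of the convergence claim (and nothing more), the sketch below has two different "learners", a sample mean and a sample median, estimate the center of the same distribution. The distribution reuses the illustrative cluster from earlier in the thread; the choice of learners, the sample sizes, and the symmetry of the distribution (which is what makes the two targets coincide) are all assumptions of the sketch, and it says nothing about whether learning human values behaves this way.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable
CENTER, SPREAD = 3.55, math.sqrt(2.08)  # reuses the thread's illustrative cluster


def learn_center(n):
    """Return (mean, median) of n fresh observations drawn from the same world."""
    data = [rng.gauss(CENTER, SPREAD) for _ in range(n)]
    return statistics.fmean(data), statistics.median(data)


# The two learners use different procedures but see the same kind of data.
for n in (20, 2_000, 200_000):
    by_mean, by_median = learn_center(n)
    print(n, "disagreement between learners:", abs(by_mean - by_median))

final_mean, final_median = learn_center(200_000)
```

With little data the two procedures can disagree noticeably; with a lot of data both end up near the same value, because both are tracking the same underlying pattern rather than each other.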
That’s why we often talk about perfectly “pointing” to human values, rather than building a perfect model of human values. It’s not about having a perfect model at any given time, it’s about “having a pointer” to the real-world pattern of human values, allowing us to do things like update our model in the right direction.
As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this: at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime...
I definitely do not imagine that some random architecture would get it right with realistic amounts of data. Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem—it’s exactly the sort of thing that e.g. my work on abstraction will hopefully help with.
(Also, matching the patterns to some collection of data intended to point to the pattern is not the only way of doing things, or even a very good way given the difficulty of verification, though for purposes of this discussion it’s a fine approach to examine.)
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions.
I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
We either need to ground “performs well” in some theory of humans as approximate agents that doesn’t need to know about their values, or we need some theory that avoids the chicken-and-egg problem altogether by simultaneously learning human models and the standards to judge them by.
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I meant that the algorithm learns patterns which are actually present in the world, much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Probably not the best choice of words on my part; sorry for causing a tangent.
I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
I’m sure it could, but I am claiming that such a thing would have worse predictive power. Roughly speaking: if there’s one notion of tree that includes saplings, and another that includes logs, then the model misses the ability to learn facts about saplings by examining logs. Conversely, if it doesn’t miss those sorts of things, then it isn’t actually behaving like a word with multiple possible referents. (I don’t actually think it’s that simple—the referent of “tree” is more than just a comparison class—but it hopefully suffices to make the point.)
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I meant that the algorithm learns patterns which are actually present in the world, much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Ah well. I’ll probably argue with you more about this elsewhere, then :)
This is very well-said, but I still want to dispute the possibility of “perfect alignment”. In your clustering analogy: the very existence of clusters presupposes definitions of entities-that-correspond-to-points, dimensions-of-the-space-of-points, and measurements-of-given-points-in-given-dimensions. All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
In other words, while I agree with (and appreciate your clear expression of) your main point that it’s possible to have a well-defined category without being able to do perfect categorization, I dispute the idea that it is possible even in theory to have a perfectly-defined one.
All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
This is a critical point; it’s the reason we want to point to the pattern in the territory rather than to a human’s model itself. It may be that the human is using something analogous to a normal-mixture-model, which won’t perfectly match reality. But in order for that to actually be predictive, it has to find some real pattern in the world (which may not be perfectly normal, etc). The goal is to point to that real pattern, not to the human’s approximate representation of that pattern.
Now, two natural (and illustrative) objections to this:
If the human’s representation is an approximation, then there may not be a unique pattern to which their notions correspond; the “corresponding pattern” may be underdefined.
If we’re trying to align an AI to a human, then presumably we want the AI to use the human’s own idea of the human’s values, not some “idealized” version.
The answer to both of these is the same: we humans often update our own notion of what our values are, in response to new information. The reality-pattern we want to point to is the pattern toward which we are updating; it’s the thing our learning-algorithm is learning about. I think this is what coherent extrapolated volition is trying to get at: it asks “what would we want if we knew more, thought faster, …”. Assuming that the human-label-algorithm is working correctly, and continues working correctly, those are exactly the sort of conditions generally associated with convergence of the human’s model to the true reality-pattern.
Here are my responses to your comments, sorted by how interesting they are to me, descending. Also, thanks for your input!
Non-omnipotent AI aligning omnipotent AI
The AI will be making important decisions long before it becomes near-omnipotent, as you put it. In particular, it should be doing all the work of aligning future AI systems well before it is near-omnipotent.
Please elaborate. I can imagine multiple versions of what you have in mind. Is one of the following scenarios close to what you mean?
Scientists use AI-based theorem provers to prove theorems about AI alignment.
There’s an AI with which you can have conversations. It tries to come up with new mathematical definitions and theorems related to what you’re discussing.
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does research into AI instead of humans.
I think the requirements for how well the non-omnipotent AI in the 3rd scenario should be aligned are basically the same as for a near-omnipotent AI. If the non-omnipotent AI in the 3rd scenario is very misaligned, but it’s not catastrophic because the AI is not smart enough, the near-omnipotent AI it’ll create will also be misaligned, and that’ll be catastrophic.
Embedded agency
Note though it’s quite possible that some things we’re confused about are also simply irrelevant to the thing we care about. (I would claim this of embedded agency with not much confidence.)
So, you think embedded agency research is unimportant for AI alignment. In contrast, I think it’s very important. I worry about it mainly for 3 reasons. Suppose we don’t figure out embedded agency. Then:
An AI won’t be able to safely self-modify.
An AI won’t be able to comprehend that it can be killed or damaged or modified by others.
I am not sure about this one. I am very interested to know if this is not the case. I think that if we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency. In other words, the set of AIs built without taking embedded agency into account is closed under the operation of an AI building a new AI. [Upd: comments under this comment mostly refute this]
I am even less sure about this item, but maybe such an AI will be too dogmatic (as in dogmatic prior) about how the world might work, because it is sure that it can’t be killed or damaged or modified. Due to this, if the laws of physics turn out to be weird (e.g. we live in a multiverse, or we’re in a simulation), the AI might fail to understand that and thus fail to turn the whole world into hedonium (or whatever it is that we would want it to do with the world).
If an AI built without taking embedded agency into account meets very smart aliens someday, it might fuck up due to its inability to imagine that someone can predict its actions.
Usefulness of type-2 research for aligning superintelligent AI
Unless your argument is that type 2 research will be of literally zero use for aligning superintelligent AI.
I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.
As I see it, the mechanisms by which type-2 research helps align superintelligent AI are:
It may produce useful empirical data which’ll help us make type-1 theoretical insights.
Thinking about type-2 research contains a small portion of type-1 thinking.
For example, if someone works on making contemporary neural networks robust to out-of-distribution examples, and they do that mainly by experimenting, their experimental data might provide insights about the nature of robustness in the abstract, and also, surely some portion of their thinking will be dedicated to the theory of robustness.
My views on tractability and neglectedness
Tractability and neglectedness matter too.
Alright, I agree with you about tractability.
About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because:
It’s practical: you can sell it to companies which want to make robots or unbreakable face detection or whatever.
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does research into AI instead of humans.
I agree that you still need a strong guarantee of alignment in this scenario (as I mentioned in my original comment).
In contrast, I think it’s very important. I worry about it mainly for 3 reasons. Suppose we don’t figure out embedded agency. Then [...]
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.
Cool, that’s more concrete, thanks. (I disagree, but there isn’t really an obvious point to argue on, the cruxes are in the other points.)
About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because
Agreed. Tbc, I wasn’t arguing it was neglected, just that you seemed to be ignoring tractability and neglectedness, which seemed like a mistake.
I see MIRI’s research on agent foundations (including embedded agency) as something like “We want to understand ${an aspect of how agents should work}, so let’s take the simplest case first and see if we understand everything about it. The simplest case is the case when the agent is nearly omniscient and knows all logical consequences. Hmm, we can’t figure out this simplest case yet; it breaks down if the conditions are sufficiently weird”. Since it turns out that it’s difficult to understand embedded agency even for such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
Hmm, very good argument. Since I think humans have an imperfect understanding of embedded agency, thanks to you I no longer think that “If we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency”, since that would imply we can’t get the “lived happily ever after” at all. We can ignore the case where we can’t get the “lived happily ever after” at all, because in that case nothing matters anyway.
I suppose we could run evolutionary search or something, selecting for AIs which can understand the typical cases of being modified by themselves or by the environment, which we include in the training dataset. I wonder how we can make them understand very atypical cases of modification. A near-omnipotent AI will be a very atypical case.
Can we come up with a learning procedure to have the AI learn embedded agency on its own? It seems plausible to me that we will need to understand embedded agency better to do this, but I don’t really know.
Btw, in another comment, you say
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence.
I basically subscribe to the argument that nothing except proofs will work in the case of superintelligent agentic AI.
Re: embedded agency, while these are all potentially relevant points (especially self-modification), I don’t see any of them as the main reason to study embedded agents from an alignment standpoint. I see the main purpose of embedded agency research as talking about humans, not designing AIs—in particular, in order to point to human values, we need a coherent notion of what it means for an agenty system embedded in its environment (i.e. a human) to want things. As the linked post discusses, a lot of the issues with modelling humans as utility-maximizers or using proxies for our goals stem directly from more general embedded agency issues.
All of your arguments seem to be about importance. Tractability and neglectedness matter too. (Unless your argument is that type 2 research will be of literally zero use for aligning superintelligent AI.) In particular, you seem dismissive of:
Even if you are perfectly altruistic and not limited by money, other people, reputation, etc., this is a legitimate reason to focus on type 2 research from a perspective of minimizing x-risk.
The AI will be making important decisions long before it becomes near-omnipotent, as you put it. In particular, it should be doing all the work of aligning future AI systems well before it is near-omnipotent.
(However, you still need a pretty strong guarantee about that AI system, such that when it aligns a future AI system, that future AI system remains aligned with us, so I think overall the intuition is right.)
All alignment is “good enough” alignment; there is no such thing as “perfect” alignment except in idealized theory. All you get is more or less confidence in the AI system you’re building. (I say this not to be pedantic, but because I legitimately don’t know what your threshold is for dismissing “hacky” alignment, or what you mean when you say it won’t work on superintelligent AI systems, or what would count as “not hacky”. I may or may not agree with you depending on the answer.)
I agree with the intuition overall, but it isn’t a particularly strong intuition. For example, any other human is at least a little unaligned with me, but there are at least some humans where I’d feel okay making them God (in the sense that in expectation I think the world would be better moment-to-moment by my values than it is today moment-to-moment; this could be an existential catastrophe because we don’t control the future, but it isn’t extinction).
We’ll have a better idea of how superintelligent AI will be built in the future.
I agree with the intuition in general.
Note, though, that it’s quite possible that some things we’re confused about are also simply irrelevant to the thing we care about. (I would claim this of embedded agency, though without much confidence.)
I strongly disagree with this. It may be true in some technical sense—e.g. we can’t be 100% certain there’s not a bug in our code—but I do think there exists a sharp, qualitative distinction between systems which are optimizing-for-the-thing-we-call-human-values and systems which aren’t doing that. Most likely underlying generator of disagreement: I think there’s a natural, precise notion of what we mean when we point to “human values”, in much the same way that there’s a natural, precise notion of what we mean when we point to a flower. There’s still multiple steps between pointing to flowers and pointing to human values, but one feature I expect to carry over is that it’s not an underspecified or fully-subjective notion—there is a well-defined sense in which the physical system of molecules comprising a human brain “wants things”, and a well-defined notion of what that system wants.
I broadly agree with this perspective (and in fact it’s one of my reasons for optimism about AI alignment).
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence. (I might be strawmanning this argument, I don’t really understand it.)
You’re talking about the internal structure of the AI system (is the AI system actually in fact optimizing for “human values”, or something else), where I do expect a sharper, qualitative distinction. I’m claiming that our ability to get on the right side of that distinction is relatively smooth across the methods that we could use.
Part of my optimism about AI alignment (relative to LW) comes from thinking that since there (probably) is a relatively sharp qualitative divide between “aligned computation” and “unaligned computation”, the “engineering approach” has more of a shot at working. (This isn’t a big factor in my optimism though.)
I almost ended up writing a whole post more or less psychologizing this point recently.
Quotes from the probably-never-to-be-published post, which I might as well fillet out to present here:
EDIT: To forestall the obvious objection to the last sentence that roundness is a pattern, and surely with a little elbow grease you could write down something about spherical symmetry that is equivalent to roundness-essence, the most relevant point to human values is that even if we have a label for a pattern, that pattern still doesn’t have to exist. The label-making process of the human brain does not first require comprehension of some referent of the label.
Rather than finding a theory in which we can find a precise notion of human values, we need a theory in which we can do okay despite not having a precise notion of human values (yes, I agree that sounds paradoxical). And by the naturalization thesis, this sort of reasoning plausibly also applies to an aligned AI.
This isn’t “rah rah type 2 research, boo type 1 research.” What I mean is that I think the indeterminacy of human values connects the two together, like the critical point of water allows for a continuous transition between liquid and gas.
Counterargument: suppose a group of humans split off from the rest of humanity long enough ago that they have no significant shared cultural background. They develop language independently. Assuming they live in an area with trees, do they still develop a word for “tree”, recognize individual trees as objects, and generally have a notion of tree which matches our notion? I think the answer is pretty clearly “yes”—in part because the number of examples a baby needs to learn what a word means is not nearly large enough to narrow down the massive object space unless they already have some latent classification for those objects.
It’s true that the label-making process of the human brain does not require a referent in order to generate a word, but most words have them anyway, including (but not limited to) any word whose meaning can be reasonably-reliably communicated to someone who’s never heard it before using less than a million examples.
One human can have a word for a pattern which doesn’t exist. Two humans can use that word. But if you put the two humans in separate, identical rooms and ask them both to point to the <word>, and they consistently point to the same thing, then that’s pretty clear evidence that the pattern exists in the world. “Human values” are a bit too abstract for that exact test, but I think we have more than enough analogous evidence to conclude that they do exist.
Okay, let’s go with “tree.” Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that’s planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?
How likely do you think it is that there’s some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?
Trees obviously exist. And I agree with you that a clever clusterer will probably find some cluster that more or less overlaps with “tree” (though who knows, there’s probably a culture out there that has a word for woody-stemmed plants but not for trees specifically, or no word for trees but words for each of the three different kinds of trees in their environment specifically).
But an AI that’s trying to find the “one true definition of trees” will quickly run into problems. There is no thing, nothing with the properties intuitive to an object or substance, that defines trees. And if you make an AI that goes out and looks at the world and comes up with its own clusterings and then tries to learn what “tree” means from relatively few examples, this is precisely a ‘good-enough’ hack of the type 2 variety.
Wrong questions. A cluster does not need to have sharp classification boundaries in order for the cluster itself to be precisely defined, and it’s the precise definition of the cluster itself that matters.
An even-more-simplified example: suppose we have a cluster in some dataset which we model as normal with mean 3.55 and variance 2.08. There may be points on the edge of the cluster which are ambiguously/uncertainly classified, and that’s fine. The precision of the cluster itself is not about sharp classification, it’s about precise estimation of the parameters (i.e. mean 3.55 and variance 2.08, plus however we’re quantifying normality). If our algorithm is “working correctly”, then there is an actual pattern out in the world corresponding to our cluster, and that pattern is the thing we want to point to—not any particular point within the pattern.
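To make the distinction concrete, here is a minimal Python sketch (an illustration added for this thread, not the commenter's own code). It assumes the cluster really is normal with mean 3.55 and variance 2.08, and it invents a second, equally weighted cluster centered at 8.0 purely so that some points have ambiguous membership. The function names are hypothetical. The point it tries to show: the parameter estimates tighten with more data, while points near the edge stay ambiguous no matter how well the parameters are known.

```python
import math
import random
import statistics

TRUE_MEAN, TRUE_VAR = 3.55, 2.08  # cluster parameters quoted in the comment


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def estimate_cluster(n, seed=0):
    """Draw n points from the cluster and return (sample mean, sample variance)."""
    rng = random.Random(seed)
    xs = [rng.gauss(TRUE_MEAN, math.sqrt(TRUE_VAR)) for _ in range(n)]
    return statistics.fmean(xs), statistics.variance(xs)


def membership(x, other_mean=8.0, other_var=2.08):
    """Probability that x belongs to the cluster rather than a hypothetical
    second, equally weighted cluster (other_mean and other_var are made up)."""
    a = normal_pdf(x, TRUE_MEAN, TRUE_VAR)
    b = normal_pdf(x, other_mean, other_var)
    return a / (a + b)


if __name__ == "__main__":
    # More data sharpens the estimate of the cluster's parameters...
    for n in (100, 10_000):
        mean, var = estimate_cluster(n)
        print(f"n={n}: mean={mean:.3f}, variance={var:.3f}")
    # ...but points between the clusters remain ambiguously classified.
    for x in (3.55, 5.0, 5.775):
        print(f"x={x}: P(in cluster)={membership(x):.3f}")
```

In this toy setup, a point halfway between the two centers gets a membership probability of about one half however much data is collected, yet the cluster's mean and variance can still be pinned down as tightly as the data allows.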
Back to trees. The one true definition of trees does not unambiguously classify all objects as tree or not-tree; that is not the sense in which it is precisely defined. Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
On to human values. (I’ll just talk about one human at the moment, because cross-human disagreements are orthogonal to the point here.) The answer to “what does this human want?” does not always need to be unambiguous—indeed it should not always be unambiguous, because that is not the actual nature of human values. Rather, I have some precisely-defined generative model for observations-involving-my-values. Assuming my algorithm is “working correctly”, there is an actual pattern out in the world corresponding to that cluster, and that pattern is the thing we want to point to. That’s not just “good enough”; pointing to that pattern (assuming it exists) is perfect alignment. That’s what “mission accomplished” looks like. It’s the thing we’re modelling when we model our own desires.
This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
This is a modeling assumption about humans that doesn’t have to be true. You could just as well say that in the two different worlds, I’m actually referring to two related but distinct concepts. (Or you could model me as picking things to say about trees in a way that doesn’t talk about the properties of some “concept of trees” at all.)
The root problem is that “pointing to a real pattern” is not something humans can do in a vacuum. “I’m a great communicator, but people just don’t understand me,” as the joke goes. As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this: at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime, where different ontologies can easily lead to different categorizations.
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions. For instance, if I can learn general facts about trees by examining a small number of trees, then I know that “tree” corresponds to a real pattern out in the world. This extends to logs: to the extent that a log is a tree, I can learn general facts about trees by examining logs (and vice versa), and verify what I’ve learned by looking at more trees/logs.
Pointing to a real pattern is indeed not something humans can do in a vacuum. Fortunately we do not live in a vacuum; we live in a universe with lots of real patterns in it. Different algorithms will indeed result in somewhat different classifications/patterns learned at any given time, but we can still expect a fairly large class of algorithms to converge to the same classifications/patterns over time, precisely because they are learning from the same universe. A perfectly-aligned AI will not have a perfect model of human values at any given time, but it can update in the right direction—in some sense it’s the update-procedure which is “aligned” with the true pattern, not the model itself which is “aligned”.
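A toy version of the convergence claim, as a hedged sketch rather than anything the comment itself specifies: two learners start with very different normal priors about the mean of the same real pattern, see the same observations, and apply the standard conjugate update for a normal mean with known noise variance. The priors, the true values (3.55 and 2.08, borrowed from the earlier cluster example), and the function names are all assumptions made for illustration. It says nothing about the harder limited-data regime raised earlier in the thread.

```python
import math
import random


def posterior_mean(prior_mean, prior_var, data, noise_var):
    """Posterior mean of a normal mean under a normal prior and known noise variance."""
    precision = 1.0 / prior_var + len(data) / noise_var
    return (prior_mean / prior_var + sum(data) / noise_var) / precision


def gap_between_learners(n, seed=0):
    """Return both learners' estimates and the gap between them after n shared observations."""
    rng = random.Random(seed)
    data = [rng.gauss(3.55, math.sqrt(2.08)) for _ in range(n)]
    a = posterior_mean(-10.0, 1.0, data, 2.08)  # learner A: hypothetical prior
    b = posterior_mean(20.0, 4.0, data, 2.08)   # learner B: a very different prior
    return a, b, abs(a - b)


if __name__ == "__main__":
    for n in (0, 10, 1_000, 10_000):
        a, b, gap = gap_between_learners(n)
        print(f"n={n}: A={a:.3f}, B={b:.3f}, gap={gap:.4f}")
```

With no data the two estimates are simply the two prior means; as shared observations accumulate, both are pulled toward the same underlying value and the gap shrinks, which is the sense in which the update procedure, not any single snapshot of the model, tracks the pattern.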
That’s why we often talk about perfectly “pointing” to human values, rather than building a perfect model of human values. It’s not about having a perfect model at any given time, it’s about “having a pointer” to the real-world pattern of human values, allowing us to do things like update our model in the right direction.
I definitely do not imagine that some random architecture would get it right with realistic amounts of data. Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem—it’s exactly the sort of thing that e.g. my work on abstraction will hopefully help with.
(Also, matching the patterns to some collection of data intended to point to the pattern is not the only way of doing things, or even a very good way given the difficulty of verification, though for purposes of this discussion it’s a fine approach to examine.)
I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
We either need to ground “performs well” in some theory of humans as approximate agents that doesn’t need to know about their values, or we need some theory that avoids the chicken-and-egg problem altogether by simultaneously learning human models and the standards to judge them by.
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world, much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Probably not the best choice of words on my part; sorry for causing a tangent.
I’m sure it could, but I am claiming that such a thing would have worse predictive power. Roughly speaking: if there’s one notion of tree that includes saplings, and another that includes logs, then the model misses the ability to learn facts about saplings by examining logs. Conversely, if it doesn’t miss those sorts of things, then it isn’t actually behaving like a word with multiple possible referents. (I don’t actually think it’s that simple—the referent of “tree” is more than just a comparison class—but it hopefully suffices to make the point.)
Ah well. I’ll probably argue with you more about this elsewhere, then :)
This is very well-said, but I still want to dispute the possibility of “perfect alignment”. In your clustering analogy: the very existence of clusters presupposes definitions of entities-that-correspond-to-points, dimensions-of-the-space-of-points, and measurements-of-given-points-in-given-dimensions. All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
In other words, while I agree with (and appreciate your clear expression of) your main point that it’s possible to have a well-defined category without being able to do perfect categorization, I dispute the idea that it is possible even in theory to have a perfectly-defined one.
This is a critical point; it’s the reason we want to point to the pattern in the territory rather than to a human’s model itself. It may be that the human is using something analogous to a normal-mixture-model, which won’t perfectly match reality. But in order for that to actually be predictive, it has to find some real pattern in the world (which may not be perfectly normal, etc). The goal is to point to that real pattern, not to the human’s approximate representation of that pattern.
Now, two natural (and illustrative) objections to this:
If the human’s representation is an approximation, then there may not be a unique pattern to which their notions correspond; the “corresponding pattern” may be underdefined.
If we’re trying to align an AI to a human, then presumably we want the AI to use the human’s own idea of the human’s values, not some “idealized” version.
The answer to both of these is the same: we humans often update our own notion of what our values are, in response to new information. The reality-pattern we want to point to is the pattern toward which we are updating; it’s the thing our learning-algorithm is learning about. I think this is what coherent extrapolated volition is trying to get at: it asks “what would we want if we knew more, thought faster, …”. Assuming that the human-label-algorithm is working correctly, and continues working correctly, those are exactly the sort of conditions generally associated with convergence of the human’s model to the true reality-pattern.
Here are my responses to your comments, sorted by how interesting they are to me, descending. Also, thanks for your input!
Non-omnipotent AI aligning omnipotent AI
Please elaborate. I can imagine multiple versions of what you’re imagining. Is one of the following scenarios close to what you mean?
Scientists use AI-based theorem provers to prove theorems about AI alignment.
There’s an AI, with which you can have conversations. It tries to come up with new mathematical definitions and theorems related to what you’re discussing.
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does research into AI instead of humans.
I think the requirements for how well the non-omnipotent AI in the 3rd scenario should be aligned are basically the same as for a near-omnipotent AI. If the non-omnipotent AI in the 3rd scenario is very misaligned, but it’s not catastrophic because the AI is not smart enough, the near-omnipotent AI it’ll create will also be misaligned, and that’ll be catastrophic.
Embedded agency
So, you think embedded agency research is unimportant for AI alignment. In contrast, I think it’s very important. I worry about it mainly for 3 reasons. Suppose we don’t figure out embedded agency. Then
An AI won’t be able to safely self-modify
An AI won’t be able to comprehend that it can be killed or damaged or modified by others
I am not sure about this one. I am very interested to know if this is not the case. I think, if we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency. In other words, the set of AIs built without taking embedded agency into account is closed under the operation of an AI building a new AI. [Upd: comments under this comment mostly refute this]
I am even less sure about this item, but maybe such an AI will be too dogmatic (as in dogmatic prior) about how the world might work, because it is sure that it can’t be killed or damaged or modified. Due to this, if the physics laws turn out to be weird (e.g. we live in a multiverse, or we’re in a simulation), the AI might fail to understand that and thus fail to turn the whole world into hedonium (or whatever it is that we would want it to do with the world).
If an AI built without taking embedded agency into account meets very smart aliens someday, it might fuck up due to its inability to imagine that someone can predict its actions.
Usefulness of type-2 research for aligning superintelligent AI
I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.
As I see it, the mechanisms by which type-2 research helps align superintelligent AI are:
It may produce useful empirical data which’ll help us make type-1 theoretical insights.
Thinking about type-2 research contains a small portion of type-1 thinking.
For example, if someone works on making contemporary neural networks robust to out-of-distribution examples, and they do that mainly by experimenting, their experimental data might provide insights about the nature of robustness in the abstract, and also, surely some portion of their thinking will be dedicated to the theory of robustness.
My views on tractability and neglectedness
Alright, I agree with you about tractability.
About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because
It’s practical, you can sell it to companies which want to make robots or unbreakable face detection or whatever.
Humans have a bias towards near-term thinking.
Neural networks are a hot topic.
I basically mean the third scenario:
I agree that you still need a strong guarantee of alignment in this scenario (as I mentioned in my original comment).
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
Cool, that’s more concrete, thanks. (I disagree, but there isn’t really an obvious point to argue on, the cruxes are in the other points.)
Agreed. Tbc, I wasn’t arguing it was neglected, just that you seemed to be ignoring tractability and neglectedness, which seemed like a mistake.
I see MIRI’s research on agent foundations (including embedded agency) as something like “We want to understand ${an aspect of how agents should work}, so let’s take the simplest case first and see if we understand everything about it. The simplest case is the case when the agent is nearly omniscient and knows all logical consequences. Hmm, we can’t figure out this simplest case yet; it breaks down if the conditions are sufficiently weird”. Since it turns out that it’s difficult to understand embedded agency even for such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.