To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you.
If you think this risk is very large, presumably there is some positive argument for why it’s so large? That seems like the most natural way to run the argument. I agree it’s not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.
In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: “(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don’t have sufficiently good understanding.” (Each of those claims obviously requires more argument, etc.)
One version of the case for worrying about value corruption would be:
It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
It may be that historical variation is itself problematic, and we care mostly about our particular values.
Or it may be that values are “hardened” against certain kinds of environment shift that occur in nature, and that they will go to some lower “default” level of robustness under new kinds of shifts.
Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
If so, the rate of change in subjective time could be reasonably high—perhaps the change that occurs within one generation could shift value far enough to reduce value by 50% (if that change wasn’t endorsed for decision-theoretic reasons / hardened against).
It’s plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.
This kind of story is kind of conjunctive, so I’d expect to explore a few lines of argument like this, and then try to figure out what are the most important underlying uncertainties (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).
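To make the conjunctive structure concrete, here is a minimal back-of-the-envelope sketch in Python. Only the two 0.5 figures echo the 50% numbers mentioned in the steps above; every other probability and variable name is an illustrative placeholder, not an estimate anyone in this exchange has endorsed.

```python
# Back-of-the-envelope sketch of how a conjunctive story multiplies out.
# Only the two 0.5 figures echo numbers from the argument above; the other
# probabilities are illustrative placeholders, not anyone's actual estimates.

p_values_sensitive = 0.5       # placeholder: human values are fragile to new kinds of shift
p_ai_accelerates_drift = 0.5   # from above: AI accelerates drift-driving change, not the fix
p_understanding_matters = 0.5  # placeholder: better understanding would actually get used
loss_if_story_holds = 0.5      # from above: ~one generation of unendorsed drift

expected_value_loss = (p_values_sensitive
                       * p_ai_accelerates_drift
                       * p_understanding_matters
                       * loss_if_story_holds)

print(f"Expected fraction of value lost under this story: {expected_value_loss:.3f}")
# 0.062 with these placeholders; each extra conjunct shrinks the product,
# which is why the next step is to look for steps shared by most versions
# of the argument rather than leaning on any single chain of conjuncts.
```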
My most basic concerns with this story are things like:
In “well-controlled” situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people’s consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We’ve talked a little bit about this, and you’ve pointed out some reasons this kind of scheme isn’t totally satisfactory even if it works as intended, but quantitatively the reasons you’ve pointed to don’t seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
In most practical situations, it doesn’t seem like “understanding of how to avert drift” is the key bottleneck to averting drift—it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve. That’s still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.
In the end I’m doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that’s how I get to the intuitive view that it’s maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it’s higher impact though.
As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other factors not captured in the total quantity of risk—maybe I’d expect us to have 2-10x more resources per unit risk devoted to more tractable risks.
In this case I’d be happy with the recommendation of ~10x more people working on motivation than on value drift, that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.
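As a rough illustration of the arithmetic being gestured at here, this is a minimal sketch assuming a toy model in which staffing is proportional to relative risk divided by an O(1) tractability adjustment; the function itself is invented for illustration, and only the ~10x and 2-10x figures come from the exchange above.

```python
def suggested_staffing_ratio(risk_ratio: float, tractability_multiplier: float = 1.0) -> float:
    """Rule-of-thumb ratio: people on risk X per person on risk Y.

    risk_ratio: how many times larger risk X seems than risk Y.
    tractability_multiplier: O(1) adjustment (e.g. 2-10) if work on the
    smaller risk Y looks more tractable per unit of risk.
    """
    return risk_ratio / tractability_multiplier

# Motivation vs. value drift, using the ~10x impact gap from this exchange:
print(suggested_staffing_ratio(10))     # 10.0 -> ~10 people on motivation per person on drift
print(suggested_staffing_ratio(10, 2))  # 5.0  -> if drift work were ~2x more tractable
```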
I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors. [...] I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
I agree I should be more careful about this.
I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I’m generally inclined to express my views), but could hedge more when making statements like this.
(I think saying “X is more urgent than Y” is basically compatible with the view “There should be 10 people working on X for each person working on Y,” even if one also believes “but actually on the current margin investment in Y might be a better deal.” Will edit the post to be a bit softer here though.
ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)
If you think this risk is very large, presumably there is some positive argument for why it’s so large?
Yeah, I didn’t literally mean that I don’t have any arguments, but rather that we’ve discussed it in the past and it seems like we didn’t get close to resolving our disagreement. I tend to think that Aumann Agreement doesn’t apply to humans, and it’s fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don’t think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there’s too many people working on this problem, which might never actually happen).
it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff.
I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we’re not sure about, and another one that is defined to be a specific urgent problem? Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Yes:
The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
The word “alignment” was chosen (originally by Stuart Russell, I think) precisely because it is such a good name for the problem of aligning AI values with human values; it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said
“Some of the main things I want from a term are:
A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.
[…] [“AI safety” or “beneficial AI”] doesn’t work so well for A—it’s commonly used to include things like misuse risk.”
[continuing last point] The proposed usage of “alignment” doesn’t meet this desideratum though; it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
People have introduced a lot of terms and changed terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless; it really does make things more and more confusing.
I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), (c) is generally less optimized (related to the second point above, “alignment” is quite a good term for this area).
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””)
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?
I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search), but I can certainly see how it muddies the waters from your perspective.
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed
This seems fine to me, if you could give the benefit of the doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
I think it’s reasonable for me to be more careful about clarifying what any particular line of research does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
(They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I guess both “reliable” and “beneficial” are matters of degree so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned and the latter might actually be less aligned if H has a wrong idea of metaethics or what his true/normative values are and as a result trying to figure out and satisfy those values is not something that H wants A to do.
I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
That seems good too.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
“Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment).
In what sense is that a more beneficial goal?
“Successfully do X” seems to be the same goal as X, isn’t it?
“Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
(This is why I wrote:
“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
)
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it which suggests more of a de re reading. To use your example “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would call an A that correctly understands H’s preferences (and gets oranges) more reliably pursuing beneficial goals.
Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
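A toy sketch of the de re / de dicto contrast behind the apples/oranges example above; the names and the simplistic setup are invented purely for illustration and are not anyone’s actual proposal.

```python
# Toy illustration of de dicto vs. de re alignment, using the example where
# A believes H likes apples but H really prefers oranges. All names here are
# invented for the illustration.

TRUE_PREFERENCE = "oranges"   # what H in fact prefers (the de re target)
A_MODEL_OF_H = "apples"       # A's mistaken belief about H's preference

def de_dicto_choice() -> str:
    # De dicto: A is trying to do "whatever H wants," so it acts on its best
    # current guess; it can be aligned in this sense and still buy apples.
    return A_MODEL_OF_H

def de_re_choice() -> str:
    # De re: alignment is judged against H's actual preference, so only
    # getting oranges counts as reliably pursuing the beneficial goal.
    return TRUE_PREFERENCE

print(de_dicto_choice())  # "apples"  -> de dicto aligned, empirically mistaken
print(de_re_choice())     # "oranges" -> the reading that "reliably" suggests
```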
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
If that’s right, how is this generalization defined? (E.g. The CEV was “what H wants in the limit of infinite intelligence, reasoning time and complete information”.)
I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms to “alignment” because they seemed too broad, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figure out what their true or normative values are, or protect their values against manipulation), than to convince all the potential users to want that themselves.
I agree that:
If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
I think that both
(a) Trying to have influence over aspects of value change that people don’t much care about, and
(b) better understanding the important processes driving changes in values
are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
(I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
Trying to have influence over aspects of value change that people don’t much care about … [is] reasonable … to do to make the future better
This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) is what’s relevant.
It’s agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, still from the point of view of the unchanged original values (to the extent that they are defined at all).