How to prevent “aligned” AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even “aligned” AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can’t keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic “power corrupts” problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.
My position on this (that might be clear from previous discussions):
I agree this is a real problem.
From a technical perspective, I think this problem is even more distinct from the alignment problem than other AI safety problems are, so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)
I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.” New technology might change things in ways we don’t endorse. Natural pressures like death may lead to changes we don’t endorse (though that’s a tricky values call). The constraint of remaining economically/militarily competitive could also force our values to evolve in a bad way (alignment is an instance of that problem, and eventually AI+alignment would address the other natural instance by decoupling human values from the competence needed to remain competitive). And of course there is a hard problem in that we don’t know how to deliberate/reflect. The “figure out how to deliberate” problem seems like it is relatively easily postponed, since you don’t have to solve it until you are doing deliberation, but the “help people avoid errors in deliberation” may be more urgent.
The reason I consider alignment more urgent is entirely quantitative and very empirically contingent; I don’t think there is any simple argument against it. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future, leading to roughly 0.3%/year of losses from alignment, and right now it looks reasonably tractable. Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
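For concreteness, the 0.3%/year figure above can be reproduced with a quick sketch. The inputs are the subjective estimates quoted in the text, not data:

```python
# Back-of-the-envelope sketch of the loss-rate estimate above.
# All inputs are subjective estimates taken from the text.
p_superhuman = 1 / 3    # chance of solidly superhuman AI within 20 subjective years
value_destroyed = 0.20  # fraction of the future's value lost in those scenarios
horizon_years = 20      # subjective years over which the loss is amortized

annual_loss_rate = p_superhuman * value_destroyed / horizon_years
print(f"Expected loss from alignment: {annual_loss_rate:.2%}/year")  # ~0.33%/year
```

The claimed 10x gap then implies something like 0.03%/year for value drift interventions under the same assumptions.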
I don’t think I’m likely to work on this problem unless I either become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe) or feel like I’ve already poked at it enough that the VOI from more poking is lower than just charging ahead on alignment. But that is a stronger judgment than the last section, and I think it is largely due to comparative-advantage considerations, and I would certainly be supportive of work on this topic (e.g. I would be happy to fund it, would engage with it, etc.).
This is a leading contender for what I would do if alignment seemed unappealing, though I think that broader institutional improvement / capability enhancement / etc. seems more appealing. I’d definitely spend more time thinking about it.
I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical, while it’s not obvious whether it will accelerate solutions. I don’t think this is particularly important, but it does make me feel even more comfortable with the naming issue—this isn’t really a problem about AI at all; it’s just one of many issues that are modulated by AI.
I think the main way AI is relevant to the cost-effectiveness analysis of shaping-the-evolution-of-values is that it may decrease the amount of work that can be done on these problems between now and when they become serious (if AI is effectively accelerating the timeline for catastrophic value change without accelerating work on making values evolve in a way we’d endorse).
To the extent that the value of working on these problems is dominated by that scenario—”AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”—then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone already has a much better understanding, then developing a better shared understanding), (b) I really want a word other than “alignment,” and probably multiple words. I guess the one that feels most urgently unnamed right now is something like: understanding how values evolve and what influences may push that evolution in a way we don’t endorse, including social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.”
This statement of the problem seems to assume a subjectivist or anti-realist view of metaethics (items 4 or 5 on this list). Consider the analogous statement, “mathematicians’ beliefs about mathematical statements will evolve over time, and they have preferences about how their beliefs evolve”. I think a lot of mathematicians would object to that and instead say that they prefer to have true beliefs about mathematics, and their “preferences about how their beliefs evolve” are just their best guesses about how to arrive at true beliefs.
Assuming you agree that we can’t yet be certain about which metaethical position is correct, I think that by implicitly adopting a subjectivist/anti-realist framing you make the problem seem easier than we should expect it to be. It implies that the AI can just follow the user’s preferences about how their values evolve, instead of the AI (and indirectly the AI designer) potentially having (if a realist or relativist metaethical position is correct) an obligation/opportunity to help the user figure out what their true or normative values are, which may involve solving difficult metaethical and other philosophical questions.
Additionally, this framing makes the potential consequences of failing to solve the problem sound less serious than they could be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad; but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?” and would consider this an instance of the more general problem “When considering AI safety, it’s not safe to assume that the human user/operator/supervisor is a generally safe agent.”
Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you. Even if the total risk of “value corruption” is 10x smaller, it seems like the marginal impact of an additional researcher on “value corruption” could be higher, given that there are now about 20(?) full-time researchers working mostly on AI motivation but zero on this problem (as far as I know), and then we also have to consider the effect of a marginal researcher on the future growth of each field, and future effects on public opinion and policy makers. Unfortunately, I don’t know how to calculate these things even in a back-of-the-envelope way. As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
I don’t think I’m likely to work on this problem unless I either become much more pessimistic about working on alignment
I see no reason to convince you personally to work on “value corruption” since your intuition on the relative severity of the risks is so different from mine, and under either of our views we obviously still need people to work on motivation / alignment-in-your-sense. I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors.
To the extent that the value of working on these problems is dominated by that scenario—“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.“—then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
This seems totally reasonable to me, but 1) others may have other ideas about how to intervene on this problem, and 2) even within factored cognition or debate there are probably research directions that skew towards being more applicable to motivation and research directions that skew towards being more applicable to “value corruption” and I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you.
If you think this risk is very large, presumably there is some positive argument for why it’s so large? That seems like the most natural way to run the argument. I agree it’s not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.
In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: “(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don’t have sufficiently good understanding.” (Each of those claims obviously requires more argument, etc.)
One version of the case for worrying about value corruption would be:
It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
It may be that historical variation is itself problematic, and we care mostly about our particular values.
Or it may be that values are “hardened” against certain kinds of environment shift that occur in nature, and that they will go to some lower “default” level of robustness under new kinds of shifts.
Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
If so, the rate of change in subjective time could be reasonably high—perhaps the change that occurs within one generation could shift values far enough to reduce the value of the future by 50% (if that change wasn’t endorsed for decision-theoretic reasons / hardened against).
It’s plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.
This kind of story is fairly conjunctive, so I’d expect to explore a few lines of argument like this, and then try to figure out what the most important underlying uncertainties are (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).
My most basic concerns with this story are things like:
In “well-controlled” situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people’s consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We’ve talked a little bit about this, and you’ve pointed out some reasons this kind of scheme isn’t totally satisfactory even if it works as intended, but quantitatively the reasons you’ve pointed to don’t seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
In most practical situations, it doesn’t seem like “understanding of how to avert drift” is the key bottleneck to averting drift—it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve. That’s still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.
In the end I’m doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that’s how I get to the intuitive view that it’s maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it’s higher impact though.
As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other considerations not captured in the total quantity of risk—maybe I’d expect us to have 2-10x more resources per unit risk devoted to more tractable risks.
In this case I’d be happy with the recommendation of ~10x more people working on motivation than on value drift, that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.
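The staffing rule of thumb discussed in this exchange can be sketched as follows; the headcounts and factors are illustrative numbers taken from the discussion, not recommendations:

```python
# Sketch of the rule of thumb: researchers proportional to risk, modulated
# by an O(1) tractability factor. All numbers are illustrative.
def suggested_headcount(reference_headcount, risk_ratio, tractability_factor=1.0):
    """Headcount for the smaller problem, given the headcount on the bigger
    one, the risk ratio (bigger/smaller), and a relative tractability factor."""
    return reference_headcount * tractability_factor / risk_ratio

motivation_researchers = 20  # rough current headcount mentioned in the thread
risk_ratio = 10              # motivation risk taken as ~10x value-corruption risk

print(suggested_headcount(motivation_researchers, risk_ratio))  # -> 2.0
```

With the 2-10x tractability modifier mentioned above, the implied answer ranges from a fraction of a researcher up to a few, which is consistent with the “at least 2 people” suggestion.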
I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors. [...] I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
I agree I should be more careful about this.
I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I’m generally inclined to express my views), but could hedge more when making statements like this.
(I think saying “X is more urgent than Y” is basically compatible with the view “There should be 10 people working on X for each person working on Y,” even if one also believes “but actually on the current margin investment in Y might be a better deal.” Will edit the post to be a bit softer here though.
ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)
If you think this risk is very large, presumably there is some positive argument for why it’s so large?
Yeah, I didn’t literally mean that I don’t have any arguments, but rather that we’ve discussed it in the past and it seems like we didn’t get close to resolving our disagreement. I tend to think that Aumann Agreement doesn’t apply to humans, and it’s fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which, again, I don’t think is necessarily true for humans), if even from your own perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there are too many people working on this problem, which might never actually happen).
it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff.
I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms: one defined broadly, so as to include all the problems that might be urgent as well as lower-priority problems and problems whose priority we’re not sure about, and another defined to be a specific urgent problem? Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Yes:
The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
The word “alignment” was chosen (originally by Stuart Russell, I think) precisely because it is such a good name for the problem of aligning AI values with human values, it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said:
Some of the main things I want from a term are:
A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.
[…] [“AI safety” or “beneficial AI”] doesn’t work so well for A—it’s commonly used to include things like misuse risk.
[continuing last point] The proposed usage of “alignment” doesn’t meet this desideratum though; it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular, it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
People have introduced a lot of terms and changed terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless; it really does make things more and more confusing.
I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), (c) is generally less optimized (related to the second point above, “alignment” is quite a good term for this area).
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”“)
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?
I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search), but I can certainly see how it muddies the waters from your perspective.
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed
This seems fine to me, if you could give the benefit of the doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
I think it’s reasonable for me to be more careful about clarifying what any particular line of research does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
(They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I guess both “reliable” and “beneficial” are matters of degree, so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile, in your sense of alignment, they are at best equally aligned, and the latter might actually be less aligned if H has a wrong idea of metaethics or of what his true/normative values are, so that trying to figure out and satisfy those values is not something that H wants A to do.
I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
That seems good too.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
“Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment).
In what sense is that a more beneficial goal?
“Successfully do X” seems to be the same goal as X, isn’t it?
“Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
(This is why I wrote:
“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it, which suggests more of a de re reading. To use your example: “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would say that an A that correctly understands H’s preferences (and gets oranges) is more reliably pursuing beneficial goals.
Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
If that’s right, how is this generalization defined? (E.g., CEV was “what H wants in the limit of infinite intelligence, reasoning time, and complete information”.)
I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms to “alignment” because they seemed too broad, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figuring out what their true or normative values are, or protecting their values against manipulation), than to convince all the potential users to want that themselves.
If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
I think that both
(a) Trying to have influence over aspects of value change that people don’t much care about, and
(b) better understanding the important processes driving changes in values
are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
(I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
Trying to have influence over aspects of value change that people don’t much care about … [is] reasonable … to do to make the future better
This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) is what’s relevant.
It’s agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change the values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of the unchanged original values (to the extent that they are defined at all).
Assuming you agree that we can’t be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be.
I don’t see why the anti-realist version is any easier, my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realistic mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more potential prospect for error-correction, but I don’t think it’s a big difference.
Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than it could potentially be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
It doesn’t sound that way to me, but I’m happy to avoid framings that might give people the wrong idea.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?”
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
But in terms of the actual meanings rather than their impacts on people, I’d be about as happy with “avoiding corruption of values” as “having our values evolve in a positive way.” I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.
I don’t see why the anti-realist version is any easier
It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve. But you’re right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed also does not convey a proper amount of uncertainty in its difficulty.
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
That’s a good point that I hadn’t thought of. (I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I’d highly welcome that.
It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve.
That may be a connotation of the “preferences about how their values evolve,” but doesn’t seem like it follows from the anti-realist position.
I have preferences over what actions my robot takes. Yet if you asked me “what action do you want the robot to take?” I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don’t know). My preferences over value evolution can be similar.
Indeed, if moral realists are right, “ultimately converge to the truth” is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing “help people’s preferences evolve in the way they want them to evolve.”) Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it’s easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).
I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.
Would you agree with this way of stating it: There are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.
Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn’t be a precisely analogous error on the non-realistic perspective?
There is some uninteresting semantic sense in which there are “more ways to be wrong” (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.
I might be using the word “values” in a different way than you. I think I can say something like “I’d like to deliberate in way X” and be wrong. I guess under non-realism I’m “incorrectly stating my preferences” and under realism I could be “correctly stating my preferences but be wrong,” but I don’t see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.
Suppose the user says “I want to try to figure out my true/normative values by doing X. Please help me do that.” If moral anti-realism is true, then the AI can only check if the user really wants to do X (e.g., by looking into the user’s brain and checking if X is encoded as a preference somewhere). But if moral realism is true, the AI could also use its own understanding of metaethics and metaphilosophy to predict if doing X would reliably lead to the user’s true/normative values, and warn the user or refuse to help or take some other action if the answer is no. Or if one can’t be certain about metaethics yet, and it looks like X might prematurely lock the user into the wrong values, the AI could warn the user about that.
I definitely don’t mean such a narrow sense of “want my values to evolve.” Seems worth using some language to clarify that.
In general the three options seem to be:
You care about what is “good” in the realist sense.
You care about what the user “actually wants” in some idealized sense.
You care about what the user “currently wants” in some narrow sense.
It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you’d be in the first case if you accepted realism, it seems like you’d probably be in the second case if you rejected realism. Of course that depends on the nature of your uncertainty about realism; your views could depend in an arbitrary way on whether realism is true or false, depending on what versions of realism/non-realism are competing, but I’m assuming something like the most common realist and non-realist views around here.)
To defend my original usage both in this thread and in the OP, which I’m not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted—the usual English usage of these terms involves at least mild idealization.
My position on this (that might be clear from previous discussions):
I agree this is a real problem.
From a technical perspective, I think this is even further from the alignment problem (than other AI safety problems), so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)
I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.” New technology might change things in ways we don’t endorse. Natural pressures like death may lead to changes we don’t endorse (though that’s a tricky values call). The constraint of remaining economically/militarily competitive could also force our values to evolve in a bad way (alignment is an instance of that problem, and eventually AI+alignment would address the other natural instance by decoupling human values from the competence needed to remain competitive). And of course there is a hard problem in that we don’t know how to deliberate/reflect. The “figure out how to deliberate” problem seems like it is relatively easily postponed, since you don’t have to solve it until you are doing deliberation, but the “help people avoid errors in deliberation” may be more urgent.
The reason I consider alignment more urgent is entirely quantitative and very empirically contingent; I don’t think there is any simple argument against. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future, leading to 0.3%/year of losses from alignment, and right now it looks reasonably tractable. Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
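The arithmetic implied by those figures can be sketched as follows (the point estimates themselves are just the rough guesses stated above, not established numbers):

```python
# Back-of-the-envelope annualized loss from alignment failure,
# using the rough figures quoted above.
p_superhuman = 1 / 3   # chance of solidly superhuman AI within the horizon
value_at_risk = 0.20   # fraction of the future's value destroyed in that scenario
horizon_years = 20     # subjective years

loss_per_year = p_superhuman * value_at_risk / horizon_years
print(f"{loss_per_year:.2%} per year")  # 0.33% per year, matching the ~0.3%/year figure

# The "maybe 10x less cost-effective" guess for other ways of influencing
# the trajectory of society's values would then correspond to roughly:
print(f"{loss_per_year / 10:.3%} per year")
```

This is only expected-value bookkeeping, but it makes clear how sensitive the 0.3%/year figure is to each of the three inputs.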
I don’t think I’m likely to work on this problem unless I become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe); I feel like I’ve already poked at it enough that the VOI from more poking is lower than just charging ahead on alignment. But that is a stronger judgment than the last section, and I think it is largely due to comparative advantage considerations, and I would certainly be supportive of work on this topic (e.g. would be happy to fund it, would engage with it, etc.)
This is a leading contender for what I would do if alignment seemed unappealing, though I think that broader institutional improvement / capability enhancement / etc. seems more appealing. I’d definitely spend more time thinking about it.
I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical while it’s not obvious whether it will accelerate solutions. I don’t think this is particularly important but does make me feel even more comfortable with the naming issue—this isn’t really a problem about AI at all, it’s just one of many issues that is modulated by AI.
I think the main way AI is relevant to the cost-effectiveness analysis of shaping-the-evolution-of-values is that it may decrease the amount of work that can be done on these problems between now and when they become serious (if AI is effectively accelerating the timeline for catastrophic value change without accelerating work on making values evolve in a way we’d endorse).
To the extent that the value of working on these problems is dominated by that scenario—”AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”—then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone already has a much better understanding, then developing a better shared understanding), and (b) I really want a word other than “alignment,” and probably multiple words. I guess the one that feels most urgently-unnamed right now is something like: understanding how values evolve and what factors may push that evolution in a direction we don’t endorse, including social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
This statement of the problem seems to assume a subjectivist or anti-realist view of metaethics (items 4 or 5 on this list). Consider the analogous statement, “mathematicians’ beliefs about mathematical statements will evolve over time, and they have preferences about how their beliefs evolve”. I think a lot of mathematicians would object to that and instead say that they prefer to have true beliefs about mathematics, and their “preferences about how their beliefs evolve” are just their best guesses about how to arrive at true beliefs.
Assuming you agree that we can’t be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be. It implies that instead of the AI (and indirectly the AI designer) potentially having (if a realist or relativist metaethical position is correct) an obligation/opportunity to help the user figure out what their true or normative values are, which may involve solving difficult metaethical and other philosophical questions, the AI can just follow the user’s preferences about how their values evolve.
Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than it could potentially be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?” and would consider this an instance of the more general problem “When considering AI safety, it’s not safe to assume that the human user/operator/supervisor is a generally safe agent.”
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you. Even if the total risk of “value corruption” is 10x smaller, it seems like the marginal impact of an additional researcher on “value corruption” could be higher given that there are now about 20(?) full time researchers working mostly on AI motivation but zero on this problem (as far as I know), and then we also have to consider the effect of a marginal researcher on the future growth of each field, and future effects on public opinion and policy makers. Unfortunately, I don’t know how to calculate these things even in a back-of-the-envelope way. As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
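The rule of thumb in this paragraph is simple proportional allocation; a minimal sketch, using only the rough figures quoted above (the ~20 researcher count and the hypothetical 10x risk ratio):

```python
# Rule of thumb: allocate researchers in proportion to estimated x-risk.
# Both inputs are rough guesses from the discussion above.
alignment_researchers = 20  # approximate count working full time on AI motivation
risk_ratio = 10             # "value corruption" assumed 10x smaller a risk

value_corruption_researchers = alignment_researchers / risk_ratio
print(value_corruption_researchers)  # 2.0 -- the "at least 2 people" figure
```

The heuristic ignores marginal returns and comparative advantage, which is exactly the caveat the surrounding paragraphs are debating.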
I see no reason to convince you personally to work on “value corruption” since your intuition on the relative severity of the risks is so different from mine, and under either of our views we obviously still need people to work on motivation / alignment-in-your-sense. I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors.
This seems totally reasonable to me, but 1) others may have other ideas about how to intervene on this problem, and 2) even within factored cognition or debate there are probably research directions that skew towards being more applicable to motivation and research directions that skew towards being more applicable to “value corruption” and I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
If you think this risk is very large, presumably there is some positive argument for why it’s so large? That seems like the most natural way to run the argument. I agree it’s not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.
In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: “(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don’t have sufficiently good understanding.” (Each of those claims obviously requires more argument, etc.)
One version of the case for worrying about value corruption would be:
It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
It may be that historical variation is itself problematic, and we care mostly about our particular values.
Or it may be that values are “hardened” against certain kinds of environment shift that occur in nature, and that they will go to some lower “default” level of robustness under new kinds of shifts.
Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
If so, the rate of change in subjective time could be reasonably high—perhaps the change that occurs within one generation could shift values far enough to reduce value by 50% (if that change wasn’t endorsed for decision-theoretic reasons / hardened against).
It’s plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.
A story like this is fairly conjunctive, so I'd expect to explore a few lines of argument like this, and then try to figure out which underlying uncertainties matter most (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).
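The conjunctive structure above can be sketched as a back-of-envelope multiplication of step probabilities. This is only an illustration of why conjunctive stories lose probability quickly: all step names and numbers below are placeholder assumptions (the ~50% figures loosely echo the ones mentioned above), not estimates anyone in this discussion endorses.

```python
# Back-of-envelope sketch of a conjunctive argument: the story only goes
# through if every step holds, so its probability is roughly the product
# of the step probabilities. All numbers here are placeholder assumptions.

steps = {
    "human values are sensitive to environmental shifts": 0.5,
    "AI accelerates drift far more than drift-prevention research": 0.5,
    "better understanding of drift would actually get used": 0.5,
    "current understanding is poor but realistically improvable": 0.5,
}

p_story = 1.0
for claim, p in steps.items():
    p_story *= p  # each conjunct multiplies the overall probability down

print(f"P(story holds) ~ {p_story:.2f}")
```

Varying one step at a time in a sketch like this shows which uncertainty dominates the conclusion, which matches the suggestion above to focus on steps shared across many arguments of this form.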
My most basic concerns with this story are things like:
In “well-controlled” situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people’s consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We’ve talked a little bit about this, and you’ve pointed out some reasons this kind of scheme isn’t totally satisfactory even if it works as intended, but quantitatively the reasons you’ve pointed to don’t seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
In most practical situations, it doesn’t seem like “understanding of how to avert drift” is the key bottleneck to averting drift; the basic problem is that most people just don’t care about averting drift at all, and have no inclination to be thoughtful about how their own preferences evolve. That’s still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.
In the end I’m doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that’s how I get to the intuitive view that it’s maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it’s higher impact though.
I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other considerations not captured in the total quantity of risk; maybe I’d expect us to have 2-10x more resources per unit risk devoted to more tractable risks.
In this case I’d be happy with the recommendation of ~10x more people working on motivation than on value drift, that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.
I agree I should be more careful about this.
I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I’m generally inclined to express my views), but could hedge more when making statements like this.
(I think saying “X is more urgent than Y” is basically compatible with the view “There should be 10 people working on X for each person working on Y,” even if one also believes “but actually on the current margin investment in Y might be a better deal.” Will edit the post to be a bit softer here though.
ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)
Yeah, I didn’t literally mean that I don’t have any arguments, but rather that we’ve discussed it in the past and it seems like we didn’t get close to resolving our disagreement. I tend to think that Aumann agreement doesn’t apply to humans, and it’s fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don’t think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there are too many people working on this problem, which might never actually happen).
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we’re not sure about, and another one that is defined to be a specific urgent problem. Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Yes:
The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
The word “alignment” was chosen (originally by Stuart Russell, I think) precisely because it is such a good name for the problem of aligning AI values with human values; it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it were mostly or entirely about pointing your AI in the right direction.
In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said
[continuing last point] The proposed usage of “alignment” doesn’t meet this desideratum though; it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
People have introduced a lot of terms and change terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless, it really does make things more and more confusing.
I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), (c) is generally less optimized (related to the second point above, “alignment” is quite a good term for this area).
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned,” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search), but I can certainly see how it muddies the waters from your perspective.
This seems fine to me, if you could give the benefit of the doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
I think it’s reasonable for me to be more careful about clarifying what any particular line of research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined, I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
(They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
I guess both “reliable” and “beneficial” are matters of degree so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned and the latter might actually be less aligned if H has a wrong idea of metaethics or what his true/normative values are and as a result trying to figure out and satisfy those values is not something that H wants A to do.
That seems good too.
This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
In what sense is that a more beneficial goal?
“Successfully do X” seems to be the same goal as X, isn’t it?
“Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
(This is why I wrote:
)
Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it which suggests more of a de re reading. To use your example “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would call an A that correctly understands H’s preferences (and gets oranges) more reliably pursuing beneficial goals.
Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
If that’s right, how is this generalization defined? (E.g., CEV defined it as “what H wants in the limit of infinite intelligence, reasoning time, and complete information”.)
I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms to “alignment” because they seemed to be too broad a definition, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.
10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
People not caring about X prima facie decreases the returns to research on X, but may increase the returns to advocacy (or to acquiring resources/influence, or to more creative interventions). That bullet point was really about the returns to research.
It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figure out what their true or normative values are, or protect their values against manipulation) than to convince all the potential users to want that themselves.
I agree that:
If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
I think that both
(a) Trying to have influence over aspects of value change that people don’t much care about, and
(b) better understanding the important processes driving changes in values
are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
(I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) is what’s relevant.
It’s agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, still from the point of view of the unchanged original values (to the extent that they are defined at all).
I don’t see why the anti-realist version is any easier, my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realistic mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more potential prospect for error-correction, but I don’t think it’s a big difference.
It doesn’t sound that way to me, but I’m happy to avoid framings that might give people the wrong idea.
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
But in terms of the actual meanings rather than their impacts on people, I’d be about as happy with “avoiding corruption of values” as “having our values evolve in a positive way.” I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.
It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve. But you’re right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed also does not convey a proper amount of uncertainty in its difficulty.
That’s a good point that I hadn’t thought of. (I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I’d highly welcome that.
That may be a connotation of the “preferences about how their values evolve,” but doesn’t seem like it follows from the anti-realist position.
I have preferences over what actions my robot takes. Yet if you asked me “what action do you want the robot to take?” I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don’t know). My preferences over value evolution can be similar.
Indeed, if moral realists are right, “ultimately converge to the truth” is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing “help people’s preferences evolve in the way they want them to evolve.”) Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it’s easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).
I agree that drift is also problematic.
Would you agree with this way of stating it: There are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.
Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn’t be a precisely analogous error on the non-realistic perspective?
There is some uninteresting semantic sense in which there are “more ways to be wrong” (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.
I might be using the word “values” in a different way than you are. I think I can say something like “I’d like to deliberate in way X” and be wrong. I guess under non-realism I’m “incorrectly stating my preferences” and under realism I could be “correctly stating my preferences but be wrong,” but I don’t see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.
Suppose the user says “I want to try to figure out my true/normative values by doing X. Please help me do that.” If moral anti-realism is true, then the AI can only check if the user really wants to do X (e.g., by looking into the user’s brain and checking if X is encoded as a preference somewhere). But if moral realism is true, the AI could also use its own understanding of metaethics and metaphilosophy to predict if doing X would reliably lead to the user’s true/normative values, and warn the user or refuse to help or take some other action if the answer is no. Or if one can’t be certain about metaethics yet, and it looks like X might prematurely lock the user into the wrong values, the AI could warn the user about that.
I definitely don’t mean such a narrow sense of “want my values to evolve.” Seems worth using some language to clarify that.
In general the three options seem to be:
You care about what is “good” in the realist sense.
You care about what the user “actually wants” in some idealized sense.
You care about what the user “currently wants” in some narrow sense.
It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you’d be in the first case if you accepted realism, it seems like you’d probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism; your views could depend in an arbitrary way on whether realism is true or false, depending on which versions of realism/non-realism are competing, but I’m assuming something like the most common realist and non-realist views around here.)
To defend my original usage both in this thread and in the OP, which I’m not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted—the usual English usage of these terms involves at least mild idealization.