Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
Humans seem to solve the motivation subproblem, whereas humans don’t seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can’t really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
It is easier to apply to systems without much capability, though as the post notes, it probably still does need to have some level of capability. While a digit recognition system is useful, it doesn’t seem meaningful to talk about whether it is “trying” to help us.
Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart’s Law suggests that you can get very bad outcomes. With motivation-competence, I’ve already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what “slightly worse” means for motivation, but I’m optimistic.
We’ve been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we’ve found a lot of negative results and not very many positive ones.
The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).
The cons are:
It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
There hasn’t been much work done in this paradigm, so it is not obvious that there is progress to make.
I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future. ETA: Written up in the chapter on Goals vs Utility Functions in the Value Learning sequence, particularly in Coherence arguments do not imply goal-directed behavior.
This is a great comment, and maybe it should even be its own post. It clarified a bunch of things for me, and I think was the best concise argument for “we should try to build something that doesn’t look like an expected utility maximizer” that I’ve read so far.
I agree with habryka that this is a really good explanation. I also agree with most of your pros and cons, but for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of “AI alignment” and into the “competence” part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.
In other words, the motivation-competence decomposition seems potentially very useful to me as a way to break down a larger problem into smaller parts so it can be solved more easily, but I don’t agree that the urgent/not-urgent divide lines up neatly with the motivation/competence divide.
Aside from the practical issue of confusion between different usages of “AI alignment” (I think others like MIRI had been using “AI alignment” in a broader sense before Paul came up with his narrower definition), even using “AI alignment” in a context where it’s clear that I’m using Paul’s definition gives me the feeling that I’m implicitly agreeing to his understanding of how various subproblems should be prioritized.
Aside from the practical issue of confusion between different usages of “AI alignment” (I think others like MIRI had been using “AI alignment” in a broader sense before Paul came up with his narrower definition)
I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using “AI alignment” to refer to what Bostrom calls the “second principal-agent problem” (he objected to my use of “control”). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with that: if the agent is trying to do what the principal wants, it seems like you’ve solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.
In practice, essentially all of MIRI’s work seems to fit within this narrower definition, so I’m not too concerned at the moment with this practical issue (I don’t know of any work MIRI feels strongly about that doesn’t fit in this definition). We had a thread about this after it came up on LW in April, where we kind of decided to stick with something like “either make the AI trying to do the right thing, or somehow cope with the problems introduced by it trying to do the wrong thing” (so including things like boxing), but to mostly not worry too much since in practice basically the same problems are under both categories.
I should have updated this post before it got rerun as part of the sequence.
The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.
An advanced agent can be said to be “totally aligned” when it can assess the exact value of well-described outcomes and hence the exact subjective value of actions, policies, and plans; where value has its overridden meaning of a metasyntactic variable standing in for “whatever we really do or really should value in the world or want from an Artificial Intelligence” (this is the same as “normative” if the speaker believes in normativity).
I think this clearly includes the kinds of problems I’m talking about in this thread. Do you agree? Also supporting my view is the history of “Friendliness” being a term that included the problem of better understanding the user’s values (as in CEV) and then MIRI giving up that term in favor of “alignment” as an apparently exact synonym. See this MIRI post which talks about “full alignment problem for fully autonomous AGI systems” and links to Arbital.
In practice, essentially all of MIRI’s work seems to fit within this narrower definition, so I’m not too concerned at the moment with this practical issue
I think you may have misunderstood what I meant by “practical issue”. My point was that if you say something like “I think AI alignment is the most urgent problem to work on” the listener could easily misinterpret you as meaning “alignment” in the MIRI/Arbital sense. Or if I say “AI alignment is the most urgent problem to work on” in the MIRI/Arbital sense of alignment, the listener could easily misinterpret me as meaning “alignment” in your sense.
Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don’t know how to do.)
Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don’t know how to do.)
I think MIRI’s first use of this term was here where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’” which is basically the same as my definition. (Perhaps slightly weaker, since “do what the user wants you to do” is just one beneficial goal.) This talk never defines alignment, but the slide introducing the big picture says “Take-home message: We’re afraid it’s going to be technically difficult to point AIs in an intuitively intended direction” which also really suggests it’s about trying to point your AI in the right direction.
The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field “alignment” and then defining it to be “whatever is important” as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing. All of the topics in the “AI alignment” domain (except for mindcrime, which is borderline) fit under the narrower definition, and the people on the list of alignment researchers are all working on the narrower problem.
So I think the way this term is used in practice basically matches this narrower definition.
As I mentioned, I was previously happily using the term “AI control.” Rob Bensinger suggested that I stop using that term and instead use AI alignment, proposing a definition of alignment that seemed fine to me.
I don’t think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn’t seem to match up with reality in any particular way, except insofar as it’s capturing the problems that a certain group of people work on. I don’t really see any argument in favor except the historical precedent, which I think is dubious in light of all of the conflicting definitions, the actual usage, and the explicit move to standardize on “alignment” where an alternative definition was proposed.
(In the discussion, the compromise definition suggested was “cope with the fact that the AI is not trying to do what we want it to do, either by aligning incentives or by mitigating the effects of misalignment.”)
The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.
Is this intended (/ do you understand this) to include things like “make your AI better at predicting the world,” since we expect that agents who can make better predictions will achieve better outcomes?
If this isn’t included, is that because “sufficiently advanced” includes making good predictions? Or because of the empirical view that ability to predict the world isn’t an important input into producing good outcomes? Or something else?
If this definition doesn’t distinguish alignment from capabilities, then that seems like a non-starter to me which is neither useful nor captures the typical usage.
If this excludes making better prediction because that’s assumed by “sufficiently advanced agent,” then I have all sorts of other questions (does “sufficiently advanced” include all particular empirical knowledge relevant to making the world better? does it include some arbitrary category not explicitly carved out in the definition?)
In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That’s not so different from using the term to capture (say) physics problems that would exist whether or not we built AI; both feel bad to me.
Independently of this issue, it seems like “the kinds of problems you are talking about in this thread” need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).
The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction
But the page includes:
“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.
which seems to be outside of just “pointing an AI in a direction”
Is this intended (/ do you understand this) to include things like “make your AI better at predicting the world,” since we expect that agents who can make better predictions will achieve better outcomes?
I think so, at least for certain kinds of predictions that seem especially important (i.e., may lead to x-risk if done badly). See this Arbital page, which is under AI Alignment:
Vingean reflection is reasoning about cognitive systems, especially cognitive systems very similar to yourself (including your actual self), under the constraint that you can’t predict the exact future outputs. We need to make predictions about the consequence of operating an agent in an environment via reasoning on some more abstract level, somehow.
If this definition doesn’t distinguish alignment from capabilities, then that seems like a non-starter to me which is neither useful nor captures the typical usage.
It seems to me that Rohin’s proposal of distinguishing between “motivation” and “capabilities” is a good one, and then we can keep using “alignment” for the set of broader problems that are in line with the MIRI/Arbital definition and examples.
In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That’s not so different from using the term to capture (say) physics problems that would exist whether or not we built AI, both feel bad to me.
It seems fine to me to include 1) problems that are greatly exacerbated by AI and 2) problems that aren’t caused by AI but may be best solved/ameliorated by some element of AI design, since these are problems that AI researchers have a responsibility over and/or can potentially contribute to. If there’s a problem that isn’t exacerbated by AI and does not seem likely to have a solution within AI design then I’d not include that.
Independently of this issue, it seems like “the kinds of problems you are talking about in this thread” need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).
for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of “AI alignment” and into the “competence” part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.
I think it’s bad to use a definitional move to try to implicitly prioritize or deprioritize research. I think I shouldn’t have written: “I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.”
That said, I do think it’s important that these seem like conceptually different problems and that different people can have different views about their relative importance—I really want to discuss them separately, try to solve them separately, compare their relative values (and separate that from attempts to work on either).
I don’t think it’s obvious that alignment is higher priority than these problems, or than other aspects of safety. I mostly think it’s a useful category to be able to talk about separately. In general I think that it’s good to be able to separate conceptually separate categories, and I care about that particularly much in this case because I care particularly much about this problem. But I also grant that the term has inertia behind it and so choosing its definition is a bit loaded and so someone could object on those grounds even if they bought that it was a useful separation.
(I think that “defending their values against manipulation from other AIs” wasn’t included under any of the definitions of “alignment” proposed by Rob in our email discussion about possible definitions, so it doesn’t seem totally correct to refer to this as “moving” those subproblems, so much as there already existing a mess of imprecise definitions some of which included and some of which excluded those subproblems.)
Yeah, that seems right. I would probably defend the claim that motivation contains the most urgent part in the same way that Paul has done in the past—it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values, and that it should not do irreversible high-impact actions until then. I’m less optimistic about defending values against manipulation, because you probably need to be very competent for that, and you can’t take your time to become more competent, but that seems like a further-away problem to me and so less urgent.
(I don’t think I have much to add over the discussions you and Paul have had in the past, but I’m happy to clarify my opinion if it seems useful to you—perhaps my way of stating things will click where Paul’s way didn’t, idk. Or I might have different opinions and not realize it.)
I would support treating this simply as a decomposition, without also packing in the implication that motivation/competence corresponds to urgent/not-urgent, though I suspect it is quite hard to do that now.
I’m happy to clarify my opinion if it seems useful to you—perhaps my way of stating things will click where Paul’s way didn’t
I would highly welcome that. BTW if you see me argue with Paul in the future (or in the past) and I seem to be not getting something, please feel free to jump in and explain it a different way. I often find it easier to understand one of Paul’s ideas from someone else’s explanation.
it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values
Yes, that seems easy, but actually helping seems much harder.
and that it should not do irreversible high-impact actions until then
How do you determine what is “high-impact” before you have a utility function? Even “reversible” is relative to a utility function, right? It doesn’t mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?
It seems to me that “avoid irreversible high-impact actions” would only work if one had a small amount of uncertainty over one’s utility function, in which case you could just avoid actions that are considered “irreversible high-impact” by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just have very little idea what your utility function looks like, that doesn’t work because almost any action could be “irreversible high-impact”. For example, if I were a negative utilitarian I perhaps ought to spend all my resources trying to stop technological progress leading to space colonization, so anything that I do besides that would be “irreversible high-impact” unless I could go back in time and change my resource allocation.
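To make the argument above concrete, here is a minimal Python sketch. It is only an illustration, not anything proposed in this thread. It assumes, hypothetically, that each candidate utility function can be summarized by a predicate saying whether it treats an action as “irreversible high-impact”; the names `allowed_actions` and `mass_threshold`, the example actions, and the probabilities are all invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (probability mass, predicate that returns True
# when this candidate utility function regards the action as
# "irreversible high-impact").
Hypothesis = Tuple[float, Callable[[str], bool]]


def allowed_actions(
    actions: List[str],
    hypotheses: List[Hypothesis],
    mass_threshold: float = 0.05,
) -> List[str]:
    """Keep only actions that no significantly likely hypothesis flags."""
    significant = [judge for mass, judge in hypotheses if mass >= mass_threshold]
    return [a for a in actions if not any(judge(a) for judge in significant)]


actions = ["wait", "run small experiment", "build factory"]


def flags_factory(action: str) -> bool:
    return action == "build factory"


# Small uncertainty: the candidate utility functions agree on what is risky.
narrow = [(0.6, flags_factory), (0.4, flags_factory)]

# Large uncertainty: one candidate (loosely modeled on the negative
# utilitarian example) flags everything except a single action that is
# not even on the menu.
broad = [
    (0.4, flags_factory),
    (0.3, flags_factory),
    (0.3, lambda a: a != "halt technological progress"),
]

print(allowed_actions(actions, narrow))  # ['wait', 'run small experiment']
print(allowed_actions(actions, broad))   # []
```

With narrow uncertainty the filter leaves useful options; with broad uncertainty it rules out everything, which is the failure mode described above.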
BTW, here is a section from a draft post that I’m working on. Do you think it would be easy to solve or avoid all of these problems? (This post isn’t specifically addressing Paul’s approach so some of them may be easy to solve under his approach but I don’t think all of them are.)
How to prevent “aligned” AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even “aligned” AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can’t keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic “power corrupts” problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.
(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to avoid or safely handle them.)
I’m less optimistic about defending values against manipulation, because you probably need to be very competent for that, and you can’t take your time to become more competent, but that seems like a further-away problem to me and so less urgent.
Why is that a further-away problem? Even if it is, we still need people to work on them now, if only to generate persuasive evidence in case they really are very hard problems so we can pursue some other strategy to avoid them like stopping or delaying the development of advanced AI as much as possible.
How to prevent “aligned” AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even “aligned” AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can’t keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic “power corrupts” problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.
My position on this (that might be clear from previous discussions):
I agree this is a real problem.
From a technical perspective, I think this is even further from the alignment problem (than other AI safety problems), so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)
I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.” New technology might change things in ways we don’t endorse. Natural pressures like death may lead to changes we don’t endorse (though that’s a tricky values call). The constraint of remaining economically/militarily competitive could also force our values to evolve in a bad way (alignment is an instance of that problem, and eventually AI+alignment would address the other natural instance by decoupling human values from the competence needed to remain competitive). And of course there is a hard problem in that we don’t know how to deliberate/reflect. The “figure out how to deliberate” problem seems like it is relatively easily postponed, since you don’t have to solve it until you are doing deliberation, but the “help people avoid errors in deliberation” may be more urgent.
The reason I consider alignment more urgent is entirely quantitative and very empirically contingent; I don’t think there is any simple argument against. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future, leading to 0.3%/year of losses from alignment, and right now it looks reasonably tractable. Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
I don’t think I’m likely to work on this problem unless I become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe). I feel like I’ve already poked at it enough that the value of information (VOI) from more poking is lower than just charging ahead on alignment. But that is a stronger judgment than the last section, and I think is largely due to comparative advantage considerations, and I would certainly be supportive of work on this topic (e.g. would be happy to fund, would engage with it, etc.)
This is a leading contender for what I would do if alignment seemed unappealing, though I think that broader institutional improvement / capability enhancement / etc. seems more appealing. I’d definitely spend more time thinking about it.
I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical while it’s not obvious whether it will accelerate solutions. I don’t think this is particularly important, but it does make me feel even more comfortable with the naming issue: this isn’t really a problem about AI at all; it’s just one of many issues that is modulated by AI.
I think the main way AI is relevant to the cost-effectiveness analysis of shaping-the-evolution-of-values is that it may decrease the amount of work that can be done on these problems between now and when they become serious (if AI is effectively accelerating the timeline for catastrophic value change without accelerating work on making values evolve in a way we’d endorse).
To the extent that the value of working on these problems is dominated by that scenario (“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”), then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone has a much better understanding, then a development of a better shared understanding), (b) I really want a word other than “alignment,” and probably multiple words. I guess the one that feels most urgently-unnamed right now is something like: understanding how values evolve and what features may introduce that evolution in a way we don’t endorse, including social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
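For readers who want to check the back-of-the-envelope figure in the quantitative point above, here is a small Python sketch of the arithmetic. It assumes the “>1/3” is taken at its lower bound and that the expected loss is spread evenly over the 20-year horizon; both simplifications, and all the variable names, are assumptions of this sketch and are not spelled out in the comment itself.

```python
# Inputs taken from the estimate above.
p_superhuman = 1 / 3   # ">1/3 chance" of solidly superhuman AI, at its lower bound
horizon_years = 20     # "within 20 subjective years"
value_lost = 0.20      # "maybe 20% of the total value of the future"

# Expected fraction of future value lost, then spread evenly per year
# (the even spread is an assumption of this sketch).
expected_loss = p_superhuman * value_lost
loss_per_year = expected_loss / horizon_years

print(f"expected loss: {expected_loss:.1%}")       # expected loss: 6.7%
print(f"loss per year: {loss_per_year:.2%}/year")  # loss per year: 0.33%/year
```

This reproduces the roughly 0.3%/year figure. A comparable estimate for value drift would need its own probability, severity, and horizon, none of which are given in this thread.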
I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.”
This statement of the problem seems to assume a subjectivist or anti-realist view of metaethics (items 4 or 5 on this list). Consider the analogous statement, “mathematicians’ beliefs about mathematical statements will evolve over time, and they have preferences about how their beliefs evolve”. I think a lot of mathematicians would object to that and instead say that they prefer to have true beliefs about mathematics, and their “preferences about how their beliefs evolve” are just their best guesses about how to arrive at true beliefs.
Assuming you agree that we can’t be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be. It implies that instead of the AI (and indirectly the AI designer) potentially having (if a realist or relativist metaethical position is correct) an obligation/opportunity to help the user figure out what their true or normative values are, which may involve solving difficult metaethical and other philosophical questions, the AI can just follow the user’s preferences about how their values evolve.
Additionally, this framing makes the potential consequences of failing to solve the problem sound less serious than they could be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?” and would consider this an instance of the more general problem “When considering AI safety, it’s not safe to assume that the human user/operator/supervisor is a generally safe agent.”
Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you. Even if the total risk of “value corruption” is 10x smaller, it seems like the marginal impact of an additional researcher on “value corruption” could be higher given that there are now about 20(?) full-time researchers working mostly on AI motivation but zero on this problem (as far as I know), and then we also have to consider the effect of a marginal researcher on the future growth of each field, and future effects on public opinion and policy makers. Unfortunately, I don’t know how to calculate these things even in a back-of-the-envelope way. As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
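The rule of thumb above is simple enough to write down directly. The following Python sketch only restates it; the function name `proportional_headcount` is invented here, and the figure of 20 researchers is the rough guess given above, not a measured number.

```python
def proportional_headcount(reference_headcount: float, risk_ratio: float) -> float:
    """Headcount for a risk that is `risk_ratio` times smaller than the reference risk."""
    return reference_headcount / risk_ratio


researchers_on_motivation = 20  # the "about 20(?)" estimate from the comment

# If "value corruption" is judged 10x smaller than the risk from bad motivation:
print(proportional_headcount(researchers_on_motivation, 10))  # 2.0

# If the two risks are judged comparable (ratio of 1):
print(proportional_headcount(researchers_on_motivation, 1))   # 20.0
```

The rule ignores marginal returns, field growth, and effects on policy, which the paragraph above already flags as hard to estimate.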
I don’t think I’m likely to work on this problem unless I either become much more pessimistic about working on alignment
I see no reason to convince you personally to work on “value corruption” since your intuition on the relative severity of the risks is so different from mine, and under either of our views we obviously still need people to work on motivation / alignment-in-your-sense. I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors.
To the extent that the value of working on these problems is dominated by that scenario (“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”), I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
This seems totally reasonable to me, but 1) others may have other ideas about how to intervene on this problem, and 2) even within factored cognition or debate there are probably research directions that skew towards being more applicable to motivation and research directions that skew towards being more applicable to “value corruption” and I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you.
If you think this risk is very large, presumably there is some positive argument for why it’s so large? That seems like the most natural way to run the argument. I agree it’s not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.
In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: “(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don’t have sufficiently good understanding.” (Each of those claims obviously requires more argument, etc.)
One version of the case for worrying about value corruption would be:
It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
It may be that historical variation is itself problematic, and we care mostly about our particular values.
Or it may be that values are “hardened” against certain kinds of environment shift that occur in nature, and that they will go to some lower “default” level of robustness under new kinds of shifts.
Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
If so, the rate of change in subjective time could be reasonably high—perhaps the change that occurs within one generation could shift value far enough to reduce value by 50% (if that change wasn’t endorsed for decision-theoretic reasons / hardened against).
It’s plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.
This kind of story is kind of conjunctive, so I’d expect to explore a few lines of argument like this, and then try to figure out what are the most important underlying uncertainties (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).
My most basic concerns with this story are things like:
In “well-controlled” situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people’s consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We’ve talked a little bit about this, and you’ve pointed out some reasons this kind of scheme isn’t totally satisfactory even if it works as intended, but quantitatively the reasons you’ve pointed to don’t seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
In most practical situations, it doesn’t seem like “understanding of how to avert drift” is the key bottleneck to averting drift—it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve. That’s still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.
In the end I’m doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that’s how I get to the intuitive view that it’s maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it’s higher impact though.
As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other factors not captured in the total quantity of risk. Maybe I’d expect us to have 2-10x more resources per unit risk devoted to more tractable risks.
In this case I’d be happy with the recommendation of ~10x more people working on motivation than on value drift, that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.
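The arithmetic behind this rule of thumb can be written out as a rough sketch. The function and parameter names below are hypothetical, the headcount of 20 and the 10x risk ratio are the figures floated earlier in the thread, and treating value drift as the less tractable risk (so that the 2-10x factor counts against it) is an assumption of the sketch rather than something settled here.

```python
def proportional_headcount(reference_headcount, risk_ratio, tractability_factor=1.0):
    """Headcount a smaller risk would get if staffing scaled with risk size.

    reference_headcount: people working on the larger (reference) risk.
    risk_ratio: how many times bigger the reference risk is than the other risk.
    tractability_factor: hypothetical multiplier (>= 1) on resources per unit
        of risk that the more tractable reference risk receives.
    """
    if reference_headcount < 0:
        raise ValueError("reference_headcount must be non-negative")
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    if tractability_factor < 1:
        raise ValueError("tractability_factor must be at least 1")
    # Scale down by relative risk first, then by the tractability adjustment.
    return reference_headcount / risk_ratio / tractability_factor


# Figures from the thread: about 20 researchers on motivation, risk 10x smaller.
plain_rule = proportional_headcount(20, 10)
# Assumed range for the tractability adjustment (2x to 10x).
with_mild_adjustment = proportional_headcount(20, 10, tractability_factor=2)
with_strong_adjustment = proportional_headcount(20, 10, tractability_factor=10)

print("plain rule of thumb:", plain_rule)
print("with 2x tractability adjustment:", with_mild_adjustment)
print("with 10x tractability adjustment:", with_strong_adjustment)
```

This only captures the static ratio; it says nothing about marginal returns, field growth, or effects on public opinion, which are the parts neither of us knows how to estimate.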
I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors. [...] I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
I agree I should be more careful about this.
I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I’m generally inclined to express my views), but could hedge more when making statements like this.
(I think saying “X is more urgent than Y” is basically compatible with the view “There should be 10 people working on X for each person working on Y,” even if one also believes “but actually on the current margin investment in Y might be a better deal.” Will edit the post to be a bit softer here though.
ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)
If you think this risk is very large, presumably there is some positive argument for why it’s so large?
Yeah, I didn’t literally mean that I don’t have any arguments, but rather that we’ve discussed it in the past and it seems like we didn’t get close to resolving our disagreement. I tend to think that Aumann Agreement doesn’t apply to humans, and it’s fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don’t think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there are too many people working on this problem, which might never actually happen).
it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff.
I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we’re not sure about, and another one that is defined to be a specific urgent problem? Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Yes:
The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
The word “alignment” was chosen (originally by Stuart Russell I think) precisely because it is such a good name for the problem of aligning AI values with human values; it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said
Some of the main things I want from a term are:
A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.
[…] [“AI safety” or “beneficial AI”] doesn’t work so well for A—it’s commonly used to include things like misuse risk.
[continuing last point] The proposed usage of “alignment” doesn’t meet this desideratum though, it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
People have introduced a lot of terms and change terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless, it really does make things more and more confusing.
I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), (c) is generally less optimized (related to the second point above, “alignment” is quite a good term for this area).
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””)
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?
I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search) but I can certainly see how it muddies the waters from your perspective.
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed
This seems fine to me, if you could give the benefit of the doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
I think it’s reasonable for me to be more careful about clarifying what any particular research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined, I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
(They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I guess both “reliable” and “beneficial” are matters of degree so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned and the latter might actually be less aligned if H has a wrong idea of metaethics or what his true/normative values are and as a result trying to figure out and satisfy those values is not something that H wants A to do.
I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined, I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
That seems good too.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
“Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment).
In what sense is that a more beneficial goal?
“Successfully do X” seems to be the same goal as X, isn’t it?
“Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
(This is why I wrote:
“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it which suggests more of a de re reading. To use your example “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would call an A that correctly understands H’s preferences (and gets oranges) more reliably pursuing beneficial goals.
Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
If that’s right, how is this generalization defined? (E.g. The CEV was “what H wants in the limit of infinite intelligence, reasoning time and complete information”.)
I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms to “alignment” because they seemed to be too broad a definition, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
People not caring about X prima facie decreases the returns to research on X, but may increase the returns to advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
People not caring about X prima facie decreases the returns to research on X, but may increase the returns to advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figuring out what their true or normative values are, or protecting their values against manipulation), than to convince all the potential users to want that themselves.
If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
I think that both
(a) Trying to have influence over aspects of value change that people don’t much care about, and
(b) better understanding the important processes driving changes in values
are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
(I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
Trying to have influence over aspects of value change that people don’t much care about … [is] reasonable … to do to make the future better
This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, one that hasn’t experienced any value change and that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) are what’s relevant.
It’s agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).
Assuming you agree that we can’t be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be.
I don’t see why the anti-realist version is any easier: my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realist mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more potential prospect for error-correction, but I don’t think it’s a big difference.
Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than they could potentially be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
It doesn’t sound that way to me, but I’m happy to avoid framings that might give people the wrong idea.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?”
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
But in terms of the actual meanings rather than their impacts on people, I’d be about as happy with “avoiding corruption of values” as “having our values evolve in a positive way.” I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.
I don’t see why the anti-realist version is any easier
It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve. But you’re right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed also does not convey a proper amount of uncertainty in its difficulty.
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
That’s a good point that I hadn’t thought of. (I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I’d highly welcome that.
It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve.
That may be a connotation of the “preferences about how their values evolve,” but doesn’t seem like it follows from the anti-realist position.
I have preferences over what actions my robot takes. Yet if you asked me “what action do you want the robot to take?” I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don’t know). My preferences over value evolution can be similar.
Indeed, if moral realists are right, “ultimately converge to the truth” is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing “help people’s preferences evolve in the way they want them to evolve.”) Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it’s easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).
I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.
Would you agree with this way of stating it: There are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.
Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn’t be a precisely analogous error on the non-realist perspective?
There is some uninteresting semantic sense in which there are “more ways to be wrong” (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.
I might be using the word “values” in a different way than you are. I think I can say something like “I’d like to deliberate in way X” and be wrong. I guess under non-realism I’m “incorrectly stating my preferences” and under realism I could be “correctly stating my preferences but be wrong,” but I don’t see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.
Suppose the user says “I want to try to figure out my true/normative values by doing X. Please help me do that.” If moral anti-realism is true, then the AI can only check if the user really wants to do X (e.g., by looking into the user’s brain and checking if X is encoded as a preference somewhere). But if moral realism is true, the AI could also use its own understanding of metaethics and metaphilosophy to predict if doing X would reliably lead to the user’s true/normative values, and warn the user or refuse to help or take some other action if the answer is no. Or if one can’t be certain about metaethics yet, and it looks like X might prematurely lock the user into the wrong values, the AI could warn the user about that.
I definitely don’t mean such a narrow sense of “want my values to evolve.” Seems worth using some language to clarify that.
In general the three options seem to be:
You care about what is “good” in the realist sense.
You care about what the user “actually wants” in some idealized sense.
You care about what the user “currently wants” in some narrow sense.
It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you’d be in the first case if you accepted realism, it seems like you’d probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism: your views could depend in an arbitrary way on whether realism is true or false, depending on what versions of realism/non-realism are competing. But I’m assuming something like the most common realist and non-realist views around here.)
To defend my original usage both in this thread and in the OP, which I’m not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted—the usual English usage of these terms involves at least mild idealization.
Yes, that seems easy, but actually helping seems much harder.
Longer form of my opinion:
Metaphilosophy is hard, and we need to solve it eventually. This might happen by default, i.e. if we simply build a well-motivated AI without thinking about metaphilosophy and without running any social interventions designed to get the AI’s operators to think about metaphilosophy, humanity might still realize that metaphilosophy needs to be solved, and then go ahead and solve it. I’m quite unsure right now whether or not it will happen by default.
However, in the world where the AI’s operators don’t agree that we need to solve metaphilosophy, I am very pessimistic about the AI realizing that it should help us with metaphilosophy and doing so. The one way I could imagine it happening is by programming in the right utility function (not even learning it, since if you learn it then you probably learn that metaphilosophy doesn’t need to be solved), which seems hopelessly doomed. It seems really hard to make an AI system where you can predict in advance that it will help us solve metaphilosophy regardless of the operator’s wishes.
In the world where the AI’s operators do agree that we need to solve metaphilosophy, I think we’re in a much better position. A background assumption I have is that humans motivated to solve metaphilosophy will be able to do so given enough time. I share Paul’s intuition that humans who no longer have to worry about food, water, shelter, disease, etc. could deliberate for a long time and make progress. In that case, a well-motivated AI would be fine: it would stay deferential, perhaps learn more things in order to be more competent, and do the things we ask it to do, which might include helping us in our deliberation by bringing up arguments we hadn’t considered yet. (And note a well-motivated AI should only bring up arguments it believes are true, or likely to be true.)
I’ve laid out two extreme ways the world could be, and of course there’s a spectrum between them. But thinking about the extremes makes me think of this not as a part of AI alignment, but as a social coordination problem: we need to have humanity (especially the AI’s operators) agree that metaphilosophy is hard and needs to be solved. I’d support interventions that make this more likely, eg. more public writing that talks about what we do after AGI, or about the possibility of a Great Deliberation before using the cosmic endowment, etc. If we succeed at that and at building a well-motivated AI system, I think that would be sufficient.
How do you determine what is “high-impact” before you have a utility function? Even “reversible” is relative to a utility function, right? It doesn’t mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?
I mean something more like “don’t do things that a human wouldn’t do, that seem crazy from a human perspective”. I’m not suggesting that the AI has a perfect understanding of what “irreversible” and “high-impact” mean. But it should be able to predict what things a human would find crazy for which it should probably get the human’s approval before doing the thing. (As an analogy, most employees have a sense of what it is okay for them to take initiative on, vs. what they should get their manager’s approval for.)
For example if I were a negative utilitarian I perhaps ought to spend all my resources trying to stop technological progress leading to space colonization, so anything that I do besides that would be “irreversible high-impact” unless I could go back in time and change my resource allocation.
Yeah, I more mean something like “continuation of the status quo” rather than “irreversible high-impact”, as TurnTrout talks about below.
Do you think it would be easy to solve or avoid all of these problems?
I am not sure. I think it is relatively easy to look back at how we have responded to similar events in the past and notice that something is amiss—for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened, or that many humans don’t like it when you take advantage of their motivational systems, and so to at least not be confident in the actions you mention. On the other hand, there may be similar types of events in the future that we can’t back out by looking at the past. I don’t know how to deal with these sorts of unknown unknowns.
I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of safety techniques we develop, and so in the short term to avoid these problems you want to intervene on the humans who are deploying AI systems.
Why is that a further-away problem? Even if it is, we still need people to work on them now, if only to generate persuasive evidence in case they really are very hard problems so we can pursue some other strategy to avoid them like stopping or delaying the development of advanced AI as much as possible.
Yeah, looking back I don’t like that reason. I think I had an intuition that it wasn’t an urgent problem and wanted to jot a quick sentence to that effect, and the sentence came out wrong.
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn’t much opportunity for us to be manipulated. You might have the intuition that even one unaligned AI could successfully manipulate everyone’s values, and so we would still need the aligned AI systems to be able to defend against that. I’m not sure where I stand on that—it seems possible to me that this is just very hard to do, especially when there are aligned superintelligent systems that would by default put a stop to it if they find out about it.
But really I’m just confused on this topic and would need to think more about it.
we need to have humanity (especially the AI’s operators) agree that metaphilosophy is hard and needs to be solved
I’m not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?
But it should be able to predict what things a human would find crazy for which it should probably get the human’s approval before doing the thing
Think of the human as a really badly designed AI with a convoluted architecture that nobody understands, spaghetti code, full of security holes, has no idea what its terminal values are and is really confused even about its “interim” values, has all kinds of potential safety problems like not being robust to distributional shifts, and is only “safe” in the sense of having passed certain tests for a very narrow distribution of inputs.
Clearly it’s not safe for a much more powerful outer AI to query the human about arbitrary actions that it’s considering, right? Instead, if the human is to contribute anything at all to safety in this situation, the outer AI has to figure out how to generate a bunch of smaller queries that the human can safely handle, from which it would then infer what the human would say if it could safely consider the actual choice under consideration. If the AI is bad at this “competence” problem it could send unsafe queries to the human and corrupt the human, and/or infer the wrong thing about what the human would approve of.
Is it clearer now why this doesn’t seem like an easy problem to me?
for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened
I’m not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us? It seems far from easy to do this in a robust way. I mean this classifier would be facing lots of unpredictable distributional shifts… I guess you made a similar point when you said “On the other hand, there may be similar types of events in the future that we can’t back out by looking at the past.”
ETA: Do you expect that different AIs would do different things in this regard depending on how cautious their operators are? Like some AIs would learn from their operators to be really cautious, and restrict technologies/choices that it isn’t sure won’t corrupt humans, but other operators and their AIs won’t be so cautious so a bunch of humans will be corrupted as a result, but that’s a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn’t very high? (This is my current understanding of Paul’s position, and I wonder if you have a different position or a different way of putting it that would convince me more.) What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers? What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?
I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of safety techniques we develop, and so in the short term to avoid these problems you want to intervene on the humans who are deploying AI systems.
This seems right.
We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn’t much opportunity for us to be manipulated.
Manipulation doesn’t have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won’t stop Alice from using it to manipulate Bob.
ETA: I forgot to mention that I don’t understand this part, can you please explain more:
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
I’m not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?
I don’t know, I want to outsource that decision to humans + AI at the time where it is relevant. Perhaps it involves stopping technological development. Perhaps it means continuing technological development, but not doing any space colonization. My point is simply that if humans agree that metaphilosophy needs to be solved, and the AI is trying to help humans, then metaphilosophy will probably be solved, even if I don’t know how exactly it will happen.
Is it clearer now why this doesn’t seem like an easy problem to me?
Yes. It seems to me like you’re considering the case where a human has to be able to give the correct answer to any question of the form “is this action a good thing to do?” I’m claiming that we could instead grow the set of things the AI does gradually, to give time for humans to figure out what it is they want. So I was imagining that humans would answer the AI’s questions in a frame where they have a lot of risk aversion, so anything that seemed particularly impactful would require a lot of deliberation before being approved.
I’m not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us?
I was thinking more of the case where a single human amassed a lot of power. Humans haven’t seemed to solve the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.
Do you expect that different AIs would do different things in this regard depending on how cautious their operators are?
Yes.
that’s a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn’t very high?
Kind of? I’d amend that slightly to say that to the extent that I think it is a problem (I’m not sure), I want to solve it in some way that is not technical research. (Possibilities: convince everyone to be cautious, obtain a decisive strategic advantage and enforce that everyone is cautious.)
What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers?
Same as above.
Manipulation doesn’t have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won’t stop Alice from using it to manipulate Bob.
Same as above. All of these problems that you’re talking about would also apply to technology that could make a human smarter. It seems like it would be easiest to address on that level, rather than trying to build an AI system that can deal with these problems even though the operator would not want it to correct for the problem.
What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?
This seems like an empirical fact that makes the problems listed above harder to solve.
I forgot to mention that I don’t understand this part, can you please explain more:
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
So I broadly agree with Paul’s reasons for aiming for competitiveness. Given competitiveness, you might hope that we would automatically get defense against value manipulation by other AIs, since our aligned AI will defend us from value manipulation by similarly-capable unaligned AIs (or aligned AIs that other people have). Of course, defense might be a lot harder than offense, and you probably do think that, in which case this doesn’t really help us. (As I said, I haven’t really thought about this before.)
Overall view: I don’t think that the problems you’ve mentioned are obviously going to be solved as a part of AI alignment. I think that solving them will require mostly interventions on humans, not on the development of AI. I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result. If I were substantially more pessimistic, I would put more effort into strategy and governance issues. (Not sure I would change what I’m doing given my comparative advantage at technical research, but it would at least change what I advise other people to do.)
Meta-view on our disagreement: I suspect that you have been talking about the problem of “making the future go well” while I’ve been talking about the problem of “getting AIs to do what we want” (which do seem like different problems to me). Most of the problems you’ve been talking about don’t even make it into the bucket of “getting AIs to do what we want” the way I think about it, so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying. I think we do disagree on how important the problems you identify are, but not as much as you would think, since I’m quite uncertain about this area of problem-space.
I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result.
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
It seems to me that coordination is really hard. Yes we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example if those problems are very difficult to solve so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’m quite uncertain about this area of problem-space
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
Two reasons come to mind:
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you’re going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we’re going to have lives of leisure due to automation, or to delay full automation if we’re soon going to get 4 hour workdays. Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try and direct these efforts at the right target.
I want to emphasize though that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.
It seems to me that coordination is really hard. Yes we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator’s wishes-upon-reflection. (If your technical solution is in line with the operator’s wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator’s wishes-upon-reflection?) as well as hard to implement (why would the operator use a system that’s going to do something they don’t want?).
You might argue that there are things that the operator would want if they could get it (eg. global coordination), but they can’t achieve it now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.
Aside from that, I think it’s also really important to better predict/understand just how difficult those problems are to solve (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are so difficult to solve that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
Agreed. I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers. On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’ll keep that in mind. When I wrote the original comment, I wasn’t even thinking about problems like the ones you mention, because I categorize them as “strategy” by default, and I was trying to talk about the technical problem.
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Do you think that at the time when AI development wasn’t an already-running process, and AI was still a new thing that the public could be expected to be risk-averse about (when would you say that was?), the argument “working on alignment isn’t urgent because humans can probably coordinate to stop AI development” would have been a good one?
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves.
Same question here. Back when “don’t develop AI” was still a matter of binding our future selves, should we have expected that we would coordinate to stop AI development, and it’s just bad luck that we haven’t succeeded in doing that?
Looking at the things governments and corporations say, it seems like they would be likely to do things like this.
Can you be more specific? What global agreement do you think would be reached, that is both realistic and would solve the kinds of problems that I’m worried about (e.g., unintentional corruption of humans by “aligned” AIs who give humans too much power or options that they can’t handle, and deliberate manipulation of humans by unaligned AIs or AIs aligned to other users)?
I think it would help me if you suggested some ways that technical solutions could help with these problems.
For example, create an AI that can help the user with philosophical questions at least as much as technical questions. (This could be done for example by figuring out how to better use Iterated Amplification to answer philosophical questions, or how to do imitation learning of human philosophers, or how to apply inverse reinforcement learning to philosophical reasoning.) Then the user could ask questions like “Am I likely to be corrupted by access to this technology? What can I do to prevent that while still taking advantage of it?” Or “Is this just an extremely persuasive attempt at manipulation or an actually good moral argument?”
As another example, solve metaethics and build that into the AI so that the AI can figure out or learn the actual terminal values of the user, which would make it easier to protect the user from manipulation and self-corruption. And even if the human user is corrupted, the AI still has the correct utility function, and when it has made enough technological progress it can uncorrupt the human.
I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers.
Can you point me to any relevant results that have been written down, or explain what you learned from those conversations?
On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
To address this and the question (from the parallel thread) of whether you should personally work on this, I think we need people to either solve the technical problems or at least to collectively try hard enough to convincingly say that it’s too difficult to do. (Otherwise who is going to convince policymakers to adopt the very costly social solutions? Who is going to convince people to start/join a social movement to influence policymakers to consider those costly social solutions? The fact that those things tend to take a lot of time seems like sufficient reason for urgency on the technical side, even if you expect the social solutions to be feasible.) Who are these people going to be, especially the first ones to join the field and help grow it? Probably existing AI alignment researchers, right? (I can probably make stronger arguments in this direction but I don’t want to be too “pushy” so I’ll stop here.)
I forgot to followup on this important part of our discussion:
All of these problems that you’re talking about would also apply to technology that could make a human smarter.
It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I’m talking about (which are largely caused by technological progress outpacing philosophical/moral progress). I could make some arguments about this, but I’m curious if this doesn’t seem obvious to you.
Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it’s a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this (especially with regard to the particular problems that I’m pointing out).
It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I’m talking about (which are largely caused by technological progress outpacing philosophical/moral progress).
Yes, I agree with this. The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.
Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it’s a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this.
This seems true. Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.
This doesn’t make much sense to me. Why is this any kind of reason to expect that solutions are likely to come from outside of AI? Can you give me an analogy where this kind of reasoning more obviously makes sense?
Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
Right, this argument wasn’t targeted to you, but I think there are other reasons for you to personally prioritize this. See my comment in the parallel thread.
It seems to me that “avoid irreversible high-impact actions” would only work if one had a small amount of uncertainty over one’s utility function, in which case you could just avoid actions that are considered “irreversible high-impact” by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just had very little idea what your utility function looks like, that doesn’t work because almost any action could be “irreversible high-impact”.
From the AUP perspective, this only seems true in a way analogous to the statement that “any hypothesis can have arbitrarily long description length”. It’s possible to make practically no assumptions about what the true utility function is and still recover a sensible notion of “low impact”. That is, penalizing shifts in attainable utility for even random or simple functions still yields the desired behavior; I have experimental results to this effect which aren’t yet published. This suggests that the notion of impact captured by AUP isn’t dependent on realizability of the true utility, and hence the broader thing Rohin is pointing at should be doable.
While it’s true that some complex value loss is likely to occur when not considering an appropriate distribution over extremely complicated utility functions, it seems by-and-large negligible. This is because such loss occurs either as a continuation of the status quo or as a consequence of something objectively mild, which seems to correlate strongly with reasonably human-values mild.
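A minimal toy sketch of “penalizing shifts in attainable utility” for simple auxiliary functions follows. It is not the unpublished experiment mentioned above. The three-state world, the indicator utilities, the discount factor, and the names `step`, `attainable_q` and `aup_penalty` are all assumptions made for illustration. The penalty used here is the mean absolute change in attainable utility relative to doing nothing.

```python
# Toy world (hypothetical): A and B are mutually reachable, X is absorbing.
GAMMA = 0.9
STATES = ["A", "B", "X"]
ACTIONS = ["noop", "move", "break"]


def step(state, action):
    """Deterministic transitions; 'break' is the only irreversible action."""
    if state == "X" or action == "break":
        return "X"
    if action == "move":
        return "B" if state == "A" else "A"
    return state


def attainable_q(utility, iters=500):
    """Q_u(s, a): best discounted utility attainable for u after taking a in s."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):  # plain value iteration
        v = {
            s: max(utility(step(s, a)) + GAMMA * v[step(s, a)] for a in ACTIONS)
            for s in STATES
        }
    return {
        (s, a): utility(step(s, a)) + GAMMA * v[step(s, a)]
        for s in STATES
        for a in ACTIONS
    }


# "Simple" auxiliary utilities: one indicator per state. None of them is
# assumed to be the true utility function.
AUX = [lambda s, target=target: 1.0 if s == target else 0.0 for target in STATES]
Q_TABLES = [attainable_q(u) for u in AUX]


def aup_penalty(state, action):
    """Mean absolute shift in attainable utility compared with 'noop'."""
    shifts = [abs(q[(state, action)] - q[(state, "noop")]) for q in Q_TABLES]
    return sum(shifts) / len(shifts)


for action in ACTIONS:
    print(action, round(aup_penalty("A", action), 3))
```

In this toy, the irreversible action receives a much larger penalty than the reversible one even though no auxiliary utility encodes anything about what the operator values. Whether that carries over to realistic environments is exactly what is under discussion in the thread.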
Another con of the motivation-competence decomposition: unlike definition-optimization, it doesn’t actually seem to be a clean decomposition of the larger task, such that we can solve each subtask independently and then combine the solutions.
For example one way we could solve the motivation problem is by building a perfect human imitation (of someone who really wants to help H do what H wants), but then we seem to be stuck on the “competence” front, and there’s no clear way to plug this solution of “motivation” into a better generic solution to “competence” to get a more competent intent-aligned agent. Instead it seems like we have to solve the competence problem that is particular to the specific solution to motivation, or solve motivation and competence together as one large problem.
In contrast, the problem of specifying an aligned utility function and the problem of building safe EU maximizers seem to be naturally independent problems, such that once we have a specification of an aligned utility function (or a method of specifying aligned utility functions), we can just plug that into more and more powerful and robust EU maximizers.
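The “plug in” structure described here can be sketched as an optimizer that takes the utility function as a parameter. This is only an illustration of the interface, under the assumption that options are finite lotteries over outcomes; the names `expected_utility` and `eu_maximizer` and the example outcomes are hypothetical.

```python
from typing import Callable, Dict

Lottery = Dict[str, float]        # outcome -> probability
Utility = Callable[[str], float]  # the "definition" half


def expected_utility(utility: Utility, lottery: Lottery) -> float:
    """Probability-weighted utility of a lottery."""
    return sum(p * utility(outcome) for outcome, p in lottery.items())


def eu_maximizer(utility: Utility, options: Dict[str, Lottery]) -> str:
    """The "optimization" half: knows nothing about where the utility came from."""
    return max(options, key=lambda name: expected_utility(utility, options[name]))


options = {
    "safe": {"ok": 1.0},
    "gamble": {"great": 0.5, "bad": 0.5},
}

# Two different "definitions" plugged into the same optimizer.
mild = {"ok": 1.0, "great": 3.0, "bad": 0.0}
cautious = {"ok": 1.0, "great": 3.0, "bad": -5.0}

print(eu_maximizer(mild.__getitem__, options))      # gamble (EU 1.5 vs 1.0)
print(eu_maximizer(cautious.__getitem__, options))  # safe (EU -1.0 vs 1.0)
```

The optimizer can be swapped for a stronger one without touching the utility, and vice versa, which is the independence being claimed for definition-optimization.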
Furthermore I think this lack of clean decomposition shows up at the conceptual level too, not just the pragmatic level. For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn’t very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both? It seems arguable or hard to say. In contrast, in a system that is built using the definition-optimization decomposition, it seems like it would be easy to trace any safety failures to either the “definition” solution or the “optimization” solution.
I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can’t talk coherently about their “motivation”. Probably all deep-learning-based systems fall into this category.
I also agree that (at least for now, and probably in the future as well) you can’t formally specify the “type signature” of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem.
My hope here would be to solve the motivation problem and leave the competence problem for later, since by my view that solves most of the problem (I’m aware that you disagree with this).
I don’t agree that it’s not clean at the conceptual level. It’s perhaps less clean than the definition-optimization decomposition, but not much less.
For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn’t very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both?
This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
It also seems like a failure of motivation though, because as soon as the Oracle started to do malign optimization, the system as a whole is no longer trying to do what H wants.
Or is the idea that as long as the top-level or initial optimizer is trying (or tried) to do what H wants, then all subsequent failures of motivation don’t count, so we’re excluding problems like inner alignment from motivation / intent alignment?
I’m unsure what your answer would be, and what Paul’s answer would be, and whether they would be the same, which at least suggests that the concepts haven’t been cleanly decomposed yet.
ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)? It seems really counterintuitive if the answer is “no”.
Oh, I see, you’re talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I’d say it fails motivation, mostly because the system doesn’t really have a single “motivation”).
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)?
I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn’t apply this framework to the system (and I also wouldn’t apply definition-optimization to the system).
That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it
This was an unexpected answer. Isn’t HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn’t what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn’t apply to IDA either. I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
Even a very theoretically simple system like AIXI doesn’t seem to be “trying” to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to “know” that its actions won’t lead to reward.
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I’ve mostly stopped working along these lines because it no longer seems directly useful.)
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I agree.
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we’ve had a similar discussion before and either it didn’t get resolved or I didn’t understand your position. I didn’t see a direct attempt to answer this in the comment I’m replying to, and it’s fine if you don’t want to go down this road again but I want to convey my continued confusion.)
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
I don’t understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to say Rohin’s. Not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn’t realize.)
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?
The oracle is not aligned when asked questions that cause it to do malign optimization.
The human+oracle system is not aligned in situations where the human would pose such questions.
For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use benign when talking about possibly-incoherent systems, or things that don’t even resemble optimizers.
The definition in this post is a bit sloppy here, but I’m usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and want to admit vagueness in “what H wants it to do” (such that there can be several different preferences that are “what H wants”) we could say something like:
A is aligned with H if everything it is trying to do is “what H wants.”
That’s not great either though (and I think the original post is more at an appropriate level of attempted-precision).
(In the following I will also use “aligned” to mean “intent aligned”.)
The human+oracle system is not aligned in situations where the human would pose such questions.
Ok, sounds like “intent aligned at some points in time and not at others” was the closest guess. To confirm, would you endorse “the system was aligned when the human imitation was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question”?
Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the “intent alignment problem” has been solved for an AI, or when would you call an AI (such as IDA) itself “intent aligned”? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use “intent alignment” you always have some specific situation or set of situations in mind?
Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:
Isn’t HCH also such a multiagent system?
Yes, I shouldn’t have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is “trying to do”, i.e. I wouldn’t say it has a single “motivation”. This allows you to say “the system is not intent-aligned”, even though you can’t say “the system is trying to do X”.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.
Also, I want to note strong agreement with this:
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent.
HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) “What is a good approximation of the user’s utility function?” followed by “What action would maximize EU according to this utility function?”
ETA: If this isn’t clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.
Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause eg. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
Humans seem to solve the motivation subproblem, whereas humans don’t seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can’t really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
It is easier to apply to systems without much capability, though as the post notes, it probably still does need to have some level of capability. While a digit recognition system is useful, it doesn’t seem meaningful to talk about whether it is “trying” to help us.
Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart’s Law suggests that you can get very bad outcomes. With motivation-competence, I’ve already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what “slightly worse” means for motivation, but I’m optimistic.
We’ve been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we’ve found a lot of negative results and not very many positive ones.
The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).
The cons are:
It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
There hasn’t been much work done in this paradigm, so it is not obvious that there is progress to make.
I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future. ETA: Written up in the chapter on Goals vs Utility Functions in the Value Learning sequence, particularly in Coherence arguments do not imply goal-directed behavior.
This is a great comment, and maybe it should even be its own post. It clarified a bunch of things for me, and I think was the best concise argument for “we should try to build something that doesn’t look like an expected utility maximizer” that I’ve read so far.
Thanks! The hope is to write something a bit more comprehensive that expands on many of these points, which would be its own post (or sequence).
I agree with habryka that this is a really good explanation. I also agree with most of your pros and cons, but for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of “AI alignment” and into the “competence” part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.
In other words, the motivation-competence decomposition seems potentially very useful to me as a way to break down a larger problem into smaller parts so it can be solved more easily, but I don’t agree that the urgent/not-urgent divide lines up neatly with the motivation/competence divide.
Aside from the practical issue of confusion between different usages of “AI alignment” (I think others like MIRI had been using “AI alignment” in a broader sense before Paul came up with his narrower definition), even using “AI alignment” in a context where it’s clear that I’m using Paul’s definition gives me the feeling that I’m implicitly agreeing to his understanding of how various subproblems should be prioritized.
I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using “AI alignment” to refer to what Bostrom calls the “second principal-agent problem” (he objected to my use of “control”). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with that: if the agent is trying to do what the principal wants, it seems like you’ve solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.
In practice, essentially all of MIRI’s work seems to fit within this narrower definition, so I’m not too concerned at the moment with this practical issue (I don’t know of any work MIRI feels strongly about that doesn’t fit in this definition). We had a thread about this after it came up on LW in April, where we kind of decided to stick with something like “either make the AI trying to do the right thing, or somehow cope with the problems introduced by it trying to do the wrong thing” (so including things like boxing), but to mostly not worry too much since in practice basically the same problems are under both categories.
I should have updated this post before it got rerun as part of the sequence.
Note that Arbital defines “AI alignment” as:
and “total alignment” as:
I think this clearly includes the kinds of problems I’m talking about in this thread. Do you agree? Also supporting my view is the history of “Friendliness” being a term that included the problem of better understanding the user’s values (as in CEV) and then MIRI giving up that term in favor of “alignment” as an apparently exact synonym. See this MIRI post which talks about “full alignment problem for fully autonomous AGI systems” and links to Arbital.
I think you may have misunderstood what I meant by “practical issue”. My point was that if you say something like “I think AI alignment is the most urgent problem to work on” the listener could easily misinterpret you as meaning “alignment” in the MIRI/Arbital sense. Or if I say “AI alignment is the most urgent problem to work on” in the MIRI/Arbital sense of alignment, the listener could easily misinterpret me as meaning “alignment” in your sense.
Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don’t know how to do.)
I think MIRI’s first use of this term was here where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’” which is basically the same as my definition. (Perhaps slightly weaker, since “do what the user wants you to do” is just one beneficial goal.) This talk never defines alignment, but the slide introducing the big picture says “Take-home message: We’re afraid it’s going to be technically difficult to point AIs in an intuitively intended direction” which also really suggests it’s about trying to point your AI in the right direction.
The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field “alignment” and then defining it to be “whatever is important” as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing. All of the topics in the “AI alignment” domain (except for mindcrime, which is borderline) fit under the narrower definition; the alignment researchers listed are all people working on the narrower problem.
So I think the way this term is used in practice basically matches this narrower definition.
As I mentioned, I was previously happily using the term “AI control.” Rob Bensinger suggested that I stop using that term and instead use AI alignment, proposing a definition of alignment that seemed fine to me.
I don’t think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn’t seem to be matching up with reality in any particular way, except insofar as it’s capturing the problems that a certain group of people work on. I don’t really see any argument in favor except the historical precedent, which I think is dubious in light of all of the conflicting definitions, the actual usage, and the explicit move to standardize on “alignment” where an alternative definition was proposed.
(In the discussion, the compromise definition suggested was “cope with the fact that the AI is not trying to do what we want it to do, either by aligning incentives or by mitigating the effects of misalignment.”)
Is this intended (/ do you understand this) to include things like “make your AI better at predicting the world,” since we expect that agents who can make better predictions will achieve better outcomes?
If this isn’t included, is that because “sufficiently advanced” includes making good predictions? Or because of the empirical view that ability to predict the world isn’t an important input into producing good outcomes? Or something else?
If this definition doesn’t distinguish alignment from capabilities, then that seems like a non-starter to me which is neither useful nor captures the typical usage.
If this excludes making better prediction because that’s assumed by “sufficiently advanced agent,” then I have all sorts of other questions (does “sufficiently advanced” include all particular empirical knowledge relevant to making the world better? does it include some arbitrary category not explicitly carved out in the definition?)
In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That’s not so different from using the term to capture (say) physics problems that would exist whether or not we built AI; both feel bad to me.
Independently of this issue, it seems like “the kinds of problems you are talking about in this thread” need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).
But the page includes:
which seems to be outside of just “pointing an AI in a direction”
I think so, at least for certain kinds of predictions that seem especially important (i.e., may lead to x-risk if done badly), see this Arbital page which is under AI Alignment:
It seems to me that Rohin’s proposal of distinguishing between “motivation” and “capabilities” is a good one, and then we can keep using “alignment” for the set of broader problems that are in line with the MIRI/Arbital definition and examples.
It seems fine to me to include 1) problems that are greatly exacerbated by AI and 2) problems that aren’t caused by AI but may be best solved/ameliorated by some element of AI design, since these are problems that AI researchers have a responsibility over and/or can potentially contribute to. If there’s a problem that isn’t exacerbated by AI and does not seem likely to have a solution within AI design then I’d not include that.
Sure, agreed.
I think it’s bad to use a definitional move to try to implicitly prioritize or deprioritize research. I think I shouldn’t have written: “I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.”
That said, I do think it’s important that these seem like conceptually different problems and that different people can have different views about their relative importance—I really want to discuss them separately, try to solve them separately, compare their relative values (and separate that from attempts to work on either).
I don’t think it’s obvious that alignment is higher priority than these problems, or than other aspects of safety. I mostly think it’s a useful category to be able to talk about separately. In general I think that it’s good to be able to separate conceptually separate categories, and I care about that particularly much in this case because I care particularly much about this problem. But I also grant that the term has inertia behind it and so choosing its definition is a bit loaded and so someone could object on those grounds even if they bought that it was a useful separation.
(I think that “defending their values against manipulation from other AIs” wasn’t included under any of the definitions of “alignment” proposed by Rob in our email discussion about possible definitions, so it doesn’t seem totally correct to refer to this as “moving” those subproblems, so much as there already existing a mess of imprecise definitions some of which included and some of which excluded those subproblems.)
Yeah, that seems right. I would probably defend the claim that motivation contains the most urgent part in the same way that Paul has done in the past—it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values, and that it should not do irreversible high-impact actions until then. I’m less optimistic about defending values against manipulation, because you probably need to be very competent for that, and you can’t take your time to become more competent, but that seems like a further-away problem to me and so less urgent.
(I don’t think I have much to add over the discussions you and Paul have had in the past, but I’m happy to clarify my opinion if it seems useful to you—perhaps my way of stating things will click where Paul’s way didn’t, idk. Or I might have different opinions and not realize it.)
I would support the idea of treating this simply as a decomposition, without also packing in the implication that motivation/competence corresponds to urgent/not-urgent, though I suspect it is quite hard to do that now.
I would highly welcome that. BTW if you see me argue with Paul in the future (or in the past) and I seem to be not getting something, please feel free to jump in and explain it a different way. I often find it easier to understand one of Paul’s ideas from someone else’s explanation.
Yes, that seems easy, but actually helping seems much harder.
How do you determine what is “high-impact” before you have a utility function? Even “reversible” is relative to a utility function, right? It doesn’t mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?
It seems to me that “avoid irreversible high-impact actions” would only work if one had a small amount of uncertainty over one’s utility function, in which case you could just avoid actions that are considered “irreversible high-impact” by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just have very little idea what your utility function looks like, that doesn’t work because almost any action could be “irreversible high-impact”. For example if I were a negative utilitarian I perhaps ought to spend all my resources trying to stop technological progress leading to space colonization, so anything that I do besides that would be “irreversible high-impact” unless I could go back in time and change my resource allocation.
BTW, here is a section from a draft post that I’m working on. Do you think it would be easy to solve or avoid all of these problems? (This post isn’t specifically addressing Paul’s approach so some of them may be easy to solve under his approach but I don’t think all of them are.)
How to prevent “aligned” AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even “aligned” AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can’t keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic “power corrupts” problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.
(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to avoid or safely handle them.)
Why is that a further-away problem? Even if it is, we still need people to work on them now, if only to generate persuasive evidence in case they really are very hard problems so we can pursue some other strategy to avoid them like stopping or delaying the development of advanced AI as much as possible.
My position on this (that might be clear from previous discussions):
I agree this is a real problem.
From a technical perspective, I think this is even further from the alignment problem (than other AI safety problems), so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)
I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.” New technology might change things in ways we don’t endorse. Natural pressures like death may lead to changes we don’t endorse (though that’s a tricky values call). The constraint of remaining economically/militarily competitive could also force our values to evolve in a bad way (alignment is an instance of that problem, and eventually AI+alignment would address the other natural instance by decoupling human values from the competence needed to remain competitive). And of course there is a hard problem in that we don’t know how to deliberate/reflect. The “figure out how to deliberate” problem seems like it is relatively easily postponed, since you don’t have to solve it until you are doing deliberation, but the “help people avoid errors in deliberation” may be more urgent.
The reason I consider alignment more urgent is entirely quantitative and very empirically contingent, I don’t think there is any simple argument against. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future, leading to 0.3%/year of losses from alignment, and right now it looks reasonably tractable. Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
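The 0.3%/year figure above can be reproduced with a short back-of-the-envelope calculation. The following is only a minimal sketch: it assumes the “>1/3” probability is taken as exactly 1/3 and that the expected loss is spread evenly across the 20 subjective years. The variable names are illustrative and not from the original comment.

```python
# Back-of-the-envelope check of the "0.3%/year" figure.
# Assumptions (not stated in the original comment): the ">1/3" probability
# is treated as exactly 1/3, and the expected loss is spread evenly over
# the 20-year window.

p_superhuman_ai = 1 / 3      # chance AI is solidly superhuman within the window
loss_if_unsolved = 0.20      # share of the future's value lost in those scenarios
horizon_years = 20           # subjective years

# Expected share of total value lost over the whole window.
expected_loss = p_superhuman_ai * loss_if_unsolved

# Spread evenly, this gives the expected loss per year.
loss_per_year = expected_loss / horizon_years

print(f"expected total loss: {expected_loss:.1%}")
print(f"expected loss per year: {loss_per_year:.2%}")
```

Under these assumptions the per-year figure rounds to the 0.3% quoted above; a probability higher than 1/3 would push it up proportionally.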
I don’t think I’m likely to work on this problem unless I become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe). I feel like I’ve already poked at it enough that VOI from more poking is lower than just charging ahead on alignment. But that is a stronger judgment than the last section, and I think is largely due to comparative advantage considerations, and I would certainly be supportive of work on this topic (e.g. would be happy to fund, would engage with it, etc.).
This is a leading contender for what I would do if alignment seemed unappealing, though I think that broader institutional improvement / capability enhancement / etc. seems more appealing. I’d definitely spend more time thinking about it.
I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical while it’s not obvious whether it will accelerate solutions. I don’t think this is particularly important, but it does make me feel even more comfortable with the naming issue: this isn’t really a problem about AI at all, it’s just one of many issues that is modulated by AI.
I think the main way AI is relevant to the cost-effectiveness analysis of shaping-the-evolution-of-values is that it may decrease the amount of work that can be done on these problems between now and when they become serious (if AI is effectively accelerating the timeline for catastrophic value change without accelerating work on making values evolve in a way we’d endorse).
To the extent that the value of working on these problems is dominated by that scenario (“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”), I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone has a much better understanding, then a development of a better shared understanding), (b) I really want a word other than “alignment,” and probably multiple words. I guess the one that feels most urgently-unnamed right now is something like: understanding how values evolve and what features may introduce that evolution in a way we don’t endorse, including social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
This statement of the problem seems to assume a subjectivist or anti-realist view of metaethics (items 4 or 5 on this list). Consider the analogous statement, “mathematicians’ beliefs about mathematical statements will evolve over time, and they have preferences about how their beliefs evolve”. I think a lot of mathematicians would object to that and instead say that they prefer to have true beliefs about mathematics, and their “preferences about how their beliefs evolve” are just their best guesses about how to arrive at true beliefs.
Assuming you agree that we can’t be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be. It implies that instead of the AI (and indirectly the AI designer) potentially having (if a realist or relativist metaethical position is correct) an obligation/opportunity to help the user figure out what their true or normative values are, which may involve solving difficult metaethical and other philosophical questions, the AI can just follow the user’s preferences about how their values evolve.
Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than they could potentially be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?” and would consider this an instance of the more general problem “When considering AI safety, it’s not safe to assume that the human user/operator/supervisor is a generally safe agent.”
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you. Even if the total risk of “value corruption” is 10x smaller, it seems like the marginal impact of an additional researcher on “value corruption” could be higher given that there are now about 20(?) full time researchers working mostly on AI motivation but zero on this problem (as far as I know), and then we also have to consider the effect of a marginal researcher on the future growth of each field, and future effects on public opinion and policy makers. Unfortunately, I don’t know how to calculate these things even in a back-of-the-envelope way. As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
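The rule of thumb above amounts to a simple proportional allocation. Here is a minimal sketch of it, assuming the rough figure of 20 researchers on motivation and a 10x risk ratio; the function name `proportional_headcount` is hypothetical and introduced only for illustration.

```python
def proportional_headcount(reference_headcount: float, risk_ratio: float) -> float:
    """Headcount for a risk that is `risk_ratio` times smaller than a reference risk.

    Implements the rule of thumb "a risk X times bigger should have about
    X times as many people working on it", read in the other direction.
    """
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    return reference_headcount / risk_ratio


# Rough figure from the comment: about 20 full-time researchers on motivation.
motivation_researchers = 20

# Even if "value corruption" is judged to be a 10x smaller risk:
value_corruption_researchers = proportional_headcount(motivation_researchers, risk_ratio=10)
print(value_corruption_researchers)
```

This ignores tractability, field growth, and effects on public opinion, which the comment explicitly flags as hard to estimate, so it gives only the baseline the rule of thumb implies.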
I see no reason to convince you personally to work on “value corruption” since your intuition on the relative severity of the risks is so different from mine, and under either of our views we obviously still need people to work on motivation / alignment-in-your-sense. I’m just hoping that you won’t (intentionally or unintentionally) discourage people from working on “value corruption” so strongly that they don’t even consider looking into that problem and forming their own conclusions based on their own intuitions/priors.
This seems totally reasonable to me, but 1) others may have other ideas about how to intervene on this problem, and 2) even within factored cognition or debate there are probably research directions that skew towards being more applicable to motivation and research directions that skew towards being more applicable to “value corruption” and I don’t want people to be excessively discouraged from working on the latter by statements like “motivation contains the most urgent part”.
If you think this risk is very large, presumably there is some positive argument for why it’s so large? That seems like the most natural way to run the argument. I agree it’s not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.
In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: “(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don’t have sufficiently good understanding.” (Each of those claims obviously requires more argument, etc.)
One version of the case for worrying about value corruption would be:
It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
It may be that historical variation is itself problematic, and we care mostly about our particular values.
Or it may be that values are “hardened” against certain kinds of environment shift that occur in nature, and that they will go to some lower “default” level of robustness under new kinds of shifts.
Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
If so, the rate of change in subjective time could be reasonably high—perhaps the change that occurs within one generation could shift value far enough to reduce value by 50% (if that change wasn’t endorsed for decision-theoretic reasons / hardened against).
It’s plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.
This kind of story is kind of conjunctive, so I’d expect to explore a few lines of argument like this, and then try to figure out what are the most important underlying uncertainties (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).
My most basic concerns with this story are things like:
In “well-controlled” situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people’s consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We’ve talked a little bit about this, and you’ve pointed out some reasons this kind of scheme isn’t totally satisfactory even if it works as intended, but quantitatively the reasons you’ve pointed to don’t seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
In most practical situations, it doesn’t seem like “understanding of how to avert drift” is the key bottleneck to averting drift—it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve. That’s still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.
In the end I’m doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that’s how I get to the intuitive view that it’s maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it’s higher impact though.
I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other factors not captured in the total quantity of risk; maybe I’d expect us to have 2-10x more resources per unit risk devoted to more tractable risks.
In this case I’d be happy with the recommendation of ~10x more people working on motivation than on value drift, that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.
I agree I should be more careful about this.
I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I’m generally inclined to express my views), but could hedge more when making statements like this.
(I think saying “X is more urgent than Y” is basically compatible with the view “There should be 10 people working on X for each person working on Y,” even if one also believes “but actually on the current margin investment in Y might be a better deal.” Will edit the post to be a bit softer here though.
ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)
Yeah, I didn’t literally mean that I don’t have any arguments, but rather that we’ve discussed it in the past and it seems like we didn’t get close to resolving our disagreement. I tend to think that Aumann Agreement doesn’t apply to humans, and it’s fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don’t think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there’s too many people working on this problem, which might never actually happen).
I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we’re not sure about, and another one that is defined to be a specific urgent problem? Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Yes:
The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
The word “alignment” was chosen (originally by Stuart Russell I think) precisely because it is such a good name for the problem of aligning AI values with human values; it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said
[continuing last point] The proposed usage of “alignment” doesn’t meet these desiderata though; it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
People have introduced a lot of terms and change terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless, it really does make things more and more confusing.
I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), (c) is generally less optimized (related to the second point above, “alignment” is quite a good term for this area).
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search) but I can certainly see how it muddies the waters from your perspective.
This seems fine to me, if you could give the benefit of doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
I think it’s reasonable for me to be more careful about clarifying what any particular line of research does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
(They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
I guess both “reliable” and “beneficial” are matters of degree so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned and the latter might actually be less aligned if H has a wrong idea of metaethics or what his true/normative values are and as a result trying to figure out and satisfy those values is not something that H wants A to do.
That seems good too.
This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
In what sense is that a more beneficial goal?
“Successfully do X” seems to be the same goal as X, isn’t it?
“Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
(This is why I wrote:
)
Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it which suggests more of a de re reading. To use your example “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would call an A that correctly understands H’s preferences (and gets oranges) more reliably pursuing beneficial goals.
Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
If that’s right, how is this generalization defined? (E.g. The CEV was “what H wants in the limit of infinite intelligence, reasoning time and complete information”.)
I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms to “alignment” because they seemed to be too broad a definition, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.
10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
People not caring about X prima facie decreases the returns to research on X, but may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figuring out what their true or normative values are, or protecting their values against manipulation), than to convince all the potential users to want that themselves.
I agree that:
If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
I think that both
(a) Trying to have influence over aspects of value change that people don’t much care about, and
(b) better understanding the important processes driving changes in values
are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
(I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) are what’s relevant.
It’s agent tiling for AI+controller agents, any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).
I don’t see why the anti-realist version is any easier, my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realistic mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more potential prospect for error-correction, but I don’t think it’s a big difference.
It doesn’t sound that way to me, but I’m happy to avoid framings that might give people the wrong idea.
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
But in terms of the actual meanings rather than their impacts on people, I’d be about as happy with “avoiding corruption of values” as “having our values evolve in a positive way.” I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.
It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve. But you’re right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed also does not convey a proper amount of uncertainty in its difficulty.
That’s a good point that I hadn’t thought of. (I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I’d highly welcome that.
That may be a connotation of the “preferences about how their values evolve,” but doesn’t seem like it follows from the anti-realist position.
I have preferences over what actions my robot takes. Yet if you asked me “what action do you want the robot to take?” I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don’t know). My preferences over value evolution can be similar.
Indeed, if moral realists are right, “ultimately converge to the truth” is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing “help people’s preferences evolve in the way they want them to evolve.”) Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it’s easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).
I agree that drift is also problematic.
Would you agree with this way of stating it: There are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.
Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn’t be a precisely analogous error on the non-realistic perspective?
There is some uninteresting semantic sense in which there are “more ways to be wrong” (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.
I might be using the word “values” in a different way than you. I think I can say something like “I’d like to deliberate in way X” and be wrong. I guess under non-realism I’m “incorrectly stating my preferences” and under realism I could be “correctly stating my preferences but be wrong,” but I don’t see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.
Suppose the user says “I want to try to figure out my true/normative values by doing X. Please help me do that.” If moral anti-realism is true, then the AI can only check if the user really wants to do X (e.g., by looking into the user’s brain and checking if X is encoded as a preference somewhere). But if moral realism is true, the AI could also use its own understanding of metaethics and metaphilosophy to predict if doing X would reliably lead to the user’s true/normative values, and warn the user or refuse to help or take some other action if the answer is no. Or if one can’t be certain about metaethics yet, and it looks like X might prematurely lock the user into the wrong values, the AI could warn the user about that.
I definitely don’t mean such a narrow sense of “want my values to evolve.” Seems worth using some language to clarify that.
In general the three options seem to be:
You care about what is “good” in the realist sense.
You care about what the user “actually wants” in some idealized sense.
You care about what the user “currently wants” in some narrow sense.
It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you’d be in the first case if you accepted realism, it seems like you’d probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism, your views could depend in an arbitrary way on whether realism is true or false depending on what versions of realism/non-realism are competing, but I’m assuming something like the most common realist and non-realist views around here.)
To defend my original usage both in this thread and in the OP, which I’m not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted—the usual English usage of these terms involves at least mild idealization.
Longer form of my opinion:
Metaphilosophy is hard, and we need to solve it eventually. This might happen by default, i.e. if we simply build a well-motivated AI without thinking about metaphilosophy and without running any social interventions designed to get the AI’s operators to think about metaphilosophy, humanity might still realize that metaphilosophy needs to be solved, and then go ahead and solve it. I’m quite unsure right now whether or not it will happen by default.
However, in the world where the AI’s operators don’t agree that we need to solve metaphilosophy, I am very pessimistic about the AI realizing that it should help us with metaphilosophy and doing so. The one way I could imagine it happening is by programming in the right utility function (not even learning it, since if you learn it then you probably learn that metaphilosophy doesn’t need to be solved), which seems hopelessly doomed. It seems really hard to make an AI system where you can predict in advance that it will help us solve metaphilosophy regardless of the operator’s wishes.
In the world where the AI’s operators do agree that we need to solve metaphilosophy, I think we’re in a much better position. A background assumption I have is that humans motivated to solve metaphilosophy will be able to do so given enough time: I share Paul’s intuition that humans who no longer have to worry about food, water, shelter, disease, etc. could deliberate for a long time and make progress. In that case, a well-motivated AI would be fine: it would stay deferential, perhaps learn more things in order to be more competent, and do the things we ask it to do, which might include helping us in our deliberation by bringing up arguments we hadn’t considered yet. (And note a well-motivated AI should only bring up arguments it believes are true, or likely to be true.)
I’ve laid out two extreme ways the world could be, and of course there’s a spectrum between them. But thinking about the extremes makes me think of this not as a part of AI alignment, but as a social coordination problem, that is, we need to have humanity (especially the AI’s operators) agree that metaphilosophy is hard and needs to be solved. I’d support interventions that make this more likely, eg. more public writing that talks about what we do after AGI, or about the possibility of a Great Deliberation before using the cosmic endowment, etc. If we succeed at that and at building a well-motivated AI system, I think that would be sufficient.
I mean something more like “don’t do things that a human wouldn’t do, that seem crazy from a human perspective”. I’m not suggesting that the AI has a perfect understanding of what “irreversible” and “high-impact” mean. But it should be able to predict what things a human would find crazy for which it should probably get the human’s approval before doing the thing. (As an analogy, most employees have a sense of what it is okay for them to take initiative on, vs. what they should get their manager’s approval for.)
Yeah, I more mean something like “continuation of the status quo” rather than “irreversible high-impact”, as TurnTrout talks about below.
I am not sure. I think it is relatively easy to look back at how we have responded to similar events in the past and notice that something is amiss—for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened, or that many humans don’t like it when you take advantage of their motivational systems, and so to at least not be confident in the actions you mention. On the other hand, there may be similar types of events in the future that we can’t back out by looking at the past. I don’t know how to deal with these sorts of unknown unknowns.
I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of safety techniques we develop, and so in the short term to avoid these problems you want to intervene on the humans who are deploying AI systems.
Yeah, looking back I don’t like that reason, I think I had an intuition that it wasn’t an urgent problem and wanted to jot a quick sentence to that effect and the sentence came out wrong.
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn’t much opportunity for us to be manipulated. You might have the intuition that even one unaligned AI could successfully manipulate everyone’s values, and so we would still need the aligned AI systems to be able to defend against that. I’m not sure where I stand on that—it seems possible to me that this is just very hard to do, especially when there are aligned superintelligent systems that would by default put a stop to it if they find out about it.
But really I’m just confused on this topic and would need to think more about it.
I’m not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?
Think of the human as a really badly designed AI with a convoluted architecture that nobody understands, spaghetti code, full of security holes, has no idea what its terminal values are and is really confused even about its “interim” values, has all kinds of potential safety problems like not being robust to distributional shifts, and is only “safe” in the sense of having passed certain tests for a very narrow distribution of inputs.
Clearly it’s not safe for a much more powerful outer AI to query the human about arbitrary actions that it’s considering, right? Instead, if the human is to contribute anything at all to safety in this situation, the outer AI has to figure out how to generate a bunch of smaller queries that the human can safely handle, from which it would then infer what the human would say if it could safely consider the actual choice under consideration. If the AI is bad at this “competence” problem it could send unsafe queries to the human and corrupt the human, and/or infer the wrong thing about what the human would approve of.
Is it clearer now why this doesn’t seem like an easy problem to me?
I’m not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us? It seems far from easy to do this in a robust way. I mean this classifier would be facing lots of unpredictable distributional shifts… I guess you made a similar point when you said “On the other hand, there may be similar types of events in the future that we can’t back out by looking at the past.”
ETA: Do you expect that different AIs would do different things in this regard depending on how cautious their operators are? Like some AIs would learn from their operators to be really cautious, and restrict technologies/choices that it isn’t sure won’t corrupt humans, but other operators and their AIs won’t be so cautious so a bunch of humans will be corrupted as a result, but that’s a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn’t very high? (This is my current understanding of Paul’s position, and I wonder if you have a different position or a different way of putting it that would convince me more.) What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers? What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?
This seems right.
Manipulation doesn’t have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won’t stop Alice from using it to manipulate Bob.
ETA: I forgot to mention that I don’t understand this part, can you please explain more:
I don’t know, I want to outsource that decision to humans + AI at the time where it is relevant. Perhaps it involves stopping technological development. Perhaps it means continuing technological development, but not doing any space colonization. My point is simply that if humans agree that metaphilosophy needs to be solved, and the AI is trying to help humans, then metaphilosophy will probably be solved, even if I don’t know how exactly it will happen.
Yes. It seems to me like you’re considering the case where a human has to be able to give the correct answer to any question of the form “is this action a good thing to do?” I’m claiming that we could instead grow the set of things the AI does gradually, to give time for humans to figure out what it is they want. So I was imagining that humans would answer the AI’s questions in a frame where they have a lot of risk aversion, so anything that seemed particularly impactful would require a lot of deliberation before being approved.
I was thinking more of the case where a single human amassed a lot of power. Humans haven’t seemed to solve the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.
Yes.
Kind of? I’d amend that slightly to say that to the extent that I think it is a problem (I’m not sure), I want to solve it in some way that is not technical research. (Possibilities: convince everyone to be cautious, obtain a decisive strategic advantage and enforce that everyone is cautious.)
Same as above.
Same as above. All of these problems that you’re talking about would also apply to technology that could make a human smarter. It seems like it would be easiest to address on that level, rather than trying to build an AI system that can deal with these problems even though the operator would not want them to correct for the problem.
This seems like an empirical fact that makes the problems listed above harder to solve.
So I broadly agree with Paul’s reasons for aiming for competitiveness. Given competitiveness, you might hope that we would automatically get defense against value manipulation by other AIs, since our aligned AI will defend us from value manipulation by similarly-capable unaligned AIs (or aligned AIs that other people have). Of course, defense might be a lot harder than offense, and you probably do think that, in which case this doesn’t really help us. (As I said, I haven’t really thought about this before.)
Overall view: I don’t think that the problems you’ve mentioned are obviously going to be solved as a part of AI alignment. I think that solving them will require mostly interventions on humans, not on the development of AI. I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result. If I were substantially more pessimistic, I would put more effort into strategy and governance issues. (Not sure I would change what I’m doing given my comparative advantage at technical research, but it would at least change what I advise other people do.)
Meta-view on our disagreement: I suspect that you have been talking about the problem of “making the future go well” while I’ve been talking about the problem of “getting AIs to do what we want” (which do seem like different problems to me). Most of the problems you’ve been talking about don’t even make it into the bucket of “getting AIs to do what we want” the way I think about it, so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying. I think we do disagree on how important the problems you identify are, but not as much as you would think, since I’m quite uncertain about this area of problem-space.
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
It seems to me that coordination is really hard. Yes, we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example if those problems are very difficult to solve so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
Ok, I appreciate that.
Two reasons come to mind:
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you’re going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we’re going to have lives of leisure due to automation, or to delay full automation if we’re soon going to get 4 hour workdays. Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try and direct these efforts at the right target.
I want to emphasize though that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.
I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator’s wishes-upon-reflection. (If your technical solution is in line with the operator’s wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator’s wishes-upon-reflection?) as well as hard to implement (why would the operator use a system that’s going to do something they don’t want?).
You might argue that there are things that the operator would want if they could get it (eg. global coordination), but they can’t achieve it now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.
Agreed. I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers. On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
I’ll keep that in mind. When I wrote the original comment, I wasn’t even thinking about problems like the ones you mention, because I categorize them as “strategy” by default, and I was trying to talk about the technical problem.
Do you think that at the time when AI development wasn’t an already-running process, and AI was still a new thing that the public could be expected to be risk-averse about (when would you say that was?), the argument “working on alignment isn’t urgent because humans can probably coordinate to stop AI development” would have been a good one?
Same question here. Back when “don’t develop AI” was still binding on our future selves, should we have expected that we will coordinate to stop AI development, and it’s just bad luck that we haven’t succeeded in doing that?
Can you be more specific? What global agreement do you think would be reached, that is both realistic and would solve the kinds of problems that I’m worried about (e.g., unintentional corruption of humans by “aligned” AIs who give humans too much power or options that they can’t handle, and deliberate manipulation of humans by unaligned AIs or AIs aligned to other users)?
For example, create an AI that can help the user with philosophical questions at least as much as technical questions. (This could be done for example by figuring out how to better use Iterated Amplification to answer philosophical questions, or how to do imitation learning of human philosophers, or how to apply inverse reinforcement learning to philosophical reasoning.) Then the user could ask questions like “Am I likely to be corrupted by access to this technology? What can I do to prevent that while still taking advantage of it?” Or “Is this just an extremely persuasive attempt at manipulation or an actually good moral argument?”
As another example, solve metaethics and build that into the AI so that the AI can figure out or learn the actual terminal values of the user, which would make it easier to protect the user from manipulation and self-corruption. And even if the human user is corrupted, the AI still has the correct utility function, and when it has made enough technological progress it can uncorrupt the human.
Can you point me to any relevant results that have been written down, or explain what you learned from those conversations?
To address this and the question (from the parallel thread) of whether you should personally work on this, I think we need people to either solve the technical problems or at least to collectively try hard enough to convincingly say that it’s too difficult to do. (Otherwise who is going to convince policymakers to adopt the very costly social solutions? Who is going to convince people to start/join a social movement to influence policymakers to consider those costly social solutions? The fact that those things tend to take a lot of time seems like sufficient reason for urgency on the technical side, even if you expect the social solutions to be feasible.) Who are these people going to be, especially the first ones to join the field and help grow it? Probably existing AI alignment researchers, right? (I can probably make stronger arguments in this direction but I don’t want to be too “pushy” so I’ll stop here.)
I forgot to follow up on this important part of our discussion:
It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I’m talking about (which are largely caused by technological progress outpacing philosophical/moral progress). I could make some arguments about this, but I’m curious if this doesn’t seem obvious to you.
Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it’s a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this (especially with regard to the particular problems that I’m pointing out).
Yes, I agree with this. The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.
This seems true. Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
This doesn’t make much sense to me. Why is this any kind of reason to expect that solutions are likely to come from outside of AI? Can you give me an analogy where this kind of reasoning more obviously makes sense?
Right, this argument wasn’t targeted to you, but I think there are other reasons for you to personally prioritize this. See my comment in the parallel thread.
From the AUP perspective, this only seems true in a way analogous to the statement that “any hypothesis can have arbitrarily long description length”. It’s possible to make practically no assumptions about what the true utility function is and still recover a sensible notion of “low impact”. That is, penalizing shifts in attainable utility for even random or simple functions still yields the desired behavior; I have experimental results to this effect which aren’t yet published. This suggests that the notion of impact captured by AUP isn’t dependent on realizability of the true utility, and hence the broader thing Rohin is pointing at should be doable.
While it’s true that some complex value loss is likely to occur when not considering an appropriate distribution over extremely complicated utility functions, it seems by-and-large negligible. This is because such loss occurs either as a continuation of the status quo or as a consequence of something objectively mild, which seems to correlate strongly with reasonably human-values mild.
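As a rough illustration of "penalizing shifts in attainable utility", here is a minimal sketch. It assumes a specific form that the comment above does not spell out: the penalty is the mean absolute difference between the attainable utility after an action and after a no-op, taken over a set of auxiliary utility functions. The names `aup_penalty` and `aup_score`, the `weight` parameter, and the toy Q-values are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable
# Q(s, a): attainable utility for one auxiliary function after taking a in s.
QFunction = Callable[[State, Action], float]


def aup_penalty(q_aux: Sequence[QFunction], state: State,
                action: Action, noop: Action) -> float:
    """Mean absolute shift in attainable utility relative to doing nothing.

    Assumption: averaging absolute differences against a no-op baseline.
    """
    if not q_aux:
        return 0.0
    shifts = [abs(q(state, action) - q(state, noop)) for q in q_aux]
    return sum(shifts) / len(shifts)


def aup_score(task_q: QFunction, q_aux: Sequence[QFunction], state: State,
              action: Action, noop: Action, weight: float = 1.0) -> float:
    """Task value minus the weighted impact penalty (hypothetical scoring rule)."""
    return task_q(state, action) - weight * aup_penalty(q_aux, state, action, noop)


# Toy auxiliary Q-functions (made-up numbers, not experimental results).
q1 = lambda s, a: {"noop": 1.0, "mild": 1.5, "drastic": 5.0}[a]
q2 = lambda s, a: {"noop": 2.0, "mild": 2.0, "drastic": 0.0}[a]
task_q = lambda s, a: {"noop": 0.0, "mild": 1.0, "drastic": 2.0}[a]

best = max(["noop", "mild", "drastic"],
           key=lambda a: aup_score(task_q, [q1, q2], "s0", a, "noop"))
```

In this toy setup the drastic action shifts both auxiliary utilities a lot and so scores worse than the mild one, even though neither auxiliary function was chosen to resemble the "true" utility.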
Another con of the motivation-competence decomposition: unlike definition-optimization, it doesn’t actually seem to be a clean decomposition of the larger task, such that we can solve each subtask independently and then combine the solutions.
For example one way we could solve the motivation problem is by building a perfect human imitation (of someone who really wants to help H do what H wants), but then we seem to be stuck on the “competence” front, and there’s no clear way to plug this solution of “motivation” into a better generic solution to “competence” to get a more competent intent-aligned agent. Instead it seems like we have to solve the competence problem that is particular to the specific solution to motivation, or solve motivation and competence together as one large problem.
In contrast, the problem of specifying an aligned utility function and the problem of building safe EU maximizers seem to be naturally independent problems, such that once we have a specification of an aligned utility function (or a method of specifying aligned utility functions), we can just plug that into more and more powerful and robust EU maximizers.
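The "plug in" property can be made concrete with a toy sketch. This is only an illustration under simplifying assumptions (a finite action set and a known outcome distribution per action); the names `expected_utility`, `eu_maximize`, and the toy model are hypothetical. The point is that the maximizer takes the utility function as an opaque argument, so either piece can be swapped without touching the other.

```python
from typing import Callable, Dict, Hashable, Iterable

Outcome = Hashable
Action = Hashable
Utility = Callable[[Outcome], float]                     # the "definition" part
OutcomeModel = Callable[[Action], Dict[Outcome, float]]  # P(outcome | action)


def expected_utility(utility: Utility, model: OutcomeModel, action: Action) -> float:
    """Expected utility of one action under the outcome model."""
    return sum(p * utility(o) for o, p in model(action).items())


def eu_maximize(utility: Utility, model: OutcomeModel,
                actions: Iterable[Action]) -> Action:
    """The "optimization" part: pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(utility, model, a))


# Toy world model (made-up probabilities).
def model(action):
    return {"safe": {"ok": 1.0},
            "risky": {"great": 0.5, "bad": 0.5}}[action]


# Two alternative utility specifications plugged into the same maximizer.
utility_a = lambda o: {"ok": 1.0, "great": 3.0, "bad": 0.0}[o]
utility_b = lambda o: {"ok": 1.0, "great": 3.0, "bad": -4.0}[o]

choice_a = eu_maximize(utility_a, model, ["safe", "risky"])
choice_b = eu_maximize(utility_b, model, ["safe", "risky"])
```

Changing only the utility specification changes the chosen action, while `eu_maximize` itself is untouched; a stronger optimizer could likewise replace `eu_maximize` without changing the utility.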
Furthermore I think this lack of clean decomposition shows up at the conceptual level too, not just the pragmatic level. For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn’t very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both? It seems arguable or hard to say. In contrast, in a system that is built using the definition-optimization decomposition, it seems like it would be easy to trace any safety failures to either the “definition” solution or the “optimization” solution.
I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can’t talk coherently about their “motivation”. Probably all deep-learning-based systems fall into this category.
I also agree that (at least for now, and probably in the future as well) you can’t formally specify the “type signature” of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem.
My hope here would be to solve the motivation problem and leave the competence problem for later, since by my view that solves most of the problem (I’m aware that you disagree with this).
I don’t agree that it’s not clean at the conceptual level. It’s perhaps less clean than the definition-optimization decomposition, but not much less.
This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
It also seems like a failure of motivation though, because as soon as the Oracle started to do malign optimization, the system as a whole was no longer trying to do what H wants.
Or is the idea that as long as the top-level or initial optimizer is trying (or tried) to do what H wants, then all subsequent failures of motivation don’t count, so we’re excluding problems like inner alignment from motivation / intent alignment?
I’m unsure what your answer would be, and what Paul’s answer would be, and whether they would be the same, which at least suggests that the concepts haven’t been cleanly decomposed yet.
ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)? It seems really counterintuitive if the answer is “no”.
Oh, I see, you’re talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I’d say it fails motivation, mostly because the system doesn’t really have a single “motivation”).
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn’t apply this framework to the system (and I also wouldn’t apply definition-optimization to the system).
This was an unexpected answer. Isn’t HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn’t what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn’t apply to IDA either. I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
Even a very theoretically simple system like AIXI doesn’t seem to be “trying” to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to “know” that its actions won’t lead to reward.
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I’ve mostly stopped working along these lines because it no longer seems directly useful.)
I agree.
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we’ve had a similar discussion before and either it didn’t get resolved or I didn’t understand your position. I didn’t see a direct attempt to answer this in the comment I’m replying to, and it’s fine if you don’t want to go down this road again but I want to convey my continued confusion.)
I don’t understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to say Rohin’s. Not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn’t realize.)
This makes sense.
The oracle is not aligned when asked questions that cause it to do malign optimization.
The human+oracle system is not aligned in situations where the human would pose such questions.
For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use benign when talking about possibly-incoherent systems, or things that don’t even resemble optimizers.
The definition in this post is a bit sloppy here, but I’m usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and want to admit vagueness in “what H wants it to do” (such that there can be several different preferences that are “what H wants”) we could say something like:
That’s not great either though (and I think the original post is more at an appropriate level of attempted-precision).
(In the following I will also use “aligned” to mean “intent aligned”.)
Ok, sounds like “intent aligned at some points in time and not at others” was the closest guess. To confirm, would you endorse “the system was aligned when the human imitation was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question”?
Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the “intent alignment problem” has been solved for an AI, or when would you call an AI (such as IDA) itself “intent aligned”? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use “intent alignment” you always have some specific situation or set of situations in mind?
Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:
Yes, I shouldn’t have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is “trying to do”, i.e. I wouldn’t say it has a single “motivation”. This allows you to say “the system is not intent-aligned”, even though you can’t say “the system is trying to do X”.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:
Also, I want to note strong agreement with this:
HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) “What is a good approximation of the user’s utility function?” followed by “What action would maximize EU according to this utility function?”
ETA: If this isn’t clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.