I don’t share this intuition about the more general problem (that we probably can’t find a corrigible, universal core of reasoning unless we can hard code it)
If your definition of “corrigible” does not include things like the ability to model the user and detect ambiguities as well as a typical human, then I don’t currently have a strong intuition about this. Is your view/hope then that starting with such a core, if we amplify it enough, eventually it will figure out how to safely learn (or deduce from first principles, or something else) how to understand natural language, model the user, detect ambiguities, balance between the user’s various concerns, and so on? (If not, it would be stuck with either refusing to do anything except literal-minded mechanical tasks that don’t require such abilities, or frequently making mistakes of the type “hack a bank when I ask it to make money”, which I don’t think is what most people have in mind when they think of “aligned AGI”.)
Is your view/hope then that starting with such a core, if we amplify it enough, eventually it will figure out how to safely learn (or deduce from first principles, or something else) how to understand natural language, model the user, detect ambiguities, balance between the user’s various concerns, and so on?
Yes. My hope is to learn or construct a core which:
Doesn’t do incorrigible optimization as it is amplified.
Increases in competence as it is amplified, including competence at tasks like “model the user,” “detect ambiguities” or “make reasonable tradeoffs about VOI vs. safety” (where the information in question includes the user’s preferences, and “safety” includes the risk of value drift). I don’t have optimism about finding a core which is already highly competent at these tasks.
I grant that even given such a core, we will still be left with important and unsolved x-risk relevant questions like “Can we avoid value drift over the process of deliberation?”
It appears that I seriously misunderstood what you mean by corrigibility when I wrote this post. But in my defense, in your corrigibility post you wrote, “We say an agent is corrigible (article on Arbital) if it has these properties.” and the list includes helping you “Make better decisions and clarify my preferences” and “Acquire resources and remain in effective control of them” and to me these seem to require at least near human level ability to model the user and detect ambiguities. And others seem to have gotten the same impression from you. Did your conception of corrigibility change at some point, or did I just misunderstand what you wrote there?
Since this post probably gave even more people the wrong impression, I should perhaps write a correction, but I’m not sure how. How should I fill in this blank? “The way I interpreted Paul’s notion of corrigibility in this post is wrong. It actually means ___.”
Increases in competence as it is amplified, including competence at tasks like “model the user,” “detect ambiguities” or “make reasonable tradeoffs about VOI vs. safety”
Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I’m imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it’s not something that can be done via a project like Ought?)
I think your post is (a) a reasonable response to corrigibility as outlined in my public writing, (b) a reasonable but not decisive objection to my current best guess about how amplification could work. In particular, I don’t think anything you’ve written is too badly misleading.
In the corrigibility post, when I said “AI systems which help me do X” I meant something like “AI systems which help me do X to the best of their abilities,” rather than having in mind some particular threshold for helpfulness at which an AI is declared corrigible (similarly, I’d say an AI is aligned if it’s helping me achieve my goals to the best of its abilities, rather than fixing a certain level of helpfulness at which I’d call it aligned). I think that post was unclear, and my thinking has become a lot sharper since then, but the whole situation is still pretty muddy.
Even that’s not exactly right, and I don’t have a simple definition. I do have a lot of intuitions about why there might be a precise definition, but those are even harder to pin down.
(I’m generally conflicted about how much to try to communicate publicly about early stages of my thinking, given how frequently it changes and how fuzzy the relevant concepts are. I’ve decided to opt for a medium level of communication, since it seems like the potential benefits are pretty large. I’m sorry that this causes a lot of trouble though, and in this case I probably should have been more careful about muddying notation. I also recognize it means people are aiming at a moving target when they try to engage; I certainly don’t fault people for that, and I hope it doesn’t make it too much harder to get engagement with more precise versions of similar ideas in the future.)
Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I’m imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it’s not something that can be done via a project like Ought?)
What uncertainty in particular?
Things I hope to see before we have very powerful AI:
Clearer conceptual understanding of corrigibility.
Significant progress towards a core for metaexecution (either an explicit core, or an implicit representation as a particular person’s policy), which we can start to investigate empirically.
Amplification experiments which show clearly how complex tasks can be broken into simpler pieces, and let us talk much more concretely about what those decompositions look like and in what ways they might introduce incorrigible optimization. These will also directly resolve logical uncertainty about whether proposed decomposition techniques actually work.
Application of amplification to some core challenges for alignment, most likely either (a) producing competitive interpretable world models, or (b) improving reliability, which will make it especially easy to discuss whether amplification can safely help with these particular problems.
If my overall approach is successful, I don’t feel like there are significant uncertainties that we won’t be able to resolve until we have powerful AI. (I do think there is a significant risk that I will become very pessimistic about the “pure” version of the approach, and that it will be very difficult to resolve uncertainties about the “messy” version of the approach in advance because it is hard to predict whether the difficulties for the pure version are really going to be serious problems in practice.)
I’m generally conflicted about how much to try to communicate publicly about early stages of my thinking, given how frequently it changes and how fuzzy the relevant concepts are.
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand. From a selfish perspective I wish you’d spend more time writing down more details and trying harder to model your readers and preempt ambiguities and potential misunderstandings, but of course the tradeoffs probably look different from your perspective. (I also want to complain (again?) that Medium.com doesn’t show discussion threads in a nice tree structure, and doesn’t let you read a comment without clicking to expand it, so it’s hard to see what questions other people asked and how you answered. Ugh, talk about trivial inconveniences.)
What uncertainty in particular?
How much can the iterated amplification of an impoverished overseer safely learn about how to help humans (how to understand natural language, build models of users, detect ambiguities, and be generally competent)? Is it enough to attract users and to help them keep most of their share of the cosmic endowment against competition with malign AIs?
I thought more about my own uncertainty about corrigibility, and I’ve fleshed out some intuitions on it. I’m intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.
Suppose we have an agent A optimizing for some values V. I’ll call an AI system S high-impact calibrated with respect to A if, when A would consider an action “high-impact” with respect to V, S will correctly classify it as high-impact with probability at least 1-ɛ, for some small ɛ.
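As a rough formalization (my own sketch; the distribution over actions and the exact event being conditioned on are left unspecified, so take this only as a restatement of the sentence above):

```latex
\[
  \Pr_{a \sim \mathcal{D}}\Big[\, S \text{ classifies } a \text{ as high-impact}
    \;\Big|\; A \text{ would consider } a \text{ high-impact w.r.t.\ } V \,\Big]
  \;\ge\; 1 - \epsilon
\]
```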
My intuitions about corrigibility are as follows:
1. If you’re not calibrated about high-impact, catastrophic errors can occur. (These are basically black swans, and black swans can be extremely bad.)
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
3. To learn how to be high-impact calibrated w.r.t. A, you will have to generalize properly from training examples of low/high-impact (i.e. be robust to distributional shift).
4. To robustly generalize, you’re going to need the ontologies / internal representations that A is using. (In slightly weirder terms, you’re going to have to share A’s tastes/aesthetic.)
5. You will not be able to learn those ontologies unless you know how to optimize for V the way A is optimizing for V. (This is the core thing missing from the well-intentioned extremely non-neurotypical assistant I illustrated.)
6. If S’s “brain” starts out very differently from A’s “brain”, S will not be able to model A’s representations unless S is significantly smarter than A.
In light of this, for any agent A, some value V they’re optimizing for, and some system S that’s assisting A, we can ask two important questions:
(I) How well can S learn A’s representations?
(II) If the representation is imperfect, how catastrophic might the resulting mistakes be?
In the case of a programmer (A) building a web app trying to make users happy (V), it’s plausible that some run-of-the-mill AI system (S) would learn a lot of the important representations right and a lot of the important representations wrong, but it also seems like none of the mistakes are particularly catastrophic (worst case, the programmer just reverts the codebase.)
In the case of a human (A) trying to make his company succeed (V), looking for a new CEO (S) to replace himself, it’s usually the case that the new CEO doesn’t have the same internal representations as the founder. If they’re too different, the result is commonly catastrophic (e.g. if the new CEO is an MBA with “more business experience”, but with no vision and irreconcilable taste). Some examples:
For those who’ve watched HBO’s Silicon Valley, Action Jack Barker epitomizes this.
When Sequoia Capital asked Larry and Sergey to find a new CEO for Google, they hemmed and hawed until they found one who had a CS Ph.D and went to Burning Man, just like they did. (Fact-check me on this one?)
When Apple ousted Steve Jobs, the company tanked, and only after he was hired back as CEO did the company turn around and become the most valuable company in the world.
(It’s worth noting that if the MBA got hired as a “faux-CEO”, where the founder could veto any of the MBA’s proposals, the founders might make some use of him. But the way in which he’d be useful is that he’d effectively be hired for some non-CEO position. In this picture, the founders are still doing most of the cognitive work in running the company, while the MBA ends up relegated to being a “narrow tool intelligence utilized for boring business-y things”. It’s also worth noting that companies care significantly about culture fit when looking for people to fill even mundane MBA-like positions...)
In the case of a human (A) generically trying to optimize for his values (V), with an AGI trained to be corrigible (S) assisting, it seems quite unlikely that S would be able to learn A’s relevant internal representations (unless it’s far smarter and thus untrustworthy), which would lead to incorrect generalizations. My intuition is that if S is not much smarter than A, but helping in extremely general ways and given significant autonomy, the resulting outcome will be very bad. I definitely think this if S is a sovereign, but also think this if e.g. it’s doing a thousand years’ worth of human cognitive work in determining if a newly distilled agent is corrigible, which I think happens in ALBA. (Please correct me if I botched some details.)
Paul: Is your picture that the corrigible AI learns the relevant internal representations in lockstep with getting smarter, such that it manages to hit a “sweet spot” where it groks human values but isn’t vastly superintelligent? Or do you think it doesn’t learn the relevant internal representations, but its action space is limited enough that none of its plausible mistakes would be catastrophic? Or do you think one of my initial intuitions (1-6) is importantly wrong? Or do you think something else?
Two final thoughts:
The way I’ve been thinking about corrigibility, there is a simple core to corrigibility, but it only applies when the subagent can accurately predict any judgment you’d make of the world, and isn’t much more powerful than you. This is the case if e.g. the subagent starts as a clone of you, and is not the case if you’re training it from scratch (because it’ll either be too dumb to understand you, or too smart to be trustworthy). I’m currently chewing on some ideas for operationalizing this take on corrigibility using decision theory.
None of this analysis takes into account that human notions of “high-impact” are often wrong. Typical human reasoning processes are pretty susceptible to black swans, as history shows. (Daemons sprouting would be a subcase of this, where naive human judgments might judge massive algorithmic searches to be low-impact.)
I disagree with 2, 4, 5 and the conclusion, though it might depend on how you are defining terms.
On 2, if there are morally important decisions you don’t recognize as morally important (e.g. massive mindcrime), you might destroy value by making the wrong decision and not realizing the VOI, but that’s not behaving incorrigibly.
On 4, that’s one reason but not the only reason you could robustly generalize.
On 5 I don’t understand what you mean or why that might be true.
I don’t really understand what you mean by black swans (or the direct relevance to corrigibility).
On 2, if there are morally important decisions you don’t recognize as morally important (e.g. massive mindcrime), you might destroy value by making the wrong decision and not realizing the VOI, but that’s not behaving incorrigibly.
Do you consider this a violation of alignment? If not, what word would you use? If yes, do you have a word for it that’s more specific than “alignment”?
Also, I have a concern similar to zhukeepa’s 6, which is that you seem to be depending on the AI being able to learn to model the user at runtime, starting from a “brain” that’s very different from a human’s (and lacks most of the built-in information and procedure that a human would use to model another human), and this (even if it could be done safely in theory) seems to require superhuman speed or intelligence. Before it can do that, the AI, even if corrigible, is either dangerous or not generally useful, which implies that when we achieve just human-level AGI, your alignment approach won’t work or won’t be safe yet. Does this argument seem correct to you?
I use “AI alignment” to refer to the problem of “building an AI that is trying to do what you want it to do” and especially which isn’t trying to take your resources or disempower you.
I allow the possibility that an aligned AI could make mistakes, including mistakes that a philosophically sophisticated human wouldn’t make. I call those “mistakes” or “catastrophic mistakes” or usually some more specific term describing the kind of mistake (in this case a moral error, which humans as well as AIs could make). I don’t have a particular word for the problem of differentially advancing AI so that it doesn’t make catastrophic mistakes.
I would include this family of problems, of designing an AI which is competent enough to avoid some particular class of mistakes, under the heading “AI safety.”
Before it can do that, the AI, even if corrigible, is either dangerous or not generally useful
If by “dangerous” you mean “unacceptably dangerous” then I don’t believe this step of the argument.
I do agree that my approach won’t produce a perfectly safe AGI. But that claim seems quite weak: perfect safety would require (amongst other things) a perfect understanding of physics and of all potentially relevant moral facts, to avoid a catastrophic misstep.
Presumably you are making some stronger claim, perhaps a quantitative claim about the degree of safety, or else a comparison to some other possible technique which might yield greater safety.
I use “AI alignment” to refer to the problem of “building an AI that is trying to do what you want it to do” and especially which isn’t trying to take your resources or disempower you.
I want to note that this is ambiguous and apparently could apply or not apply to the particular thing I was asking about depending on one’s interpretation. If I didn’t know your interpretation, my first thought would be that an AI that commits mindcrimes because it didn’t correctly model me (and not realizing the VOI) is trying to do something that I don’t want it to do. Your definition of “alignment” as “AI that is trying to do what you want it to do” makes sense to me but your interpretation of “AI that is trying to do what you want it to do” is not intuitive to me so I have to remember that when I’m talking with you or reading your writings.
EDIT: Also, I can’t tell the difference between what you mean by “alignment” and what you mean by “corrigibility”. (I had thought that perhaps in this mindcrime example you’d call the AI corrigible but not aligned, but apparently that’s not the case.) Are you using the two terms interchangeably? If not can you explain the difference?
Presumably you are making some stronger claim, perhaps a quantitative claim about the degree of safety, or else a comparison to some other possible technique which might yield greater safety.
I mean if an AI does not have the intellectual capacity to model the user nearly as well as a typical human would, then it’s bound to either refuse to handle any requests except those that don’t require modeling the user well, or make a lot more mistakes while trying to help the user than a human assistant would. In other words by “dangerous” I meant substantially more dangerous than a typical human assistant. Does my argument make more sense now?
If I didn’t know your interpretation, my first thought would be that an AI that commits mindcrimes because it didn’t correctly model me (and not realizing the VOI) is trying to do something that I don’t want it to do.
Ah, I agree this is ambiguous, I’m using a de dicto rather than de re interpretation of “trying to do what I want it to do.” It would be great to have a clearer way to express this.
I can’t tell the difference between what you mean by “alignment” and what you mean by “corrigibility”
Suppose that I give an indirect definition of “my long-term values” and then build an AI that effectively optimizes those values. Such an AI would likely disempower me in the short term, in order to expand faster, improve my safety, and so on. It would be “aligned” but not “corrigible.”
Similarly, if I were to train an AI to imitate a human who was simply attempting to get what they want, then that AI wouldn’t be corrigible. It may or may not be aligned, depending on how well the learning works.
In general, my intuition is that corrigibility implies alignment but not the other way around.
In other words by “dangerous” I meant substantially more dangerous than a typical human assistant. Does my argument make more sense now?
I don’t expect that such an AI would necessarily be substantially more dangerous than a typical human assistant. It might be, but there are factors pushing in both directions. In particular, “modeling the user well” seems like just one of many properties that affects how dangerous an assistant is.
On top of that, it’s not clear to me that such an AI would be worse at modeling other humans, at the point when it was human level. I think this will mostly be determined by the capacity of the model being trained, and how it uses this capacity (e.g. whether it is being asked to make large numbers of predictions about humans, or about physical systems), rather than by features of the early stages of the amplification training procedure.
Ah, I agree this is ambiguous, I’m using a de dicto rather than de re interpretation of “trying to do what I want it to do.” It would be great to have a clearer way to express this.
That clarifies things a bit, but I’m not sure how to draw a line between what counts as aligned de dicto and what doesn’t, or how to quantify it. Suppose I design an AI that uses a hand-coded algorithm to infer what the user wants and to optimize for that, and it generally works well but fails to infer that I disvalue mindcrimes. (For people who might be following this but not know what “mindcrimes” are, see section 3 of this post.) This seems analogous to IDA failing to infer that the user disvalues mindcrimes, so you’d count it as aligned? But there’s a great (multi-dimensional) range of possible errors, and it seems like there must be some types or severities of value-learning errors where you’d no longer consider the AI to be “trying to do what I want it to do”, but I don’t know what those are.
Can you propose a more formal definition, maybe something along the lines of “If in the limit of infinite computing power, this AI would achieve X% of the maximum physically feasible value of the universe, then we can call it X% Aligned”?
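Roughly, and only as a sketch (writing V for some fixed measure of value over outcomes and V_max for the maximum physically feasible value, both of which are stand-ins):

```latex
\[
  \text{$X\%$ Aligned}
  \;\iff\;
  \lim_{C \to \infty} \frac{V\big(\text{outcome of running the AI with compute budget } C\big)}{V_{\max}}
  \;\ge\; \frac{X}{100}
\]
```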
Not sure how motivated you are to continue this line of discussion, so I’ll mention that uncertainty/confusion about a concept/term as central as “alignment” seems really bad. For example if you say “I think my approach can achieve AI alignment” and you mean one thing but the reader thinks you mean another, that might lead to serious policy errors. Similarly if you hold a contest on “AI alignment” and a participant misinterprets what you mean and submits something that doesn’t qualify as being on topic, that’s likely to cause no small amount of frustration.
I don’t have a more formal definition. Do you think that you or someone else has a useful formal definition we could use? I would be happy to adopt a more formal definition if it doesn’t have serious problems.
Or: are there some kinds of statements that you think shouldn’t be made without more precise definitions? Is there an alternative way to describe a vague area of research that I’m interested in, that isn’t subject to the same criticism? Do you think I typically use “alignment” in a way that’s unnecessarily problematic in light of the likely misunderstanding? I don’t see this issue as nearly as important as you do, but am happy to make low-cost adjustments.
But there’s a great (multi-dimensional) range of possible errors, and it seems like there must be some types or severities of value-learning errors where you’d no longer consider the AI to be “trying to do what I want it to do”, but I don’t know what those are.
Here’s how I see it:
We almost certainly won’t build AI which knows all potentially relevant facts about our preferences (or about the world, or about logical facts) and therefore never makes a morally relevant mistake.
Anyone who describes “aligned AGI” or “safe AI” or “FAI” is therefore talking about some milder definition than this, e.g. involving making reasonable tradeoffs between VOI and the cost of eliciting preferences, between the risk of catastrophe and the costs of inaction, and so on.
No one has yet offered a convincing milder definition, and there may be no binary definition of “success” vs. “failure.” My milder definition is clearly imprecise, like all of the other implicit definitions people use.
Is this different from your view of the situation?
maybe something along the lines of “If in the limit of infinite computing power, this AI would achieve X% of the maximum physically feasible value of the universe, then we can call it X% Aligned”?
I don’t think this is a likely way to get a good definition of alignment (“good” in the sense of either being useful or of tracking how the term is typically used).
Given competitive pressures, lots of things that are obviously not AI alignment affect how much of the universe’s value you realize (for example, do you accidentally blow up the world while doing physics). Conversely, given no competitive pressure, your AI would not need to do anything risky, either concerning its own cognition or concerning physics experiments. It’s not clear whether we’ll realize 100% of the realizable value, but again the difficulty seems completely unrelated to AI and instead related to the probable course of human deliberation.
So this is basically just equivalent to eliminating competitive pressure as safely as possible in the limit of infinite computing power, i.e. it’s evaluating how well a proposed AI design solves a particular unrealistic problem. I think it would be likely to be solved by techniques like “learn high-fidelity brain emulations and run them really fast,” which seem quite different from promising approaches to alignment.
So this is basically just equivalent to eliminating competitive pressure as safely as possible in the limit of infinite computing power, i.e. it’s evaluating how well a proposed AI design solves a particular unrealistic problem. I think it would be likely to be solved by techniques like “learn high-fidelity brain emulations and run them really fast,” which seem quite different from promising approaches to alignment.
I was trying to capture the meaning of your informal definition, so I don’t understand why “learn high-fidelity brain emulations and run them really fast” being considered aligned according to my definition is a problem, when it also seems to fit your definition of “trying to do what I want it to do”. Are you saying that kind of AI doesn’t fit your definition? Or that “promising approaches to alignment” would score substantially worse than “learn high-fidelity brain emulations and run them really fast” according to my definition (i.e., achieve much less value when given infinite computing power)?
Also, I don’t see it as a problem if “aligned” ignores competition and computational limitations, since once we agree on what alignment means in the absence of these concerns we can then coin “competitively aligned” or “feasibly aligned” or what-have-you and try to define them. But mainly I don’t understand why you’re objecting when your own definition ignores these issues.
Here is a clarification of my previous comment, which I believe was based on a misunderstanding:
I don’t like the definition “an AGI is aligned if running it leads to good long-term outcomes” as a way of carving out a set of research problems or a research goal, because “AI alignment” then includes basically all x-risk relevant research. For example, it would include understanding physics relevant to possible high-energy physics catastrophes, and then making sure we give that information to our AGI so that it doesn’t inadvertently cause a physics catastrophe.
When I use “AI alignment,” I don’t want to include differential progress in fundamental physics that could help avoid catastrophes.
Your definition in the parent only requires good behavior in the limit of infinite computation, which I assumed was a way to make these other problems easy, and thereby exclude them from the definition. For example, if we have infinite computation, our AI can then do exhaustive Bayesian inference about possible theories of physics in order to make optimal decisions. And therefore progress in physics wouldn’t be relevant to AI alignment.
But I don’t think this trick works for separating out AI alignment problems in particular, because giving your AI infinite computation (while not giving competitors infinite computation) also eliminates most of the difficulties that we do want to think of as AI alignment.
Here is what I now believe you are/were saying:
Let’s define “aligned” to mean something like “would yield good outcomes if run with infinite computation.” Then we can describe our research goal in terms of “alignment” as something like “We want a version of technique X that has the same advantages as X but produces aligned agents.”
I don’t think this is helpful either, because this “alignment” definition only tells us something about the behavior of our agent when we run it with infinite computation, and nothing about what happens when we run it in the real world. For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
Saying what “aligned” means in the limit of infinite computation may be a useful step towards giving a definition in the realistic case of finite computation (though I don’t see how to make progress along those lines). I would be inclined to give that concept some name like “asymptotically aligned” and then use “aligned” interchangeably with “actually aligned, as implemented in the real world.”
I also think defining asymptotic alignment is non-trivial. I’d try something like: “when run with infinite computation and perfect information about the operator, including the operator’s knowledge about the world, the system outputs optimal decisions according to the operator’s {preferences}” where {preferences} is a stand-in for some as-yet-undefined concept that includes the operator’s enlightened preferences, beliefs, decision theory, etc.
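Very roughly, and keeping {preferences} as an explicit placeholder, one could try to write this as the following sketch (where D is the set of decision problems, A(d) the actions available on problem d, K the operator’s knowledge, S_∞ the system run with unlimited computation, and U_{preferences} the undefined utility stand-in):

```latex
\[
  \forall d \in \mathcal{D}: \quad
  S_{\infty}(d) \;\in\; \arg\max_{a \in \mathcal{A}(d)}
  \; \mathbb{E}\big[\, U_{\{\text{preferences}\}}(a) \;\big|\; K \,\big]
\]
```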
Let me know if I am still misunderstanding you.
As a meta note: My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress). It might be more useful to anchor this discussion to some particular significant problems arising from our definitional unclarity, if you think that it’s an important enough issue to be worth spending time on.
My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress).
(In addition to the other reasons I gave for prioritizing clarity of definitions/explanations) I’d like to help contribute to making forward progress on these things (despite not being as optimistic as you), but it’s hard to do that without first understanding your existing ideas and intuitions, and that’s hard to do while being confused about what your words mean. I think this probably also applies to others who would like to contribute to this research.
For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
In my comment that started this sub-thread, I asked “Do you consider this [your mindcrime example] a violation of alignment?” You didn’t give a direct yes or no answer, but I thought it was clear from what you wrote that the answer is “no” (and therefore you consider these kinds of difficulties to be irrelevant according to your own definition of alignment), which is why I proposed the particular formalization that I did. I thought you were saying that these kinds of difficulties are not relevant to “alignment” but are relevant to “safety”. Did I misunderstand your answer, or perhaps you misunderstood my question, or something else?
I don’t think {not noticing that mindcrime is a problem} is a violation of alignment: the AI is trying to do what you want but makes a moral error.
I do think {if the AI is too weak, it secretly plots to kill everyone} is a violation of alignment: the AI isn’t trying to do what you want. It knows that you don’t want it to kill everyone, that’s why it’s trying to keep it secret.
(It’s technically possible for an AI to kill everyone, and even to secretly kill everyone, because it is trying to do what you want but makes a mistake. This seems like an inevitable feature of any sensible definition of alignment. I expect to now have an involved discussion about what the difference is.)
(Wei Dai and I discussed my definition of alignment offline, leading to this post which hopefully clarifies things a little bit in addition to summarizing the takeaways from this thread.)
Or: are there some kinds of statements that you think shouldn’t be made without more precise definitions? Is there an alternative way to describe a vague area of research that I’m interested in, that isn’t subject to the same criticism? Do you think I typically use “alignment” in a way that’s unnecessarily problematic in light of the likely misunderstanding? I don’t see this issue as nearly as important as you do, but am happy to make low-cost adjustments.
The main problem from my perspective is that MIRI is using “alignment” in a very different way, to refer to a larger field of study that includes what you call “safety” and even “how rapidly an AI might gain in capability”. I think if you had a formal technical definition that you want to overload the term with, that would be fine if it’s clear (from context or explicit explanation) when you’re referring to the technical term. But since you only have a vague/ambiguous informal definition, a lot of people, if they were introduced to the term via MIRI’s writings, will easily round off your definition to theirs and fail to notice that you’re talking about something much narrower. This is even worse when you refer to “alignment” without giving any definition as in most of your writings.
The upshot here is that when you say something like “Many people endorse this or a similar vision as their current favored approach to alignment” a lot of people will interpret that as meaning your approach is supposed to solve many more problems than what you have in mind.
Given this, I think unless you can come up with a formal technical definition, you should avoid using “alignment” and pick a less overloaded term, or maybe put disclaimers everywhere. It occurs to me that it might feel unfair to you that I’m suggesting that you change your wording or add disclaimers, instead of MIRI. This is because I have the impression that more people were introduced to the term “AI alignment” through MIRI’s writings than yours, and therefore more people already have their definition in mind. (For example Eliezer just explained his version of “alignment” in his podcast with Sam Harris, who I understand to have a pretty large audience.) If that’s not the case then I’d make the suggestion to MIRI instead.
Even if you do use another term, people are still liable to round that off to the nearest concept that they’re familiar with, which would likely be MIRI’s “AI alignment”, or interpret “trying to do what we want them to do” in the de re sense, or get confused in some other way. So you probably need to write a post explaining your concept as clearly as you can and how it differs from nearby concepts, and then link to it every time you use the new term at least until most people become familiar with it.
I had previously described this problem as the “control problem” and called my blog “AI control,” following Nick Bostrom’s usage. Several people had expressed dissatisfaction with the term “control problem,” which I sympathized with (see this comment by Rob Bensinger from MIRI).
I adopted the term “AI alignment” after an email thread started by Rob about a year ago with a dozen people who frequently used the term, which was centered around the suggestion:
I think we should use the term “AI alignment” (and “alignment”, where AI is assumed) as a synonym for Bostrom’s “control problem,” since this is already more or less how the terms are most commonly used.
He later clarified that he actually meant what Bostrom calls “the second principal agent problem,” the principal agent problem between humans and AI rather than amongst humans, which was how I was using “control problem” and what I feel is the most useful concept.
I don’t have strong feelings about terminology, and so went with the consensus of others on the thread, and have been using “alignment” instead of control since then.
I agree that the usage by Eliezer in that Arbital post is much broader. I think it’s a much less useful concept than Nick’s control problem. Is it used by Eliezer or MIRI researchers in other places? Is it used by other people?
(Note that “aligned” and “the alignment problem” could potentially have separate definitions, which is in part responsible for our confusion in the other thread).
My best guess is that “alignment” should continue to be used for this narrower problem rather than the entire problem of making AI good. I’m certainly open to the possibility that alignment is being frequently misunderstood and should be explained + linked, and that is reasonably cheap (though I’d prefer get some evidence about that, you are the main person I talk to who seems to endorse the very broad reading).
(Note that the question “how fast will AI gain in capability” is also a relevant subproblem to the narrower use of “alignment,” since knowing more about AI development makes it easier to solve the alignment problem.)
Unfortunately most people don’t bother to define “alignment” when they use it, or do so very vaguely. But aside from Eliezer, I found a couple of more places that seem to define it more broadly than you here. LCFI:
The Value Alignment Project seeks to design methods for preventing AI systems from inadvertently acting in ways inimical to human values.
I define “AI alignment” these days roughly the way the Open Philanthropy Project does:
the problem of creating AI systems that will reliably do what their users want them to do even when AI systems become much more capable than their users across a broad range of tasks
More specifically, I think of the alignment problem as “find a way to use AGI systems to do at least some ambitious, high-impact things, without inadvertently causing anything terrible to happen relative to the operator’s explicit and implicit preferences”.
This is an easier goal than “find a way to safely use AGI systems to do everything the operator could possibly want” or “find a way to use AGI systems to do everything everyone could possibly want, in a way that somehow ‘correctly’ aggregates preferences”; I sometimes see problem statements like those referred to as the “full” alignment problem.
It’s a harder goal than “find a way to get AGI systems to do roughly what the operators have in mind, without necessarily accounting for failure modes the operators didn’t think of”. Following the letter of the law rather than the spirit is only OK insofar as the difference between letter and spirit is non-catastrophic relative to the operators’ true implicit preferences.
If developers and operators can’t foresee every potential failure mode, alignment should still mean that the system fails gracefully. If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe. This does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee.
This way of thinking about the alignment problem seems more useful to me because it factors out questions related to value disagreements and coordination between humans (including Bostrom’s first principal-agent problem), but leaves “aligned” contentful enough that it does actually mean we’re keeping our eye on the ball. We’re not ignoring how catastrophic-accident-prone the system actually is just because the developer was being dumb.
(I guess you’d want a stronger definition if you thought it was realistic that AGI developers might earnestly in their heart-of-hearts just want to destroy the world, since that case does make the alignment problem too trivial.
I’m similarly assuming that there won’t be a deep and irreconcilable values disagreement among stakeholders about whether we should conservatively avoid high risk of mindcrime, though there may be factual disagreements aplenty, and perhaps there are irreconcilable casewise disagreements about where to draw certain normative category boundaries once you move past “just be conservative and leave a wide berth around anything remotely mindcrime-like” and start trying to implement “full alignment” that can spit out the normatively right answer to every important question.)
I wrote a post attempting to clarify my definition. I’d be curious about whether you agree.
If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe.
Speaking to the discussion Wei Dai and I just had, I’m curious about whether you would consider any or all of these cases to be alignment failures:
There is an opportunity to engage in acausal trade that will disappear once your AI becomes too powerful, and the AI fails to take that opportunity before becoming too powerful.
Your AI doesn’t figure out how to do a reasonable “values handshake” with a competitor (where two agents agree to both pursue some appropriate compromise values in order to be Pareto efficient), conservatively avoids such handshakes, and then gets outcompeted because of the resulting inefficiency.
Your AI has well-calibrated normative uncertainty about how to do such handshakes, but decides that the competitive pressure to engage in them is strong enough to justify the risk, and makes a binding agreement that we would eventually recognize as suboptimal.
In fact our values imply that it’s a moral imperative to develop as fast as possible, your AI fails to notice this counterintuitive argument, and therefore develops too slowly and leaves 50% of the value of the universe on the table.
Your AI fails to understand consciousness (like us), has well-calibrated moral uncertainty about the topic, but responds to competitive pressure by taking a risk and running some simulations that we would ultimately regard as experiencing enough morally relevant suffering to be called a catastrophe.
Your AI faces a moral decision about how much to fight for your values, and it decides to accept a risk of extinction that on reflection you’d consider unacceptably high.
Someone credibly threatens to blow up the world if your AI doesn’t give them stuff, and your AI capitulates even though on reflection we’d regard this as a mistake.
I’m not sure whether your definition is intended to include these. The sentence “this does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee” does suggest that interpretation, but it also sounds like you maybe aren’t explicitly thinking about problems of this kind or are assuming that they are unimportant.
I wouldn’t consider any of these “alignment problems.” These are distinct problems that we’ll face whether or not we build an AI. Whether they are important is mostly unrelated to the usual arguments for caring about AI alignment, and the techniques that we will use to solve them are probably unrelated to the techniques we will use to build an AI that won’t kill us outright. (Many of these problems are likely to be solved by an AI, just like P != NP is likely to be proved by an AI, but that doesn’t make either of them an alignment problem.)
If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
(I do agree that building an AI which took control of the world away from us but then was never able to resolve these problems would probably be a failure of alignment.)
I really like that list of points! Not that I’m Rob, but I’d mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I’d expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I’d be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about “black swans” popping up. And when I said:
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn’t mean “the AI is incorrigible if it’s not high-impact calibrated”, I meant “the AI, even if corrigible, would be unsafe if it’s not high-impact calibrated”.)
If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
I think I understand your position much better now. The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”, and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it’s not good enough at figuring out what is right, even if it’s corrigible.
The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...
for it to be safe it should consult you before going ahead with any one of these
OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise…
By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study including understanding humans, historical knowledge, physics expertise). (I wish he would be more explicit about that to avoid confusion.)
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others.
Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, which areas do you expect it to be smarter than human, which dumber than human? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?
I’m curious to hear more about your thoughts about (4).
To flesh out my intuitions around (4) and (5): I think there are many tasks where a high-dimensional and difficult to articulate piece of knowledge is critical for completing the task. For example:
if you’re Larry or Sergey trying to hire a new CEO, you need your new CEO to be a culture fit. Which in this case means something like “being technical, brilliant, and also a hippie at heart”. It’s really, really hard to communicate this to a slick MBA. Especially the “be a hippie at heart” part. Maybe if you sent them to Burning Man and had them take a few drugs, they’d grok it?
if you’re Bill Gates hiring a new CEO, you should make sure your new CEO is also a developer at heart, not a salesman. Otherwise, you might hire Steve Ballmer, who drove Microsoft’s revenues through the roof for a few years, but also had little understanding of developers (for example he produced an event where he celebrated developers in a way developers don’t tend to like being celebrated). This led to an overall trend of the company losing its technical edge, and thus its competitive edge… this was all while Ballmer had worked with Gates at Microsoft for two decades. If Ballmer had been a developer, he might have been able to avoid this, but he very much wasn’t.
if you’re a self-driving car engineer delegating image classification to a modern-day neural net, you’d really want its understanding of what the classifications mean to match yours, lest they be susceptible to clever adversarial attacks. Humans understand the images to represent projections of crisp three-dimensional objects that exist in a physical world; image classifiers don’t, which is why they can get fooled so easily by overlays of random patterns. Maybe it’s possible to replicate this understanding without being an embodied agent, but it seems you’d need something beyond training a big neural net on a large collection of images, and making incremental fixes.
if you’re a startup trying to build a product, it’s very hard to do so correctly if you don’t have a detailed implicit model of your users’ workflows and pain points. It helps a lot to talk to them, but even then, you may only be getting 10% of the picture if you don’t know what it’s like to be them. Most startups die by not having this picture, flying blind, and failing to acquire any users.
if you’re trying to help your extremely awkward and non-neurotypical friend find a romantic partner, you might find it difficult to convey what exactly is so bad about carrying around slips of paper with clever replies, and pulling them out and reading from them when your date says something you don’t have a reply to. (It’s not that hard to convey why doing this particular thing is bad. It’s hard to convey what exactly about it is bad, that would have him properly generalize and avoid all classes of mistakes like this going forward, rather than just going like “Oh, pulling out slips of paper is jarring and might make her feel bad, so I’ll stop doing this particular thing”.) (No, I did not make up this example.)
In these sorts of situations, I wouldn’t trust an AI to capture my knowledge/understanding. It’s often tacit and perceptual, it’s often acquired through being a human making direct contact with reality, and it might require a human cognitive architecture to even comprehend in the first place. (Hence my claims that proper generalization requires having the same ontologies as the overseer, which they obtained from their particular methods of solving a problem.)
In general, I feel really sketched about amplifying oversight, if the mechanism involves filtering your judgment through a bunch of well-intentioned non-neurotypical assistants, since I’d expect the tacit understandings that go into your judgment to get significantly distorted. (Hence my curiosity about whether you think we can avoid the judgment getting significantly distorted, and/or whether you think we can do fine even with significantly distorted judgment.)
It’s also pretty plausible that I’m talking completely past you here; please let me know if this is the case.
Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I agree that “AI systems are likely to generalize differently from humans.” I strongly believe we shouldn’t rest AI alignment on detailed claims about how an AI will generalize to a new distribution. (Though I do think we can hope to avoid errors of commission on a new distribution.)
Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I think my present view is something like a conjunction of:
1. An AI needs to learn human representations in order to generalize like a human does.
2. For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
3. For a very broad range of narrow tasks, the AI does not need to generalize like a human does in order to be safe (or, it’s easy for it to generalize like a human). Go is in this category, ZFC theorem-provers are probably in this category, and I can imagine a large swath of engineering automation also falls into this category.
4. To the extent that “general and open-ended tasks” can be broken down into narrow tasks that don’t require human generalization, they don’t require human generalization to learn safely.
My current understanding is that we agree on (3) and (4), and that you either think that (2) is false, or that it’s true but the bar for “sufficiently general and open-ended” is really high, and tasks like achieving global stability can be safely broken down into safe narrow tasks. Does this sound right to you?
I’m confused about your thoughts on (1).
(I’m currently rereading your blog posts to get a better sense of your models of how you think broad and general tasks can get broken down into narrow ones.)
For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are also mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not a especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind subtle benign mistake you expect to be catastrophic?
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but can seem to be plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted in terms of accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taking, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, ie. the experiments that Ought is running.).
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand)
I guess this is in part because that’s how Paul initially described his approach, before coming up with Security Amplification in October 2016. For example in March 2016 I wrote “First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset. Let me know if this is wrong.” and Paul didn’t object to this in his reply.
An additional issue is that even people who intellectually understand the new model might still have intuitions left over from the old one. For example I’m just now realizing that the low-amplification agents in the new scheme must have thought processes and “deliberations” that are very alien, since they don’t have human priors, natural language understanding, values, common sense judgment, etc. I wish Paul had written a post in big letters that said, “WARNING: Throw out all your old intuitions!”
I don’t have optimism about finding a core which is already highly competent at these tasks.
I’m a little confused about what this statement means. I thought that if you have an overseer that implements some reasoning core, and consider amplify(overseer) with infinite computation time and unlimited ability to query the world (e.g., for background information on what humans seem to want, how they behave, etc.), then amplify(overseer) should be able to solve any problem that an agent produced by iterating IDA could solve.
Did you mean to say that:
1. “already highly competent at these tasks” means that the core should be able to solve these problems without querying the world at all, and this is not likely to be possible?
2. you don’t expect to find a core such that only one round of amplification of amplify(overseer) can solve practical tasks in any reasonable amount of time/number of queries?
3. there is some other way that the agent produced by IDA would be more competent than the original amplified overseer?
I mean that the core itself, as a policy, won’t be able to solve these problems. It also won’t be able to solve them after a small number of amplification steps. And probably it will have to query the world.
What is the difference between “core after a small number of amplification steps” and “core after a large number of amplification steps” that isn’t captured in “larger effective computing power” or “larger set of information about the world”, and allows the highly amplified core to solve these problems?
I didn’t mean to suggest there is a difference other than giving it more computation and more data.
I was imagining Amplify(X) as a procedure that calls X a bounded number of times, so that you need to iterate Amplify in order to have arbitrarily large runtimes, while I think you were imagining a parameterized operation Amplify(X, n) that takes n time and so can be scaled up directly. Your usage also seems fine.
Even if that’s not the difference, I strongly expect we are on the same page here about everything other than words. I’ve definitely updated some about the difficulty of words.
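To make the two usages concrete, here is a minimal, purely illustrative sketch (the names, the trivial “decomposition,” and the budget are my own placeholders, not Paul’s actual scheme): a bounded Amplify(X) that must be iterated to get arbitrarily large runtimes, versus a parameterized Amplify(X, n) that scales up directly. As the last function suggests, the two usages are essentially interconvertible, which fits the point that the only differences are more computation and more data.

```python
from typing import Callable

Agent = Callable[[str], str]  # an agent here is just a question-answering policy

def base_agent(question: str) -> str:
    # Stand-in for the unamplified core: a very weak policy.
    return f"best guess about: {question}"

def amplify(agent: Agent, budget: int = 3) -> Agent:
    """Amplify(X): build a stronger agent that consults X a bounded number of times."""
    def amplified(question: str) -> str:
        # Trivial "decomposition" and "combination", just to show the call structure.
        sub_answers = [agent(f"{question} / subquestion {i}") for i in range(budget)]
        return f"combination of {len(sub_answers)} sub-answers to: {question}"
    return amplified

def iterated_amplification(agent: Agent, rounds: int) -> Agent:
    """Arbitrarily large effective runtime comes from iterating the bounded operator."""
    for _ in range(rounds):
        agent = amplify(agent)
    return agent

def amplify_n(agent: Agent, n: int) -> Agent:
    """Amplify(X, n): a single parameterized operator that can be scaled up directly."""
    return iterated_amplification(agent, rounds=n)

if __name__ == "__main__":
    strong = iterated_amplification(base_agent, rounds=4)  # 3^4 calls to the base agent
    print(strong("how should this resource be allocated?"))
```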
If your definition of “corrigible” does not include things like the ability to model the user and detect ambiguities as well as a typical human, then I don’t currently have a strong intuition about this. Is your view/hope then that starting with such a core, if we amplify it enough, eventually it will figure out how to safely learn (or deduce from first principles, or something else) how to understand natural language, model the user, detect ambiguities, balance between the user’s various concerns, and so on? (If not, it would be stuck with either refusing to doing anything except literal-minded mechanical tasks that don’t require such abilities, or frequently making mistakes of the type “hack a bank when I ask it to make money”, which I don’t think is what most people have in mind when they think of “aligned AGI”.)
Yes. My hope is to learn or construct a core which:
Doesn’t do incorrigible optimization as it is amplified.
Increases in competence as it is amplified, including competence at tasks like “model the user,” “detect ambiguities” or “make reasonable tradeoffs about VOI vs. safety” (including info about the user’s preferences, and “safety” about the risk of value drift). I don’t have optimism about finding a core which is already highly competent at these tasks.
I grant that even given such a core, we will still be left with important and unsolved x-risk relevant questions like “Can we avoid value drift over the process of deliberation?”
It appears that I seriously misunderstood what you mean by corrigibility when I wrote this post. But in my defense, in your corrigibility post you wrote, “We say an agent is corrigible (article on Arbital) if it has these properties.” and the list includes helping you “Make better decisions and clarify my preferences” and “Acquire resources and remain in effective control of them” and to me these seem to require at least near human level ability to model the user and detect ambiguities. And others seem to have gotten the same impression from you. Did your conception of corrigibility change at some point, or did I just misunderstand what you wrote there?
Since this post probably gave even more people the wrong impression, I should perhaps write a correction, but I’m not sure how. How should I fill in this blank? “The way I interpreted Paul’s notion of corrigibility in this post is wrong. It actually means ___.”
Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I’m imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it’s not something that can be done via a project like Ought?)
I think your post is (a) a reasonable response to corrigibility as outlined in my public writing, (b) a reasonable but not decisive objection to my current best guess about how amplification could work. In particular, I don’t think anything you’ve written is too badly misleading.
In the corrigibility post, when I said “AI systems which help me do X” I meant something like “AI systems which help me do X to the best of their abilities,” rather than having in mind some particular threshold for helpfulness at which an AI is declared corrigible (similarly, I’d say an AI is aligned if it’s helping me achieve my goals to the best of its abilities, rather than fixing a certain level of helpfulness at which I’d call it aligned). I think that post was unclear, and my thinking has become a lot sharper since then, but the whole situation is still pretty muddy.
Even that’s not exactly right, and I don’t have a simple definition. I do have a lot of intuitions about why there might be a precise definition, but those are even harder to pin down.
(I’m generally conflicted about how much to try to communicate publicly about early stages of my thinking, given how frequently it changes and how fuzzy the relevant concepts are. I’ve decided to opt for a medium level of communication, since it seems like the potential benefits are pretty large. I’m sorry that this causes a lot of trouble though, and in this case I probably should have been more careful about muddying notation. I also recognize it means people are aiming at a moving target when they try to engage; I certainly don’t fault people for that, and I hope it doesn’t make it too much harder to get engagement with more precise versions of similar ideas in the future.)
What uncertainty in particular?
Things I hope to see before we have very powerful AI:
Clearer conceptual understanding of corrigibility.
Significant progress towards a core for metaexecution (either an explicit core, or an implicit representation as a particular person’s policy), which we can start to investigate empirically.
Amplification experiments which show clearly how complex tasks can be broken into simpler pieces, and let us talk much more concretely about what those decompositions look like and in what ways they might introduce incorrigible optimization. These will also directly resolve logical uncertainty about whether proposed decomposition techniques actually work.
Application of amplification to some core challenges for alignment, most likely either (a) producing competitive interpretable world models, or (b) improving reliability, which will make it especially easy to discuss whether amplification can safely help with these particular problems.
If my overall approach is successful, I don’t feel like there are significant uncertainties that we won’t be able to resolve until we have powerful AI. (I do think there is a significant risk that I will become very pessimistic about the “pure” version of the approach, and that it will be very difficult to resolve uncertainties about the “messy” version of the approach in advance because it is hard to predict whether the difficulties for the pure version are really going to be serious problems in practice.)
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand. From a selfish perspective I wish you’d spend more time writing down more details and trying harder to model your readers and preempt ambiguities and potential misunderstandings, but of course the tradeoffs probably look different from your perspective. (I also want to complain (again?) that Medium.com doesn’t show discussion threads in a nice tree structure, and doesn’t let you read a comment without clicking to expand it, so it’s hard to see what questions other people asked and how you answered. Ugh, talk about trivial inconveniences.)
How much can the iterated amplification of an impoverished overseer safely learn about how to help humans (how to understand natural language, build models of users, detect ambiguity, be generally competent)? Is it enough to attract users and to help them keep most of their share of the cosmic endowment against competition with malign AIs?
I thought more about my own uncertainty about corrigibility, and I’ve fleshed out some intuitions on it. I’m intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.
Suppose we have an agent A optimizing for some values V. I’ll call an AI system S high-impact calibrated with respect to A if, when A would consider an action “high-impact” with respect to V, S will correctly classify it as high-impact with probability at least 1-ɛ, for some small ɛ.
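Read operationally, the definition amounts to requiring high recall on the actions A would call high-impact. Here is a toy sketch of that check (the sampling procedure and names are mine, not part of the original comment):

```python
from typing import Callable, Iterable

def is_high_impact_calibrated(
    classifier: Callable[[object], bool],  # S's judgment: is this action high-impact?
    judge: Callable[[object], bool],       # A's judgment with respect to its values V (ground truth)
    actions: Iterable[object],             # a sample of candidate actions
    epsilon: float = 0.01,
) -> bool:
    """Check that S flags at least a 1 - epsilon fraction of the actions A calls high-impact."""
    high_impact = [a for a in actions if judge(a)]
    if not high_impact:
        return True  # nothing in this sample that A considers high-impact
    caught = sum(1 for a in high_impact if classifier(a))
    return caught / len(high_impact) >= 1 - epsilon
```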
My intuitions about corrigibility are as follows:
1. If you’re not calibrated about high-impact, catastrophic errors can occur. (These are basically black swans, and black swans can be extremely bad.)
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
3. To learn how to be high-impact calibrated w.r.t. A, you will have to generalize properly from training examples of low/high-impact (i.e. be robust to distributional shift).
4. To robustly generalize, you’re going to need the ontologies / internal representations that A is using. (In slightly weirder terms, you’re going to have to share A’s tastes/aesthetic.)
5. You will not be able to learn those ontologies unless you know how to optimize for V the way A is optimizing for V. (This is the core thing missing from the well-intentioned extremely non-neurotypical assistant I illustrated.)
6. If S’s “brain” starts out very differently from A’s “brain”, S will not be able to model A’s representations unless S is significantly smarter than A.
In light of this, for any agent A, some value V they’re optimizing for, and some system S that’s assisting A, we can ask two important questions:
(I) How well can S learn A’s representations?
(II) If the representation is imperfect, how catastrophic might the resulting mistakes be?
In the case of a programmer (A) building a web app trying to make users happy (V), it’s plausible that some run-of-the-mill AI system (S) would learn a lot of the important representations right and a lot of the important representations wrong, but it also seems like none of the mistakes are particularly catastrophic (worst case, the programmer just reverts the codebase.)
In the case of a human (A) trying to make his company succeed (V), looking for a new CEO (S) to replace himself, it’s usually the case that the new CEO doesn’t have the same internal representations as the founder. If they’re too different, the result is commonly catastrophic (e.g. if the new CEO is an MBA with “more business experience”, but with no vision and irreconcilable taste). Some examples:
For those who’ve watched HBO’s Silicon Valley, Action Jack Barker epitomizes this.
When Sequoia Capital asked Larry and Sergey to find a new CEO for Google, they hemmed and hawed until they found one who had a CS Ph.D. and went to Burning Man, just like they did. (Fact-check me on this one?)
When Apple ousted Steve Jobs, the company tanked, and only after he was hired back as CEO did the company turn around and become the most valuable company in the world.
(It’s worth noting that if the MBA got hired as a “faux-CEO”, where the founder could veto any of the MBA’s proposals, the founders might make some use of him. But the way in which he’d be useful is that he’d effectively be hired for some non-CEO position. In this picture, the founders are still doing most of the cognitive work in running the company, while the MBA ends up relegated to being a “narrow tool intelligence utilized for boring business-y things”. It’s also worth noting that companies care significantly about culture fit when looking for people to fill even mundane MBA-like positions...)
In the case of a human (A) generically trying to optimize for his values (V), with an AGI trained to be corrigible (S) assisting, it seems quite unlikely that S would be able to learn A’s relevant internal representations (unless it’s far smarter and thus untrustworthy), which would lead to incorrect generalizations. My intuition is that if S is not much smarter than A, but helping in extremely general ways and given significant autonomy, the resulting outcome will be very bad. I definitely think this if S is a sovereign, but also think this if e.g. it’s doing a thousand years’ worth of human cognitive work in determining if a newly distilled agent is corrigible, which I think happens in ALBA. (Please correct me if I botched some details.)
Paul: Is your picture that the corrigible AI learns the relevant internal representations in lockstep with getting smarter, such that it manages to hit a “sweet spot” where it groks human values but isn’t vastly superintelligent? Or do you think it doesn’t learn the relevant internal representations, but its action space is limited enough that none of its plausible mistakes would be catastrophic? Or do you think one of my initial intuitions (1-6) is importantly wrong? Or do you think something else?
Two final thoughts:
The way I’ve been thinking about corrigibility, there is a simple core to corrigibility, but it only applies when the subagent can accurately predict any judgment you’d make of the world, and isn’t much more powerful than you. This is the case if e.g. the subagent starts as a clone of you, and is not the case if you’re training it from scratch (because it’ll either be too dumb to understand you, or too smart to be trustworthy). I’m currently chewing on some ideas for operationalizing this take on corrigibility using decision theory.
None of this analysis takes into account that human notions of “high-impact” are often wrong. Typical human reasoning processes are pretty susceptible to black swans, as history shows. (Daemons sprouting would be a subcase of this, where naive human judgments might judge massive algorithmic searches to be low-impact.)
I disagree with 2, 4, 5 and the conclusion, though it might depend on how you are defining terms.
On 2, if there are morally important decisions you don’t recognize as morally important (e.g. massive mindcrime), you might destroy value by making the wrong decision and not realizing the VOI, but that’s not behaving incorrigibly.
On 4, that’s one reason but not the only reason you could robustly generalize.
On 5 I don’t understand what you mean or why that might be true.
I don’t really understand what you mean by black swans (or the direct relevance to corrigibility).
Do you consider this a violation of alignment? If not, what word would you use? If yes, do you have a word for it that’s more specific than “alignment”?
Also, I have a concern similar to zhukeepa’s 6, which is that you seem to be depending on the AI being able to learn to model the user at runtime, starting from a “brain” that’s very different from a human’s (and lacks most of the built-in information and procedure that a human would use to model another human), and this (even if it could be done safely in theory) seems to require superhuman speed or intelligence. Before it can do that, the AI, even if corrigible, is either dangerous or not generally useful, which implies that when we achieve just human-level AGI, your alignment approach won’t work or won’t be safe yet. Does this argument seem correct to you?
I use “AI alignment” to refer to the problem of “building an AI that is trying to do what you want it to do” and especially which isn’t trying to take your resources or disempower you.
I allow the possibility that an aligned AI could make mistakes, including mistakes that a philosophically sophisticated human wouldn’t make. I call those “mistakes” or “catastrophic mistakes” or usually some more specific term describing the kind of mistake (in this case a moral error, which humans as well as AIs could make). I don’t have a particular word for the problem of differentially advancing AI so that it doesn’t make catastrophic mistakes.
I would include this family of problems, of designing an AI which is competent enough to avoid some particular class of mistakes, under the heading “AI safety.”
If by “dangerous” you mean “unacceptably dangerous” then I don’t believe this step of the argument.
I do agree that my approach won’t produce a perfectly safe AGI. But that claim seems quite weak: perfect safety would require (amongst other things) a perfect understanding of physics and of all potentially relevant moral facts, to avoid a catastrophic misstep.
Presumably you are making some stronger claim, perhaps a quantitative claim about the degree of safety, or else a comparison to some other possible technique which might yield greater safety.
I want to note that this is ambiguous and apparently could apply or not apply to the particular thing I was asking about, depending on one’s interpretation. If I didn’t know your interpretation, my first thought would be that an AI that commits mindcrimes because it didn’t correctly model me (and didn’t realize the VOI) is trying to do something that I don’t want it to do. Your definition of “alignment” as “AI that is trying to do what you want it to do” makes sense to me, but your interpretation of “AI that is trying to do what you want it to do” is not intuitive to me, so I have to remember that when I’m talking with you or reading your writings.
EDIT: Also, I can’t tell the difference between what you mean by “alignment” and what you mean by “corrigibility”. (I had thought that perhaps in this mindcrime example you’d call the AI corrigible but not aligned, but apparently that’s not the case.) Are you using the two terms interchangeably? If not can you explain the difference?
I mean that if an AI does not have the intellectual capacity to model the user nearly as well as a typical human would, then it’s bound either to refuse all requests except those that don’t require modeling the user well, or to make a lot more mistakes while trying to help the user than a human assistant would. In other words, by “dangerous” I meant substantially more dangerous than a typical human assistant. Does my argument make more sense now?
Ah, I agree this is ambiguous; I’m using a de dicto rather than de re interpretation of “trying to do what I want it to do.” It would be great to have a clearer way to express this.
Suppose that I give an indirect definition of “my long-term values” and then build an AI that effectively optimizes those values. Such an AI would likely disempower me in the short term, in order to expand faster, improve my safety, and so on. It would be “aligned” but not “corrigible.”
Similarly, if I were to train an AI to imitate a human who was simply attempting to get what they want, then that AI wouldn’t be corrigible. It may or may not be aligned, depending on how well the learning works.
In general, my intuition is that corrigibility implies alignment but not the other way around.
I don’t expect that such an AI would necessarily be substantially more dangerous than a typical human assistant. It might be, but there are factors pushing in both directions. In particular, “modeling the user well” seems like just one of many properties that affects how dangerous an assistant is.
On top of that, it’s not clear to me that such an AI would be worse at modeling other humans, at the point when it was human level. I think this will mostly be determined by the capacity of the model being trained, and how it uses this capacity (e.g. whether it is being asked to make large numbers of predictions about humans, or about physical systems), rather than by features of the early stages of the amplification training procedure.
That clarifies things a bit, but I’m not sure how to draw a line between what counts as aligned de dicto and what doesn’t, or how to quantify it. Suppose I design an AI that uses a hand-coded algorithm to infer what the user wants and to optimize for that, and it generally works well but fails to infer that I disvalue mindcrimes. (For people who might be following this but not know what “mindcrimes” are, see section 3 of this post.) This seems analogous to IDA failing to infer that the user disvalues mindcrimes, so you’d count it as aligned? But there’s a great (multi-dimensional) range of possible errors, and it seems like there must be some types or severities of value-learning errors where you’d no longer consider the AI to be “trying to do what I want it to do”, but I don’t know what those are.
Can you propose a more formal definition, maybe something along the lines of “If in the limit of infinite computing power, this AI would achieve X% of the maximum physically feasible value of the universe, then we can call it X% Aligned”?
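As a formula, the proposal might be written as follows (the notation is my restatement, not Wei Dai’s): let V(π, C) be the realized value of the universe when the AI π is run with compute budget C, and V* the maximum physically feasible value.

```latex
\mathrm{Align}(\pi) \;=\; \lim_{C \to \infty} \frac{\mathbb{E}\!\left[\, V(\pi, C) \,\right]}{V^{*}},
\qquad \pi \text{ is ``}X\%\text{ Aligned'' if } \mathrm{Align}(\pi) \ge X\%.
```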
Not sure how motivated you are to continue this line of discussion, so I’ll mention that uncertainty/confusion about a concept/term as central as “alignment” seems really bad. For example if you say “I think my approach can achieve AI alignment” and you mean one thing but the reader thinks you mean another, that might lead to serious policy errors. Similarly if you hold a contest on “AI alignment” and a participant misinterprets what you mean and submits something that doesn’t qualify as being on topic, that’s likely to cause no small amount of frustration.
I don’t have a more formal definition. Do you think that you or someone else has a useful formal definition we could use? I would be happy to adopt a more formal definition if it doesn’t have serious problems.
Or: are there some kinds of statements that you think shouldn’t be made without a more precise definition? Is there an alternative way to describe a vague area of research that I’m interested in, that isn’t subject to the same criticism? Do you think I typically use “alignment” in a way that’s unnecessarily problematic in light of the likely misunderstanding? I don’t see this issue as nearly as important as you do, but am happy to make low-cost adjustments.
Here’s how I see it:
We almost certainly won’t build AI which knows all potentially relevant facts about our preferences (or about the world, or about logical facts) and therefore never makes a morally relevant mistake.
Anyone who describes “aligned AGI” or “safe AI” or “FAI” is therefore talking about some milder definition than this, e.g. involving making reasonable tradeoffs between VOI and the cost of eliciting preferences, between the risk of catastrophe and the costs of inaction, and so on.
No one has yet offered a convincing milder definition, and there may be no binary definition of “success” vs. “failure.” My milder definition is clearly imprecise, like all of the other implicit definitions people use.
Is this different from your view of the situation?
I don’t think this is a likely way to get a good definition of alignment (“good” in the sense of either being useful or of tracking how the term is typically used).
Given competitive pressures, lots of things that are obviously not AI alignment affect how much of the universe’s value you realize (for example, do you accidentally blow up the world while doing physics). Conversely, given no competitive pressure, your AI would not need to do anything risky, either concerning its own cognition or concerning physics experiments. It’s not clear whether we’ll realize 100% of the realizable value, but again the difficulty seems completely unrelated to AI and instead related to the probable course of human deliberation.
So this is basically just equivalent to eliminating competitive pressure as safely as possible in the limit of infinite computing power, i.e. it’s evaluating how well a proposed AI design solves a particular unrealistic problem. I think it would be likely to be solved by techniques like “learn high-fidelity brain emulations and run them really fast,” which seem quite different from promising approaches to alignment.
I was trying to capture the meaning of your informal definition, so I don’t understand why “learn high-fidelity brain emulations and run them really fast” being considered aligned according to my definition is a problem, when it also seems to fit your definition of “trying to do what I want it to do”. Are you saying that kind of AI doesn’t fit your definition? Or that “promising approaches to alignment” would score substantially worse than “learn high-fidelity brain emulations and run them really fast” according to my definition (i.e., achieve much less value when given infinite computing power)?
Also, I don’t see it as a problem if “aligned” ignores competition and computational limitations, since once we agree on what alignment means in the absence of these concerns we can then coin “competitively aligned” or “feasibly aligned” or what-have-you and try to define them. But mainly I don’t understand why you’re objecting when your own definition ignores these issues.
Here is a clarification of my previous comment, which I believe was based on a misunderstanding:
I don’t like the definition “an AGI is aligned if running it leads to good long-term outcomes” as a way of carving out a set of research problems or a research goal, because “AI alignment” then includes basically all x-risk relevant research. For example, it would include understanding physics relevant to possible high-energy physics catastrophes, and then making sure we give that information to our AGI so that it doesn’t inadvertently cause a physics catastrophe.
When I use “AI alignment,” I don’t want to include differential progress in fundamental physics that could help avoid catastrophes.
Your definition in the parent only requires good behavior in the limit of infinite computation, which I assumed was a way to make these other problems easy, and thereby exclude them from the definition. For example, if we have infinite computation, our AI can then do exhaustive Bayesian inference about possible theories of physics in order to make optimal decisions. And therefore progress in physics wouldn’t be relevant to AI alignment.
But I don’t think this trick works for separating out AI alignment problems in particular, because giving your AI infinite computation (while not giving competitors infinite computation) also eliminates most of the difficulties that we do want to think of as AI alignment.
Here is what I now believe you are/were saying:
I don’t think this is helpful either, because this “alignment” definition only tells us something about the behavior of our agent when we run it with infinite computation, and nothing about what happens when we run it in the real world. For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
Saying what “aligned” means in the limit of infinite computation may be a useful step towards giving a definition in the realistic case of finite computation (though I don’t see how to make progress along those lines). I would be inclined to give that concept some name like “asymptotically aligned” and then use “aligned” interchangeably with “actually aligned, as implemented in the real world.”
I also think defining asymptotic alignment is non-trivial. I’d try something like: “when run with infinite computing power and perfect information about the operator, including the operator’s knowledge about the world, the system outputs optimal decisions according to the operator’s {preferences}” where {preferences} is a stand-in for some as-yet-undefined concept that includes the operator’s enlightened preferences, beliefs, decision theory, etc.
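Schematically (my notation, with {preferences} left as the same undefined placeholder), the condition might look like:

```latex
% K_H : the operator H's knowledge about the world; \pi_C : the system run with compute budget C.
\pi \text{ is asymptotically aligned with } H \;\iff\;
\lim_{C \to \infty} \pi_C(d, K_H) \,\in\,
\operatorname*{arg\,max}_{a} \; \mathbb{E}\!\left[\, \{\text{preferences}\}_H \mid a,\, K_H \,\right]
\quad \text{for every decision problem } d.
```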
Let me know if I am still misunderstanding you.
As a meta note: My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress). It might be more useful to anchor this discussion to some particular significant problems arising from our definitional unclarity, if you think that it’s an important enough issue to be worth spending time on.
(In addition to the other reasons I gave for prioritizing clarity of definitions/explanations) I’d like to help contribute to making forward progress on these things (despite not being as optimistic as you), but it’s hard to do that without first understanding your existing ideas and intuitions, and that’s hard to do while being confused about what your words mean. I think this probably also applies to others who would like to contribute to this research.
For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
In my comment that started this sub-thread, I asked “Do you consider this [your mindcrime example] a violation of alignment?” You didn’t give a direct yes or no answer, but I thought it was clear from what you wrote that the answer is “no” (and therefore you consider these kinds of difficulties to be irrelevant according to your own definition of alignment), which is why I proposed the particular formalization that I did. I thought you were saying that these kinds of difficulties are not relevant to “alignment” but are relevant to “safety”. Did I misunderstand your answer, or perhaps you misunderstood my question, or something else?
I don’t think {not noticing that mindcrime is a problem} is a violation of alignment: the AI is trying to do what you want but makes a moral error.
I do think {if the AI is too weak, it secretly plots to kill everyone} is a violation of alignment: the AI isn’t trying to do what you want. It knows that you don’t want it to kill everyone, that’s why it’s trying to keep it secret.
(It’s technically possible for an AI to kill everyone, and even to secretly kill everyone, because it is trying to do what you want but makes a mistake. This seems like an inevitable feature of any sensible definition of alignment. I expect to now have an involved discussion about what the difference is.)
(Wei Dai and I discussed my definition of alignment offline, leading to this post which hopefully clarifies things a little bit in addition to summarizing the takeaways from this thread.)
The main problem from my perspective is that MIRI is using “alignment” in a very different way, to refer to a larger field of study that includes what you call “safety” and even “how rapidly an AI might gain in capability”. I think if you had a formal technical definition that you want to overload the term with, that would be fine if it’s clear (from context or explicit explanation) when you’re referring to the technical term. But since you only have a vague/ambiguous informal definition, a lot of people, if they were introduced to the term via MIRI’s writings, will easily round off your definition to theirs and fail to notice that you’re talking about something much narrower. This is even worse when you refer to “alignment” without giving any definition as in most of your writings.
The upshot here is that when you say something like “Many people endorse this or a similar vision as their current favored approach to alignment” a lot of people will interpret that as meaning your approach is supposed to solve many more problems than what you have in mind.
Given this, I think unless you can come up with a formal technical definition, you should avoid using “alignment” and pick a less overloaded term, or maybe put disclaimers everywhere. It occurs to me that it might feel unfair to you that I’m suggesting that you change your wording or add disclaimers, instead of MIRI. This is because I have the impression that more people were introduced to the term “AI alignment” through MIRI’s writings than yours, and therefore more people already have their definition in mind. (For example Eliezer just explained his version of “alignment” in his podcast with Sam Harris, who I understand to have a pretty large audience.) If that’s not the case then I’d make the suggestion to MIRI instead.
Even if you do use another term, people are still liable to round that off to the nearest concept that they’re familiar with, which would likely be MIRI’s “AI alignment”, or interpret “trying to do what we want them to do” in the de re sense, or get confused in some other way. So you probably need to write a post explaining your concept as clearly as you can and how it differs from nearby concepts, and then link to it every time you use the new term at least until most people become familiar with it.
I had previously described this problem as the “control problem” and called my blog “AI control,” following Nick Bostrom’s usage. Several people had expressed dissatisfaction with the term “control problem,” which I sympathized with (see this comment by Rob Bensinger from MIRI).
I adopted the term “AI alignment” after an email thread started by Rob about a year ago with a dozen people who frequently used the term, which was centered around the suggestion:
He later clarified that he actually meant what Bostrom calls “the second principal agent problem,” the principal agent problem between humans and AI rather than amongst humans, which was how I was using “control problem” and what I feel is the most useful concept.
I don’t have strong feelings about terminology, and so went with the consensus of others on the thread, and have been using “alignment” instead of control since then.
I agree that the usage by Eliezer in that Arbital post is much broader. I think it’s a much less useful concept than Nick’s control problem. Is it used by Eliezer or MIRI researchers in other places? Is it used by other people?
(Note that “aligned” and “the alignment problem” could potentially have separate definitions, which is in part responsible for our confusion in the other thread).
My best guess is that “alignment” should continue to be used for this narrower problem rather than the entire problem of making AI good. I’m certainly open to the possibility that alignment is being frequently misunderstood and should be explained + linked, and that is reasonably cheap (though I’d prefer to get some evidence about that; you are the main person I talk to who seems to endorse the very broad reading).
(Note that the question “how fast will AI gain in capability” is also a relevant subproblem to the narrower use of “alignment,” since knowing more about AI development makes it easier to solve the alignment problem.)
Unfortunately most people don’t bother to define “alignment” when they use it, or do so very vaguely. But aside from Eliezer, I found a couple more places that seem to define it more broadly than you here. LCFI:
And yourself in 2017:
I also did find an instance of someone defining “alignment” as a sub-field of “AI safety” as you do here.
I define “AI alignment” these days roughly the way the Open Philanthropy Project does:
More specifically, I think of the alignment problem as “find a way to use AGI systems to do at least some ambitious, high-impact things, without inadvertently causing anything terrible to happen relative to the operator’s explicit and implicit preferences”.
This is an easier goal than “find a way to safely use AGI systems to do everything the operator could possibly want” or “find a way to use AGI systems to do everything everyone could possibly want, in a way that somehow ‘correctly’ aggregates preferences”; I sometimes see problem statements like those referred to as the “full” alignment problem.
It’s a harder goal than “find a way to get AGI systems to do roughly what the operators have in mind, without necessarily accounting for failure modes the operators didn’t think of”. Following the letter of the law rather than the spirit is only OK insofar as the difference between letter and spirit is non-catastrophic relative to the operators’ true implicit preferences.
If developers and operators can’t foresee every potential failure mode, alignment should still mean that the system fails gracefully. If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe. This does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee.
This way of thinking about the alignment problem seems more useful to me because it factors out questions related to value disagreements and coordination between humans (including Bostrom’s first principal-agent problem), but leaves “aligned” contentful enough that it does actually mean we’re keeping our eye on the ball. We’re not ignoring how catastrophic-accident-prone the system actually is just because the developer was being dumb.
(I guess you’d want a stronger definition if you thought it was realistic that AGI developers might earnestly in their heart-of-hearts just want to destroy the world, since that case does make the alignment problem too trivial.
I’m similarly assuming that there won’t be a deep and irreconcilable values disagreement among stakeholders about whether we should conservatively avoid high risk of mindcrime, though there may be factual disagreements aplenty, and perhaps there are irreconcilable casewise disagreements about where to draw certain normative category boundaries once you move past “just be conservative and leave a wide berth around anything remotely mindcrime-like” and start trying to implement “full alignment” that can spit out the normatively right answer to every important question.)
I wrote a post attempting to clarify my definition. I’d be curious about whether you agree.
Speaking to the discussion Wei Dai and I just had, I’m curious about whether you would consider any or all of these cases to be alignment failures:
There is an opportunity to engage in acausal trade that will disappear once your AI becomes too powerful, and the AI fails to take that opportunity before becoming too powerful.
Your AI doesn’t figure out how to do a reasonable “values handshake” with a competitor (where two agents agree to both pursue some appropriate compromise values in order to be Pareto efficient), conservatively avoids such handshakes, and then gets outcompeted because of the resulting inefficiency.
Your AI has well-calibrated normative uncertainty about how to do such handshakes, but decides that the competitive pressure to engage in them is strong enough to justify the risk, and makes a binding agreement that we would eventually recognize as suboptimal.
In fact our values imply that it’s a moral imperative to develop as fast as possible, your AI fails to notice this counterintuitive argument, and therefore develops too slowly and leaves 50% of the value of the universe on the table.
Your AI fails to understand consciousness (like us), has well-calibrated moral uncertainty about the topic, but responds to competitive pressure by taking a risk and running some simulations that we would ultimately regard as experiencing enough morally relevant suffering to be called a catastrophe.
Your AI faces a moral decision about how much to fight for your values, and it decides to accept a risk of extinction that on reflection you’d consider unacceptably high.
Someone credibly threatens to blow up the world if your AI doesn’t give them stuff, and your AI capitulates even though on reflection we’d regard this as a mistake.
I’m not sure whether your definition is intended to include these. The sentence “this does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee” does suggest that interpretation, but it also sounds like you maybe aren’t explicitly thinking about problems of this kind or are assuming that they are unimportant.
I wouldn’t consider any of these “alignment problems.” These are distinct problems that we’ll face whether or not we build an AI. Whether they are important is mostly unrelated to the usual arguments for caring about AI alignment, and the techniques that we will use to solve them are probably unrelated to the techniques we will use to build an AI that won’t kill us outright. (Many of these problems are likely to be solved by an AI, just like P != NP is likely to be proved by an AI, but that doesn’t make either of them an alignment problem.)
If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
(I do agree that building an AI which took control of the world away from us but then was never able to resolve these problems would probably be a failure of alignment.)
I really like that list of points! Not that I’m Rob, but I’d mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I’d expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I’d be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about “black swans” popping up. And when I said:
I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn’t mean “the AI is incorrigible if it’s not high-impact calibrated”, I meant “the AI, even if corrigible, would be unsafe if it’s not high-impact calibrated”.)
I think I understand your position much better now. The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”, and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it’s not good enough at figuring out what is right, even if it’s corrigible.
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...
OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study including understanding humans, historical knowledge, physics expertise). (I wish he would be more explicit about that to avoid confusion.)
Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, which areas do you expect it to be smarter than human, which dumber than human? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?
I replied about (2) and black swans in a comment way down.
I’m curious to hear more about your thoughts about (4).
To flesh out my intuitions around (4) and (5): I think there are many tasks where a high-dimensional and difficult to articulate piece of knowledge is critical for completing the task. For example:
if you’re Larry or Sergey trying to hire a new CEO, you need your new CEO to be a culture fit. Which in this case means something like “being technical, brilliant, and also a hippie at heart”. It’s really, really hard to communicate this to a slick MBA. Especially the “be a hippie at heart” part. Maybe if you sent them to Burning Man and had them take a few drugs, they’d grok it?
if you’re Bill Gates hiring a new CEO, you should make sure your new CEO is also a developer at heart, not a salesman. Otherwise, you might hire Steve Ballmer, who drove Microsoft’s revenues through the roof for a few years, but also had little understanding of developers (for example he produced an event where he celebrated developers in a way developers don’t tend to like being celebrated). This led to an overall trend of the company losing its technical edge, and thus its competitive edge… this was all while Ballmer had worked with Gates at Microsoft for two decades. If Ballmer had been a developer, he might have been able to avoid this, but he very much wasn’t.
if you’re a self-driving car engineer delegating image classification to a modern-day neural net, you’d really want its understanding of what the classifications mean to match yours, lest it be susceptible to clever adversarial attacks. Humans understand the images to represent projections of crisp three-dimensional objects that exist in a physical world; image classifiers don’t, which is why they can get fooled so easily by overlays of random patterns. (A sketch of one such attack appears after this list.)
if you’re a startup trying to build a product, it’s very hard to do so correctly if you don’t have a detailed implicit model of your users’ workflows and pain points. It helps a lot to talk to them, but even then, you may only be getting 10% of the picture if you don’t know what it’s like to be them. Most startups die by not having this picture, flying blind, and failing to acquire any users.
if you’re trying to help your extremely awkward and non-neurotypical friend find a romantic partner, you might find it difficult to convey what exactly is so bad about carrying around slips of paper with clever replies, and pulling them out and reading from them when your date says something you don’t have a reply to. (It’s not that hard to convey why doing this particular thing is bad. It’s hard to convey what exactly about it is bad, that would have him properly generalize and avoid all classes of mistakes like this going forward, rather than just going like “Oh, pulling out slips of paper is jarring and might make her feel bad, so I’ll stop doing this particular thing”.) (No, I did not make up this example.)
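For the image-classifier example above, here is a minimal sketch of the standard kind of attack (an FGSM-style gradient-based perturbation rather than a literally random overlay, but it illustrates the same fragility; the model and data are placeholders of my own, not anything from this discussion):

```python
import torch

def fgsm_perturb(model, images, labels, epsilon=0.03):
    """Return copies of `images` nudged so a classifier tends to mislabel them,
    while to a human they look essentially unchanged.
    `images` is a batched float tensor in [0, 1]; `labels` holds the true class indices."""
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    # Step each pixel slightly in the direction that increases the classification loss.
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0, 1).detach()
```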
In these sorts of situations, I wouldn’t trust an AI to capture my knowledge/understanding. It’s often tacit and perceptual, it’s often acquired through being a human making direct contact with reality, and it might require a human cognitive architecture to even comprehend in the first place. (Hence my claims that proper generalization requires having the same ontologies as the overseer, which they obtained from their particular methods of solving a problem.)
In general, I feel really sketched about amplifying oversight, if the mechanism involves filtering your judgment through a bunch of well-intentioned non-neurotypical assistants, since I’d expect the tacit understandings that go into your judgment to get significantly distorted. (Hence my curiosity about whether you think we can avoid the judgment getting significantly distorted, and/or whether you think we can do fine even with significantly distorted judgment.)
It’s also pretty plausible that I’m talking completely past you here; please let me know if this is the case.
Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I agree that “AI systems are likely to generalize differently from humans.” I strongly believe we shouldn’t rest AI alignment on detailed claims about how an AI will generalize to a new distribution. (Though I do think we can hope to avoid errors of commission on a new distribution.)
I think my present view is something like a conjunction of:
1. An AI needs to learn human representations in order to generalize like a human does.
2. For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
3. For a very broad range of narrow tasks, the AI does not need to generalize like a human does in order to be safe (or, it’s easy for it to generalize like a human). Go is in this category, ZFC theorem-provers are probably in this category, and I can imagine a large swath of engineering automation also falls into this category.
4. To the extent that “general and open-ended tasks” can be broken down into narrow tasks that don’t require human generalization, they don’t require human generalization to learn safely.
My current understanding is that we agree on (3) and (4), and that you either think that (2) is false, or that it’s true but the bar for “sufficiently general and open-ended” is really high, and tasks like achieving global stability can be safely broken down into safe narrow tasks. Does this sound right to you?
I’m confused about your thoughts on (1).
(I’m currently rereading your blog posts to get a better sense of your models of how broad and general tasks can get broken down into narrow ones.)
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are also mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not an especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind of subtle benign mistake you expect to be catastrophic?
Additionally, I think there are ways to misunderstand the IDA approach that leave out significant parts of its complexity (i.e. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but that can still seem like plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss those problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, i.e. the experiments that Ought is running.)
I guess this is in part because that’s how Paul initially described his approach, before coming up with Security Amplification in October 2016. For example in March 2016 I wrote “First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset. Let me know if this is wrong.” and Paul didn’t object to this in his reply.
An additional issue is that even people who intellectually understand the new model might still have intuitions left over from the old one. For example I’m just now realizing that the low-amplification agents in the new scheme must have thought processes and “deliberations” that are very alien, since they don’t have human priors, natural language understanding, values, common sense judgment, etc. I wish Paul had written a post in big letters that said, “WARNING: Throw out all your old intuitions!”
I’m a little confused about what this statement means. I thought that if you have an overseer that implements some reasoning core, and consider amplify(overseer) with infinite computation time and unlimited ability to query the world (i.e. for background information on what humans seem to want, how they behave, etc.), then amplify(overseer) should be able to solve any problem that an agent produced by iterating IDA could solve.
Did you mean to say that:
1. “already highly competent at these tasks” means that the core should be able to solve these problems without querying the world at all, and this is not likely to be possible?
2. you don’t expect to find a core such that just one round of amplification, amplify(overseer), can solve practical tasks in any reasonable amount of time/number of queries?
3. there is some other way that the agent produced by IDA would be more competent than the original amplified overseer?
I mean that the core itself, as a policy, won’t be able to solve these problems. It also won’t solve them after a small number of amplification steps. And probably it will have to query the world.
What is the difference between “core after a small number of amplification steps” and “core after a large number of amplification steps” that isn’t captured in “larger effective computing power” or “larger set of information about the world”, and allows the highly amplified core to solve these problems?
I didn’t mean to suggest there is a difference other than giving it more computation and more data.
I was imagining Amplify(X) as a procedure that calls X a bounded number of times, so that you need to iterate Amplify in order to have arbitrarily large runtimes, while I think you were imagining a parameterized operation Amplify(X, n) that takes n time and so can be scaled up directly. Your usage also seems fine.
Even if that’s not the difference, I strongly expect we are on the same page here about everything other than words. I’ve definitely updated some about the difficulty of words.
Okay, I agree that we’re on the same page. Amplify(X,n) is what I had in mind.
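To pin down the terminological point, here is a minimal Python sketch of the two readings of Amplify. Everything in it (the names, the BUDGET constant, the decompose/combine/distill stand-ins) is illustrative, not anything taken from Paul's writing:

```python
from typing import Callable, List

# An "agent" here is just a function from a question to an answer.
Agent = Callable[[str], str]

BUDGET = 10  # illustrative bound on how many times Amplify may call the agent


def decompose(question: str) -> List[str]:
    """Stand-in for however the overseer breaks a question into subquestions."""
    return [f"subquestion {i} of: {question}" for i in range(100)]


def combine(question: str, subanswers: List[str]) -> str:
    """Stand-in for recombining subanswers into an answer to the original question."""
    return f"answer to {question!r} built from {len(subanswers)} subanswers"


def amplify_bounded(agent: Agent) -> Agent:
    """Reading 1: Amplify(X) may only call X a bounded number of times,
    so arbitrarily large runtimes require iterating Amplify (interleaved
    with distillation in the full IDA scheme)."""
    def amplified(question: str) -> str:
        subanswers = [agent(q) for q in decompose(question)[:BUDGET]]
        return combine(question, subanswers)
    return amplified


def amplify_parameterized(agent: Agent, n: int) -> Agent:
    """Reading 2: Amplify(X, n) takes a budget n, so it can be scaled up
    directly instead of being iterated."""
    def amplified(question: str) -> str:
        subanswers = [agent(q) for q in decompose(question)[:n]]
        return combine(question, subanswers)
    return amplified


def distill(agent: Agent) -> Agent:
    """Stand-in for training a fast model to imitate the slow amplified agent;
    a no-op in this sketch."""
    return agent


def iterate_ida(core: Agent, rounds: int) -> Agent:
    """Reading 1 scaled up: repeatedly amplify (with a fixed budget) and distill."""
    agent = core
    for _ in range(rounds):
        agent = distill(amplify_bounded(agent))
    return agent
```

Under the first reading, more total computation comes only from iterating (amplify, then distill, then amplify again); under the second, you just pass a larger n. Either way, the amplified core differs from the bare core only in having more computation and more data, which matches the resolution above.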