I don’t have a more formal definition. Do you think that you or someone else has a useful formal definition we could use? I would be happy to adopt a more formal definition if it doesn’t have serious problems.
Or: are there some kinds of statements that you think shouldn’t be made without a more precise definition? Is there an alternative way to describe a vague area of research that I’m interested in, that isn’t subject to the same criticism? Do you think I typically use “alignment” in a way that’s unnecessarily problematic in light of the likely misunderstanding? I don’t see this issue as nearly as important as you do, but am happy to make low-cost adjustments.
> But there’s a great (multi-dimensional) range of possible errors, and it seems like there must be some types or severities of value-learning errors where you’d no longer consider the AI to be “trying to do what I want it to do”, but I don’t know what those are.
Here’s how I see it:
We almost certainly won’t build AI which knows all potentially relevant facts about our preferences (or about the world, or about logical facts) and therefore never makes a morally relevant mistake.
Anyone who describes “aligned AGI” or “safe AI” or “FAI” is therefore talking about some milder definition than this, e.g. involving making reasonable tradeoffs between the value of information (VOI) and the cost of eliciting preferences, between the risk of catastrophe and the costs of inaction, and so on.
No one has yet offered a convincing milder definition, and there may be no binary definition of “success” vs. “failure.” My milder definition is clearly imprecise, like all of the other implicit definitions people use.
Is this different from your view of the situation?
> maybe something along the lines of “If in the limit of infinite computing power, this AI would achieve X% of the maximum physically feasible value of the universe, then we can call it X% Aligned”?
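A minimal sketch of how this quoted proposal could be written as a formula, using illustrative placeholder notation only (none of these symbols appear in the discussion): let V(A, C) be the expected long-run value of the universe if AI design A is run with computing budget C, and V_max the maximum physically feasible value.

```latex
% Sketch of the quoted "X% Aligned" proposal (illustrative notation only).
% V(A, C): expected long-run value of the universe if AI design A is run
%          with computing budget C.
% V_max:   maximum physically feasible value of the universe.
\[
  \text{Alignment}(A)
  \;=\;
  \lim_{C \to \infty} \frac{\mathbb{E}\left[ V(A, C) \right]}{V_{\max}} \times 100\% .
\]
```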
I don’t think this is a likely way to get a good definition of alignment (“good” in the sense of either being useful or of tracking how the term is typically used).
Given competitive pressures, lots of things that are obviously not AI alignment affect how much of the universe’s value you realize (for example, do you accidentally blow up the world while doing physics). Conversely, given no competitive pressure, your AI would not need to do anything risky, either concerning its own cognition or concerning physics experiments. It’s not clear whether we’ll realize 100% of the realizable value, but again the difficulty seems completely unrelated to AI and instead related to the probable course of human deliberation.
So this is basically just equivalent to eliminating competitive pressure as safely as possible in the limit of infinite computing power, i.e. it’s evaluating how well a proposed AI design solves a particular unrealistic problem. I think it would be likely to be solved by techniques like “learn high-fidelity brain emulations and run them really fast,” which seem quite different from promising approaches to alignment.
I was trying to capture the meaning of your informal definition, so I don’t understand why “learn high-fidelity brain emulations and run them really fast” being considered aligned according to my definition is a problem, when it also seems to fit your definition of “trying to do what I want it to do”. Are you saying that kind of AI doesn’t fit your definition? Or that “promising approaches to alignment” would score substantially worse than “learn high-fidelity brain emulations and run them really fast” according to my definition (i.e., achieve much less value when given infinite computing power)?
Also, I don’t see it as a problem if “aligned” ignores competition and computational limitations, since once we agree on what alignment means in the absence of these concerns we can then coin “competitively aligned” or “feasibly aligned” or what-have-you and try to define them. But mainly I don’t understand why you’re objecting when your own definition ignores these issues.
Here is a clarification of my previous comment, which I believe was based on a misunderstanding:
I don’t like the definition “an AGI is aligned if running it leads to good long-term outcomes” as a way of carving out a set of research problems or a research goal, because “AI alignment” then includes basically all x-risk relevant research. For example, it would include understanding physics relevant to possible high-energy physics catastrophes, and then making sure we give that information to our AGI so that it doesn’t inadvertently cause a physics catastrophe.
When I use “AI alignment,” I don’t want to include differential progress in fundamental physics that could help avoid catastrophes.
Your definition in the parent only requires good behavior in the limit of infinite computation, which I assumed was a way to make these other problems easy, and thereby exclude them from the definition. For example, if we have infinite computation, our AI can then do exhaustive Bayesian inference about possible theories of physics in order to make optimal decisions. And therefore progress in physics wouldn’t be relevant to AI alignment.
But I don’t think this trick works for separating out AI alignment problems in particular, because giving your AI infinite computation (while not giving competitors infinite computation) also eliminates most of the difficulties that we do want to think of as AI alignment.
Here is what I now believe you are/were saying:
> Let’s define “aligned” to mean something like “would yield good outcomes if run with infinite computation.” Then we can describe our research goal in terms of “alignment” as something like “We want a version of technique X that has the same advantages as X but produces aligned agents.”
I don’t think this is helpful either, because this “alignment” definition only tells us something about the behavior of our agent when we run it with infinite computation, and nothing about what happens when we run it in the real world. For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
Saying what “aligned” means in the limit of infinite computation may be a useful step towards giving a definition in the realistic case of finite computation (though I don’t see how to make progress along those lines). I would be inclined to give that concept some name like “asymptotically aligned” and then use “aligned” interchangeably with “actually aligned, as implemented in the real world.”
I also think defining asymptotic alignment is non-trivial. I’d try something like: “when run with infinite computing power and perfect information about the operator, including the operator’s knowledge about the world, the system outputs optimal decisions according to the operator’s {preferences}” where {preferences} is a stand-in for some as-yet-undefined concept that includes the operator’s enlightened preferences, beliefs, decision theory, etc.
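As a rough symbolic sketch of this attempted definition (illustrative notation only: π_A(s) for the system’s output in situation s, and U_op standing in for the undefined {preferences} placeholder):

```latex
% Sketch of "asymptotic alignment" as attempted above (illustrative notation).
% pi_A(s): the decision the system outputs in situation s, when run with
%          unlimited computation and perfect information about the operator.
% U_op:    stand-in for the operator's {preferences} -- enlightened
%          preferences, beliefs, decision theory, etc. (left undefined).
\[
  \pi_A(s) \;\in\; \arg\max_{a} \;
  \mathbb{E}\left[\, U_{\mathrm{op}}(\mathrm{outcome}) \mid s, a \,\right]
  \qquad \text{for every situation } s .
\]
```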
Let me know if I am still misunderstanding you.
As a meta note: My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress). It might be more useful to anchor this discussion to some particular significant problems arising from our definitional unclarity, if you think that it’s an important enough issue to be worth spending time on.
> My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress).
(In addition to the other reasons I gave for prioritizing clarity of definitions/explanations) I’d like to help contribute to making forward progress on these things (despite not being as optimistic as you), but it’s hard to do that without first understanding your existing ideas and intuitions, and that’s hard to do while being confused about what your words mean. I think this probably also applies to others who would like to contribute to this research.
> For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
In my comment that started this sub-thread, I asked “Do you consider this [your mindcrime example] a violation of alignment?” You didn’t give a direct yes or no answer, but I thought it was clear from what you wrote that the answer is “no” (and therefore you consider these kinds of difficulties to be irrelevant according to your own definition of alignment), which is why I proposed the particular formalization that I did. I thought you were saying that these kinds of difficulties are not relevant to “alignment” but are relevant to “safety”. Did I misunderstand your answer, or perhaps you misunderstood my question, or something else?
I don’t think {not noticing that mindcrime is a problem} is a violation of alignment: the AI is trying to do what you want but makes a moral error.
I do think {if the AI is too weak, it secretly plots to kill everyone} is a violation of alignment: the AI isn’t trying to do what you want. It knows that you don’t want it to kill everyone, that’s why it’s trying to keep it secret.
(It’s technically possible for an AI to kill everyone, and even to secretly kill everyone, because it is trying to do what you want but makes a mistake. This seems like an inevitable feature of any sensible definition of alignment. I expect to now have an involved discussion about what the difference is.)
(Wei Dai and I discussed my definition of alignment offline, leading to this post which hopefully clarifies things a little bit in addition to summarizing the takeaways from this thread.)
> Or: are there some kinds of statements that you think shouldn’t be made without a more precise definition? Is there an alternative way to describe a vague area of research that I’m interested in, that isn’t subject to the same criticism? Do you think I typically use “alignment” in a way that’s unnecessarily problematic in light of the likely misunderstanding? I don’t see this issue as nearly as important as you do, but am happy to make low-cost adjustments.
The main problem from my perspective is that MIRI is using “alignment” in a very different way, to refer to a larger field of study that includes what you call “safety” and even “how rapidly an AI might gain in capability”. I think if you had a formal technical definition that you want to overload the term with, that would be fine if it’s clear (from context or explicit explanation) when you’re referring to the technical term. But since you only have a vague/ambiguous informal definition, a lot of people, if they were introduced to the term via MIRI’s writings, will easily round off your definition to theirs and fail to notice that you’re talking about something much narrower. This is even worse when you refer to “alignment” without giving any definition as in most of your writings.
The upshot here is that when you say something like “Many people endorse this or a similar vision as their current favored approach to alignment” a lot of people will interpret that as meaning your approach is supposed to solve many more problems than what you have in mind.
Given this, I think unless you can come up with a formal technical definition, you should avoid using “alignment” and pick a less overloaded term, or maybe put disclaimers everywhere. It occurs to me that it might feel unfair to you that I’m suggesting that you change your wording or add disclaimers, instead of MIRI. This is because I have the impression that more people were introduced to the term “AI alignment” through MIRI’s writings than yours, and therefore more people already have their definition in mind. (For example Eliezer just explained his version of “alignment” in his podcast with Sam Harris, who I understand to have a pretty large audience.) If that’s not the case then I’d make the suggestion to MIRI instead.
Even if you do use another term, people are still liable to round that off to the nearest concept that they’re familiar with, which would likely be MIRI’s “AI alignment”, or interpret “trying to do what we want them to do” in the de re sense, or get confused in some other way. So you probably need to write a post explaining your concept as clearly as you can and how it differs from nearby concepts, and then link to it every time you use the new term at least until most people become familiar with it.
I had previously described this problem as the “control problem” and called my blog “AI control,” following Nick Bostrom’s usage. Several people had expressed dissatisfaction with the term “control problem,” which I sympathized with (see this comment by Rob Bensinger from MIRI).
I adopted the term “AI alignment” after an email thread started by Rob about a year ago with a dozen people who frequently used the term, which was centered around the suggestion:
> I think we should use the term “AI alignment” (and “alignment”, where AI is assumed) as a synonym for Bostrom’s “control problem,” since this is already more or less how the terms are most commonly used.
He later clarified that he actually meant what Bostrom calls “the second principal-agent problem,” the principal-agent problem between humans and AI rather than amongst humans, which was how I was using “control problem” and what I feel is the most useful concept.
I don’t have strong feelings about terminology, and so went with the consensus of others on the thread, and have been using “alignment” instead of control since then.
I agree that the usage by Eliezer in that Arbital post is much broader. I think it’s a much less useful concept than Nick’s control problem. Is it used by Eliezer or MIRI researchers in other places? Is it used by other people?
(Note that “aligned” and “the alignment problem” could potentially have separate definitions, which is in part responsible for our confusion in the other thread).
My best guess is that “alignment” should continue to be used for this narrower problem rather than the entire problem of making AI good. I’m certainly open to the possibility that alignment is being frequently misunderstood and should be explained + linked, and that is reasonably cheap (though I’d prefer to get some evidence about that; you are the main person I talk to who seems to endorse the very broad reading).
(Note that the question “how fast will AI gain in capability” is also a relevant subproblem to the narrower use of “alignment,” since knowing more about AI development makes it easier to solve the alignment problem.)
Unfortunately most people don’t bother to define “alignment” when they use it, or do so very vaguely. But aside from Eliezer, I found a couple more places that seem to define it more broadly than you here. LCFI:
> The Value Alignment Project seeks to design methods for preventing AI systems from inadvertently acting in ways inimical to human values.
And yourself in 2017:
I also did find an instance of someone defining “alignment” as a sub-field of “AI safety” as you do here.
I define “AI alignment” these days roughly the way the Open Philanthropy Project does:
> the problem of creating AI systems that will reliably do what their users want them to do even when AI systems become much more capable than their users across a broad range of tasks
More specifically, I think of the alignment problem as “find a way to use AGI systems to do at least some ambitious, high-impact things, without inadvertently causing anything terrible to happen relative to the operator’s explicit and implicit preferences”.
This is an easier goal than “find a way to safely use AGI systems to do everything the operator could possibly want” or “find a way to use AGI systems to do everything everyone could possibly want, in a way that somehow ‘correctly’ aggregates preferences”; I sometimes see problem statements like those referred to as the “full” alignment problem.
It’s a harder goal than “find a way to get AGI systems to do roughly what the operators have in mind, without necessarily accounting for failure modes the operators didn’t think of”. Following the letter of the law rather than the spirit is only OK insofar as the difference between letter and spirit is non-catastrophic relative to the operators’ true implicit preferences.
If developers and operators can’t foresee every potential failure mode, alignment should still mean that the system fails gracefully. If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe. This does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee.
This way of thinking about the alignment problem seems more useful to me because it factors out questions related to value disagreements and coordination between humans (including Bostrom’s first principal-agent problem), but leaves “aligned” contentful enough that it does actually mean we’re keeping our eye on the ball. We’re not ignoring how catastrophic-accident-prone the system actually is just because the developer was being dumb.
(I guess you’d want a stronger definition if you thought it was realistic that AGI developers might earnestly in their heart-of-hearts just want to destroy the world, since that case does make the alignment problem too trivial.
I’m similarly assuming that there won’t be a deep and irreconcilable values disagreement among stakeholders about whether we should conservatively avoid high risk of mindcrime. There may be factual disagreements aplenty, and perhaps there are irreconcilable casewise disagreements about where to draw certain normative category boundaries once you move past “just be conservative and leave a wide berth around anything remotely mindcrime-like” and start trying to implement “full alignment” that can spit out the normatively right answer to every important question.)
I wrote a post attempting to clarify my definition. I’d be curious about whether you agree.
> If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe.
Speaking to the discussion Wei Dai and I just had, I’m curious about whether you would consider any or all of these cases to be alignment failures:
1. There is an opportunity to engage in acausal trade that will disappear once your AI becomes too powerful, and the AI fails to take that opportunity before becoming too powerful.
2. Your AI doesn’t figure out how to do a reasonable “values handshake” with a competitor (where two agents agree to both pursue some appropriate compromise values in order to be Pareto efficient), conservatively avoids such handshakes, and then gets outcompeted because of the resulting inefficiency.
3. Your AI has well-calibrated normative uncertainty about how to do such handshakes, but decides that the competitive pressure to engage in them is strong enough to justify the risk, and makes a binding agreement that we would eventually recognize as suboptimal.
4. In fact, our values imply that it’s a moral imperative to develop as fast as possible; your AI fails to notice this counterintuitive argument, and therefore develops too slowly and leaves 50% of the value of the universe on the table.
5. Your AI fails to understand consciousness (like us), has well-calibrated moral uncertainty about the topic, but responds to competitive pressure by taking a risk and running some simulations that we would ultimately regard as experiencing enough morally relevant suffering to be called a catastrophe.
6. Your AI faces a moral decision about how much to fight for your values, and it decides to accept a risk of extinction that on reflection you’d consider unacceptably high.
7. Someone credibly threatens to blow up the world if your AI doesn’t give them stuff, and your AI capitulates even though on reflection we’d regard this as a mistake.
I’m not sure whether your definition is intended to include these. The sentence “this does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee” does suggest that interpretation, but it also sounds like you maybe aren’t explicitly thinking about problems of this kind or are assuming that they are unimportant.
I wouldn’t consider any of these “alignment problems.” These are distinct problems that we’ll face whether or not we build an AI. Whether they are important is mostly unrelated to the usual arguments for caring about AI alignment, and the techniques that we will use to solve them are probably unrelated to the techniques we will use to build an AI that won’t kill us outright. (Many of these problems are likely to be solved by an AI, just like P != NP is likely to be proved by an AI, but that doesn’t make either of them an alignment problem.)
If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
(I do agree that building an AI which took control of the world away from us but then was never able to resolve these problems would probably be a failure of alignment.)
I really like that list of points! Not that I’m Rob, but I’d mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I’d expect that an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I’d be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about “black swans” popping up. And when I said:
> 2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn’t mean “the AI is incorrigible if it’s not high-impact calibrated”, I meant “the AI, even if corrigible, would be unsafe if it’s not high-impact calibrated”.)
> If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
I think I understand your position much better now. The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”, and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it’s not good enough at figuring out what is right, even if it’s corrigible.
> The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...
> for it to be safe it should consult you before going ahead with any one of these
OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
> I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise…
By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study including understanding humans, historical knowledge, physics expertise). (I wish he would be more explicit about that to avoid confusion.)
> I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others.
Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, which areas do you expect it to be smarter than human, which dumber than human? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
> In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?