Rob Bensinger comments on Can corrigibility be learned safely?

Rob Bensinger 7 Apr 2018 15:43 UTC
LW: 8 AF: 3
AF
I define “AI alignment” these days roughly the way the Open Philanthropy Project does:
the problem of creating AI systems that will reliably do what their users want them to do even when AI systems become much more capable than their users across a broad range of tasks
More specifically, I think of the alignment problem as “find a way to use AGI systems to do at least some ambitious, high-impact things, without inadvertently causing anything terrible to happen relative to the operator’s explicit and implicit preferences”.
This is an easier goal than “find a way to safely use AGI systems to do everything the operator could possibly want” or “find a way to use AGI systems to do everything everyone could possibly want, in a way that somehow ‘correctly’ aggregates preferences”; I sometimes see problem statements like those referred to as the “full” alignment problem.
It’s a harder goal than “find a way to get AGI systems to do roughly what the operators have in mind, without necessarily accounting for failure modes the operators didn’t think of”. Following the letter of the law rather than the spirit is only OK insofar as the difference between letter and spirit is non-catastrophic relative to the operators’ true implicit preferences.
If developers and operators can’t foresee every potential failure mode, alignment should still mean that the system fails gracefully. If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe. This does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee.
This way of thinking about the alignment problem seems more useful to me because it factors out questions related to value disagreements and coordination between humans (including Bostrom’s first principal-agent problem), but leaves “aligned” contentful enough that it does actually mean we’re keeping our eye on the ball. We’re not ignoring how catastrophic-accident-prone the system actually is just because the developer was being dumb.
(I guess you’d want a stronger definition if you thought it was realistic that AGI developers might earnestly in their heart-of-hearts just want to destroy the world, since that case does make the alignment problem too trivial.
I’m similarly assuming that there won’t be a deep and irreconcilable values disagreement among stakeholders about whether we should conservatively avoid high risk of mindcrime, though there may be factual disagreements aplenty, and perhaps there are irreconcilable casewise disagreements about where to draw certain normative category boundaries once you move past “just be conservative and leave a wide berth around anything remotely mindcrime-like” and start trying to implement “full alignment” that can spit out the normatively right answer to every important question.)
- paulfchristiano 7 Apr 2018 19:50 UTC
  LW: 7 AF: 3
  AF Parent
  I wrote a post attempting to clarify my definition. I’d be curious about whether you agree.
  If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe.
  Speaking to the discussion Wei Dai and I just had, I’m curious about whether you would consider any or all of these cases to be alignment failures:
  - There is an opportunity to engage in acausal trade that will disappear once your AI becomes too powerful, and the AI fails to take that opportunity before becoming too powerful.
  - Your AI doesn’t figure out how to do a reasonable “values handshake” with a competitor (where two agents agree to both pursue some appropriate compromise values in order to be Pareto efficient), conservatively avoids such handshakes, and then gets outcompeted because of the resulting inefficiency.
  - Your AI has well-calibrated normative uncertainty about how to do such handshakes, but decides that the competitive pressure to engage in them is strong enough to justify the risk, and makes a binding agreement that we would eventually recognize as suboptimal.
  - In fact our values imply that it’s a moral imperative to develop as fast as possible, your AI fails to notice this counterintuitive argument, and therefore develops too slowly and leaves 50% of the value of the universe on the table.
  - Your AI fails to understand consciousness (like us), has well-calibrated moral uncertainty about the topic, but responds to competitive pressure by taking a risk and running some simulations that we would ultimately regard as experiencing enough morally relevant suffering to be called a catastrophe.
  - Your AI faces a moral decision about how much to fight for your values, and it decides to accept a risk of extinction that on reflection you’d consider unacceptably high.
  - Someone credibly threatens to blow up the world if your AI doesn’t give them stuff, and your AI capitulates even though on reflection we’d regard this as a mistake.
  I’m not sure whether your definition is intended to include these. The sentence “this does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee” does suggest that interpretation, but it also sounds like you maybe aren’t explicitly thinking about problems of this kind or are assuming that they are unimportant.
  I wouldn’t consider any of these “alignment problems.” These are distinct problems that we’ll face whether or not we build an AI. Whether they are important is mostly unrelated to the usual arguments for caring about AI alignment, and the techniques that we will use to solve them are probably unrelated to the techniques we will use to build an AI that won’t kill us outright. (Many of these problems are likely to be solved by an AI, just like P != NP is likely to be proved by an AI, but that doesn’t make either of them an alignment problem.)
  If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
  (I do agree that building an AI which took control of the world away from us but then was never able to resolve these problems would probably be a failure of alignment.)
  What links here?
  - Viliam's comment on Mark Xu’s Shortform by Mark Xu (17 Oct 2020 21:58 UTC; 4 points)
  - zhukeepa 8 Apr 2018 18:08 UTC
    LW: 7 AF: 3
    AF Parent
    I really like that list of points! Not that I’m Rob, but I’d mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I’d expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I’d be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about “black swans” popping up. And when I said:
    2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
    I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn’t mean “the AI is incorrigible if it’s not high-impact calibrated”, I meant “the AI, even if corrigible, would be unsafe it’s not high-impact calibrated”.)
    If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
    I think I understand your position much better now. The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”, and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it’s not good enough at figuring out what is right, even if it’s corrigible.
    What links here?
    zhukeepa's comment on Can corrigibility be learned safely? by Wei Dai (8 Apr 2018 19:57 UTC; 7 points)
    - paulfchristiano 8 Apr 2018 20:33 UTC
      LW: 7 AF: 3
      AF Parent
      he way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”
      I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosphical competence, understanding humans, historical knowledge, physics expertise...
      for it to be safe it should consult you before going ahead with any one of these
      OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.
      I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
      Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
      - Wei Dai 10 Apr 2018 7:42 UTC
        LW: 3 AF: 2
        AF Parent
        I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosphical competence, understanding humans, historical knowledge, physics expertise…
        By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study including understanding humans, historical knowledge, physics expertise). (I wish he would be more explicit about that to avoid confusion.)
        I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others.
        Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
        Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, which areas do you expect it to be smarter than human, which dumber than human? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
        In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
        Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?