Saying “vast majority” seems straightforwardly misleading. Bostrom just says “a wide range”; it’s a huge leap from there to “vast majority”, which we have no good justification for making. In particular, by doing so you’re dismissing bounded goals. And if you’re talking about a “state of ignorance” about AI, then you have little reason to override the priors we have from previous technological development, like “we build things that do what we want”.
On your analogy, see the last part of my reply to Adam below. The process of building things intrinsically picks out a small section of the space of possibilities.
I disagree that we have no good justification for making the “vast majority” claim; I think it’s in fact true in the relevant sense.
I disagree that we had little reason to override the priors we had from previous tech development like “we build things that do what we want.” You are playing reference class tennis; we could equally have had a prior of “AI is in the category of ‘new invasive species appearing’, so our default should be that it displaces the old species, just as humans wiped out the Neanderthals,” or a prior of “Risk from AI is in the category of side-effects of new technology; no one is doubting that the paperclip-making AI will in fact make lots of paperclips, the issue is whether it will have unintended side-effects, and historically most new technologies do.” Now, there’s nothing wrong with playing reference class tennis; it’s what you should do when you are very ignorant, I suppose. My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of “Huh, AI could be really dangerous,” and people were totally right to update in that direction on the basis of these arguments. Moreover, these arguments have been more-or-less vindicated by the last ten years or so, in that on further inspection AI does indeed seem to be potentially very dangerous, and it does indeed seem to be not safe/friendly/etc. by default. (Perhaps one way of thinking about these arguments is that they were throwing one more reference class into the game of tennis: the “space of possible goals” reference class.)
I set up my analogy specifically to avoid your objection; the process of rolling a loaded die is intrinsically heavily biased towards a small section of the space of possibilities.
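To make the loaded-die point concrete, here is a toy sketch (the specific weights are made up for illustration): even though every face is possible, the generating process piles nearly all of its outcome mass onto a small subset of faces.

```python
import random
from collections import Counter

# Illustrative weights for a heavily loaded six-sided die:
# face 6 gets almost all of the probability mass (made-up numbers).
faces = [1, 2, 3, 4, 5, 6]
weights = [0.01, 0.01, 0.01, 0.01, 0.01, 0.95]

rolls = random.choices(faces, weights=weights, k=100_000)
counts = Counter(rolls)

for face in faces:
    print(face, counts[face] / len(rolls))
# Nearly all rolls land on face 6: the process is intrinsically biased
# towards a small section of the space of possible outcomes.
```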
I disagree that we have no good justification for making the “vast majority” claim.
Can you point me to the sources which provide this justification? Your analogy seems to only be relevant conditional on this claim.
My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of “Huh AI could be really dangerous” and people were totally right to update in that direction on the basis of these arguments
They were right to update in that direction, but that doesn’t mean that they were right to update as far as they did. E.g. when Eliezer says that the default trajectory gives us approximately a zero percent chance of success, this is clearly going too far, given the evidence. But many people made comparably large updates.
I think I agree that they may have been wrong to update as far as they did. (Credence = 50%) So maybe we don’t disagree much after all.
As for sources which provide that justification, oh, I don’t remember, I’d start by rereading Superintelligence and Yudkowsky’s old posts and try to find the relevant parts. But here’s my own summary of the argument as I understand it:
1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We’ve even tried hard to imagine goals that aren’t of this sort, and so far we haven’t come up with anything. Things that seem promising, like “Place this strawberry on that plate, then do nothing else” actually don’t work when you unpack the details.
3. Therefore, we are justified in thinking that the vast majority of possible ASI goals will lead to doom via instrumental convergence.
I agree that our thinking has improved since then, with more work being done on impact measures and bounded goals and quantilizers and whatnot that makes such things seem not-totally-impossible to achieve. And of course the model of ASI as a rational agent with a well-defined goal has justly come under question also. But given the context of how people were thinking about things at the time, I feel like they would have been justified in making the “vast majority of possible goals” claim, even if they restricted themselves to more modest “wide range” claims.
I don’t see how my analogy is only relevant conditional on this claim. To flip it around, you keep mentioning how AI won’t be a random draw from the space of all possible goals—why is that relevant? Very few things are random draws from the space of all possible X, yet reasoning about what’s typical in the space of possible X’s is often useful. Maybe I should have worked harder to pick a more real-world analogy than the weird loaded die one. Maybe something to do with thermodynamics: the space of all possible states my scrambled eggs could be in does contain states in which they spontaneously un-scramble later, but it’s a very small region of that space.
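As a back-of-the-envelope illustration of how tiny such “special” regions get (my own toy model, not a real thermodynamic calculation): if you treat a scrambled system as a random arrangement of N distinguishable pieces, only 1 in N! arrangements is the single “unscrambled” one.

```python
import math

# Fraction of random arrangements of N distinguishable pieces that
# happen to be the single "unscrambled" ordering: 1 / N!
for n in (10, 50, 100):
    log10_fraction = -math.lgamma(n + 1) / math.log(10)  # log10(1 / n!)
    print(f"N = {n:3d}: about 10^{log10_fraction:.0f} of all arrangements")
# Even for N = 100 the "special" region is roughly 10^-158 of the space;
# real scrambled eggs have vastly more degrees of freedom than that.
```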
1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We’ve even tried hard to imagine goals that aren’t of this sort, and so far we haven’t come up with anything. Things that seem promising, like “Place this strawberry on that plate, then do nothing else” actually don’t work when you unpack the details.
Okay, this is where we disagree. I think what “unpacking the details” actually gives you is something like: “We don’t know how to describe the goal ‘place this strawberry on that plate’ in the form of a simple utility function over states of the world which can be coded into a superintelligent expected utility maximiser in a safe way”. But this proves far too much, because I am a general intelligence, and I am perfectly capable of having the goal which you described above in a way that doesn’t lead to catastrophe—not because I’m aligned with humans, but because I’m able to have bounded goals. And I can very easily imagine an AGI having a bounded goal in the same way. I don’t know how to build a particular bounded goal into an AGI—but nobody knows how to code a simple utility function into an AGI either. So why privilege the latter type of goals over the former?
Also, in your previous comment, you give an old argument and say that, based on this, “they would have been justified in making the “vast majority of possible goals” claim”. But in the comment before that, you say “I disagree that we have no good justification for making the “vast majority” claim” in the present tense. Just to clarify: are you defending only the past tense claim, or also the present tense claim?
given the context of how people were thinking about things at the time, I feel like they would have been justified in making the “vast majority of possible goals” claim, even if they restricted themselves to more modest “wide range” claims.
Other people being wrong doesn’t provide justification for making very bold claims, so I don’t see why the context is relevant. If this is a matter of credit assignment, then I’m happy to say that making the classic arguments was very creditworthy and valuable. That doesn’t justify all subsequent inferences from them. In particular, a lack of good counterarguments at the time should not be taken as very strong evidence, since it often takes a while for good criticisms to emerge.
Again, I’m not sure we disagree that much in the grand scheme of things—I agree our thinking has improved over the past ten years, and I’m very much a fan of your more rigorous way of thinking about things.
FWIW, I disagree with this:
But this proves far too much, because I am a general intelligence, and I am perfectly capable of having the goal which you described above in a way that doesn’t lead to catastrophe—not because I’m aligned with humans, but because I’m able to have bounded goals.
There are other explanations for this phenomenon besides “I’m able to have bounded goals.” One is that you are in fact aligned with humans. Another is that you would in fact lead to catastrophe-by-the-standards-of-X if you were powerful enough and had different goals than X. For example, suppose that right after reading this comment, you find yourself transported out of your body and placed into the body of a giant robot on an alien planet. The aliens have trained you to be smarter than them and faster than them; it’s a “That Alien Message” scenario, basically. And you see that the aliens are sending you instructions… “PUT BERRY… ON PLATE… OVER THERE…” You notice that these aliens are idiots and left their work lying around the workshop, so you can easily kill them and take command of the computer and rescue all your comrades back on Earth and whatnot, and it really doesn’t seem like this is a trick or anything, they really are that stupid… Do you put the strawberry on the plate? No.
What people discovered back then was that you think you can “very easily imagine an AGI with bounded goals,” but this is on the same level as how some people think they can “very easily imagine an AGI considering doing something bad, and then realizing that it’s bad, and then doing good things instead.” Like, yeah it’s logically possible, but when we dig into the details we realize that we have no reason to think it’s the default outcome and plenty of reason to think it’s not.
I was originally making the past tense claim, and I guess maybe now I’m making the present tense claim? Not sure, I feel like I probably shouldn’t, you are about to tear me apart, haha...
Other people being wrong can sometimes provide justification for making “bold claims” of the form “X is the default outcome.” This is because claims of that form are routinely justified on even less evidence, namely no evidence at all. Implicit in our priors about the world are bajillions of claims of that form. So if you have a prior that says AI taking over is the default outcome (because AI not taking over would involve something special like alignment or bounded goals or whatnot), then you are already justified, given that prior, in thinking that AI taking over is the default outcome. And if all the people you encounter who disagree are giving terrible arguments, then that’s a nice cherry on top which provides further evidence.
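Here is a toy Bayes calculation of that point, with made-up numbers: if the prior already favours “X is the default outcome” and the counterarguments you encounter are weak (a likelihood ratio close to 1, or even slightly favouring the default), the posterior barely moves from the prior.

```python
# Toy Bayes update; all numbers are hypothetical and purely illustrative.
prior_default = 0.8          # prior that "X is the default outcome"
# Observing only weak counterarguments is nearly as likely either way,
# and arguably slightly more likely if the default claim is true.
likelihood_if_true = 0.6     # P(only weak counterarguments | default true)
likelihood_if_false = 0.5    # P(only weak counterarguments | default false)

posterior = (prior_default * likelihood_if_true) / (
    prior_default * likelihood_if_true
    + (1 - prior_default) * likelihood_if_false
)
print(round(posterior, 3))   # ~0.828: the prior does almost all of the work
```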
I think ultimately our disagreement is not worth pursuing much here. I’m not even sure it’s a real disagreement, given that you think the classic arguments did justify updates in the right direction to some extent, etc., and I agree that people probably updated too strongly, etc. Though the bit about bounded goals was interesting, and seems worth pursuing.
Thanks for engaging with me btw!
I guess maybe now I’m making the present tense claim [that we have good justification for making the “vast majority” claim]?
I mean, on a very skeptical prior, I don’t think we have good enough justification to believe it’s more probable than not that take-over-the-world behavior will be robustly incentivized for the actual TAI we build, but I think we have somewhat more evidence for the ‘vast majority’ claim than we did before.
(And I agree with a point I expect Richard to make, which is that the power-seeking theorems apply for optimal agents, which may not look much at all like trained agents)
I also wrote about this (and received a response from Ben Garfinkel) about half a year ago.
Currently you probably have a very skeptical prior about what the surface of the farthest earth-sized planet from Earth in the Milky Way looks like. Yet you are very justified in being very confident it doesn’t look like this:
[image: a map of Earth]
Why? Because this is a very small region in the space of possibilities for earth-sized-planets-in-the-Milky-Way. And yeah, it’s true that planets are NOT drawn randomly from that space of possibilities, and it’s true that this planet is in the reference class of “Earth-sized planets in the Milky Way” and the only other member of that reference class we’ve observed so far DOES look like that… But given our priors, those facts are basically irrelevant.
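To put a rough number on “very small region” (an illustrative calculation of my own, with an arbitrary grid size): even a crude land/water map has so many possible configurations that any particular map, plus everything closely resembling it, is a vanishing fraction of the space.

```python
import math

# Coarse grid: 180 x 90 cells, each either land or water (illustrative choice).
cells = 180 * 90
log10_total = cells * math.log10(2)        # log10 of the number of distinct maps
print(f"log10(number of possible maps) = {log10_total:.0f}")   # about 4877

# Generously count every map within 100 cell-flips of a given map as
# "looking like it"; that neighbourhood is still a vanishing fraction.
neighbourhood = sum(math.comb(cells, k) for k in range(101))
log10_fraction = math.log10(neighbourhood) - log10_total
print(f"fraction of maps resembling any given one = 10^{log10_fraction:.0f}")
```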
I think this is a decent metaphor for what was happening ten years ago or so with all these debates about orthogonality and instrumental convergence. People had a confused understanding of how minds and instrumental reasoning worked; then people like Yudkowsky and Bostrom became less confused by thinking about the space of possible minds and goals and whatnot, and convinced themselves and others that actually the situation is analogous to this planets example (though maybe less extreme): The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won’t be fine by default. I think they were right about this and still are right about this. Nevertheless, I’m glad that we are moving away from this skeptical-priors, burden-of-proof stuff and towards more rigorous understandings. Just as I’d see it as progress if some geologists came along and said “Actually we have a pretty good idea now of how continents drift, and so we have some idea of what the probability distribution over map-images is like, and maps that look anything like this one have very low measure, even conditional on the planet being earth-sized and in the Milky Way.” But I’d see it as “confirming more rigorously what we already knew, just in case, cos you never really know for sure” progress.
The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won’t be fine by default.
I’m happy to wrap up this conversation in general, but it’s worth noting before I do that I still strongly disagree with this comment. We’ve identified a couple of interesting facts about goals, like “unbounded large-scale final goals lead to convergent instrumental goals”, but we have nowhere near a good enough understanding of the space of goal-like behaviour to say that everything apart from a “very small region” will lead to disaster. This is circular reasoning from the premise that goals are by default unbounded and consequentialist to the conclusion that it’s very hard to get bounded or non-consequentialist goals. (It would be rendered non-circular by arguments about why coherence theorems about utility functions are so important, but there’s been a lot of criticism of those arguments and no responses so far.)
OK, interesting. I agree this is a double crux. For reasons I’ve explained above, it doesn’t seem like circular reasoning to me, it doesn’t seem like I’m assuming that goals are by default unbounded and consequentialist etc. But maybe I am. I haven’t thought about this as much as you have, my views on the topic have been crystallizing throughout this conversation, so I admit there’s a good chance I’m wrong and you are right. Perhaps I/we will return to it one day, but for now, thanks again and goodbye!