No, this is not the “tasty ice cream flavors” problem. The problem there is that the concept is inherently relative to a person. That problem could apply to “human values”, but that’s a separate issue from what dxu is talking about.
The problem is that “what a committee of famous moral philosophers would endorse saying/doing”, or human written text containing the phrase “human values”, is a proxy for human values, not a direct pointer to the actual concept. And if a system is trained to predict what the committee says, or what the text says, then it will learn the proxy, but that does not imply that it directly uses the concept.
Well, the moral judgements of a high-fidelity upload of a benevolent human are also a proxy for human values—an inferior proxy, actually. Seems to me you’re letting the perfect be the enemy of the good.
It doesn’t matter how high-fidelity the upload is or how benevolent the human is, I’m not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. “Don’t let the perfect be the enemy of the good” is advice for writing emails and cleaning the house, not nuclear security.
The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism.
Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually have. They’d probably be better than many of the worst-case scenarios, but they still wouldn’t be a good scenario, let alone the best one. Humans just don’t have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.
It doesn’t matter how high-fidelity the upload is or how benevolent the human is, I’m not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that.
Here are some of the people who have the power to set off nukes right now:
Donald Trump
Vladimir Putin
Kim Jong-un
Both parties in this conflict
And this conflict
Tell that to the Norwegian commandos who successfully sabotaged Hitler’s nuclear weapons program.
“A good plan violently executed now is better than a perfect plan executed at some indefinite time in the future.”—George Patton
Just because it’s in your nature (and my nature, and the nature of many people who read this site) to be a cautious nerd, does not mean that the cautious nerd orientation is always the best orientation to have.
In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome. It’s a classic motte-and-bailey:
“It’s very hard to build an AGI which isn’t a paperclipper!”
“Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI...”
“Yeah but we gotta be super perfectionistic because there is so much at stake!”
Your final “humans will misuse AI” worry may be justified, but I think naive deployment of this worry is likely to be counterproductive. Suppose there are two types of people, “cautious” and “incautious”. Suppose that the “humans will misuse AI” worry discourages cautious people from developing AGI, but not incautious people. So now we’re in a world where the first AGI is most likely controlled by incautious people, making the “humans will misuse AI” worry even more severe.
Humans just don’t have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.
If you’re willing to grant the premise of the technical alignment problem being solved, shooting oneself in the foot would appear to be much less of a worry, because you can simply tell your FAI “please don’t let me shoot myself in the foot too badly”, and it will prevent you from doing that.
“It’s very hard to build an AGI which isn’t a paperclipper!”
“Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI...”
“Yeah but we gotta be super perfectionistic because there is so much at stake!”
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper. Yes, there are straightforward ways one might be able to create a helpful non-paperclipper AGI. But that “might” is carrying a lot of weight. All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don’t know exactly what those parameter ranges are.
It’s sort of like saying:
“It’s very hard to design a long bridge which won’t fall down!”
“Well actually here are some straightforward ways one might be able to create a long non-falling-down bridge...” <shows picture of a wooden truss>
What I’m saying is, that truss design is 100% going to fail once it gets big enough, and we don’t currently know how big that is. When I say “it’s hard to design a long bridge which won’t fall down”, I do not mean a bridge which might not fall down if we’re lucky and just happen to be within the safe parameter range.
In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome.
These are sufficient conditions for a careful strategy to make sense, not necessary conditions. Here’s another set of sufficient conditions, which I find more realistic: the gains to be had in reducing AI risk are binary. Either we find the “right” way of doing things, in which case risk drops to near-zero, or we don’t, in which case it’s a gamble and we don’t have much ability to adjust the chances/payoff. There are no significant marginal gains to be had.
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper.
This is simultaneously
a major retreat from the “default outcome is doom” thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that is 99.9% likely to be safe, which is very much incompatible with “default outcome is doom”)
unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn’t good enough for you)
You’ve picked a position vaguely in between the motte and the bailey and said “the motte and the bailey are both equivalent to this position!” That doesn’t look at all true to me.
All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don’t know exactly what those parameter ranges are.
This is a very strong claim which to my knowledge has not been well-justified anywhere. Daniel K agreed with me the other day that there isn’t a standard reference for this claim. Do you know of one?
There are a couple problems I see here:
Simple is not the same as obvious. Even if someone at some point tried to think of every obvious solution and justifiably discarded them all, there are probably many “obvious” solutions they didn’t think of.
Nothing ever gets counted as evidence against this claim. Simple proposals get rejected on the basis that everyone knows simple proposals won’t work.
A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety. Maybe there are good arguments for that, but the problem is that if you’re not careful, your view of reality is gonna get distorted. Which means community wisdom on claims such as “simple solutions never work” is likely to be systematically wrong. “Everyone knows X”, without a good written defense of X, or a good answer to “what would change the community’s mind about X”, is fertile ground for information cascades etc. And this is on top of standard ideological homophily problems (the AI safety community is a very self-selected subset of the broader AI research world).
What I’m saying is, that truss design is 100% going to fail once it gets big enough, and we don’t currently know how big that is. When I say “it’s hard to design a long bridge which won’t fall down”, I do not mean a bridge which might not fall down if we’re lucky and just happen to be within the safe parameter range.
My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks. This is logically rude. And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not. From my perspective, you’ve pulled this conversational move multiple times in this thread. It seems to be pretty common when I have discussions with AI safety people. That’s part of why I find the discussions so frustrating. My view is that this is a cultural problem which has to be solved for the AI safety community to do much useful AI safety work (as opposed to “complaining about how hard AI safety is” work, which is useful but insufficient).
Anyway, I’ll let you have the last word in this thread.
For what it’s worth, my perception of this thread is the opposite of yours: it seems to me John Wentworth’s arguments have been clear, consistent, and easy to follow, whereas you (John Maxwell) have been making very little effort to address his position, instead choosing to repeatedly strawman said position (and also repeatedly attempting to lump in what Wentworth has been saying with what you think other people have said in the past, thereby implicitly asking him to defend whatever you think those other people’s positions were).
Whether you’ve been doing this out of a lack of desire to properly engage, an inability to comprehend the argument itself, or some other odd obstacle is in some sense irrelevant to the object-level fact of what has been happening during this conversation. You’ve made your frustration with “AI safety people” more than clear over the course of this conversation (and I did advise you not to engage further if that was the case!), but I submit that in this particular case (at least), the entirety of your frustration can be traced back to your own lack of willingness to put forth interpretive labor.
To be clear: I am making this comment in this tone (which I am well aware is unkind) because there are multiple aspects of your behavior in this thread that I find not only logically rude, but ordinarily rude as well. I more or less summarized these aspects in the first paragraph of my comment, but there’s one particularly onerous aspect I want to highlight: over the course of this discussion, you’ve made multiple references to other uninvolved people (either with whom you agree or disagree), without making any effort at all to lay out what those people said or why it’s relevant to the current discussion. There are two examples of this from your latest comment alone:
Daniel K agreed with me the other day that there isn’t a standard reference for this claim. [Note: your link here is broken; here’s a fixed version.]
A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety.
Ignoring the question of whether these two quoted statements are true (note that even the fixed version of the link above goes only to a top-level post, and I don’t see any comments on that post from the other day), this is counterproductive for a number of reasons.
Firstly, it’s inefficient. If you believe a particular statement is false (and furthermore, that your basis for this belief is sound), you should first attempt to refute that statement directly, which gives your interlocutor the opportunity to either counter your refutation or concede the point, thereby moving the conversation forward. If you instead counter merely by invoking somebody else’s opinion, you both increase the difficulty of answering and end up offering weaker evidence.
Secondly, it’s irrelevant. John Wentworth does not work at MIRI (neither does Daniel Kokotajlo, for that matter), so bringing up aspects of MIRI’s position you dislike does nothing but highlight a potential area where his position differs from MIRI’s. (I say “potential” because it’s not at all obvious to me that you’ve been representing MIRI’s position accurately.) In order to properly challenge his position, again it becomes more useful to critique his assertions directly rather than round them off to the closest thing said by someone from MIRI.
Thirdly, it’s a distraction. When you regularly reference a group of people who aren’t present in the actual conversation, repeatedly make mention of your frustration and “grumpiness” with those people, and frequently compare your actual interlocutor’s position to what you imagine those people have said, all while your actual interlocutor has said nothing to indicate affiliation with or endorsement of those people, it doesn’t paint a picture of an objective critic. To be blunt: it paints a picture of someone with a one-sided grudge against the people in question, who is attempting to inject that grudge into conversations where it shouldn’t be present.
I hope future conversations can be more pleasant than this.
I appreciate the defense and agree with a fair bit of this. That said, I’ve actually found the general lack of interpretive labor somewhat helpful in this instance—it’s forcing me to carefully and explicitly explain a lot of things I normally don’t, and John Maxwell has correctly pointed out a lot of seeming-inconsistencies in those explanations. At the very least, it’s helping make a lot of my models more explicit and legible. It’s mentally unpleasant, but a worthwhile exercise to go through.
I think I want John to feel able to have this kind of conversation when it feels fruitful to him, and not feel obligated to do so otherwise. I expect this is the case, but just wanted to make it common knowledge.
This is a very strong claim which to my knowledge has not been well-justified anywhere. Daniel K agreed with me the other day that there isn’t a standard reference for this claim. Do you know of one?
There isn’t a standard reference because the argument takes one sentence, and I’ve been repeating it over and over again: what would Bayesian updates on low-level physics do? That’s the unique solution with best-possible predictive power, so we know that anything which scales up to best-possible predictive power in the limit will eventually behave that way.
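To spell that one sentence out slightly more formally (a standard consistency sketch, under the simplifying assumptions of i.i.d. observations, nonzero prior mass on the best-predicting hypotheses, and finite expectations for the log-likelihood ratios; here $p^*$ is the true data distribution and $p_h$ is hypothesis $h$’s predictive distribution):

$$
P(h \mid x_{1:n}) \propto P(h)\prod_{t=1}^{n} p_h(x_t),
\qquad
\frac{1}{n}\log\frac{P(h \mid x_{1:n})}{P(h' \mid x_{1:n})}
\;\xrightarrow{\ \text{a.s.}\ }\;
D_{\mathrm{KL}}(p^* \,\|\, p_{h'}) - D_{\mathrm{KL}}(p^* \,\|\, p_h).
$$

Posterior mass therefore flows to whichever hypotheses have the best predictive power, so any learner that approaches best-possible predictive power in the limit must converge to the same predictions as that Bayesian ideal.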
(BTW I think that link is dead)
My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks. This is logically rude. And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not.
The “what would Bayesian updates on a low-level model do?” question is exactly the argument that the bridge design cannot be extended indefinitely, which is why I keep bringing it up over and over again.
This does point to one possibly-useful-to-notice ambiguous point: the difference between “this method would produce an aligned AI” vs “this method would continue to produce aligned AI over time, as things scale up”. I am definitely thinking mainly about long-term alignment here; I don’t really care about alignment on low-power AI like GPT-3 except insofar as it’s a toy problem for alignment of more powerful AIs (or insofar as it’s profitable, but that’s a different matter).
I’ve been less careful than I should be about distinguishing these two in this thread. All these things which we’re saying “might work” are things which might work in the short term on some low-power AI, but will definitely not work in the long term on high-power AI. That’s probably part of why it seems like I keep switching positions—I haven’t been properly distinguishing when we’re talking short-term vs long-term.
A second comment on this:
instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks
If we want to make a piece of code faster, the first step is to profile the code to figure out which step is the slow one. If we want to make a beam stronger, the first step is to figure out where it fails. If we want to extend a bridge design, the first step is to figure out which piece fails under load if we just elongate everything.
Likewise, if we want to scale up an AI alignment method, the first step is to figure out exactly how it fails under load as the AI’s capabilities grow.
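For the software half of the analogy, the “profile first” step looks something like this minimal Python sketch (the pipeline functions are hypothetical placeholders, not anyone’s real code):

```python
# A minimal sketch of the "profile first" step, using Python's built-in cProfile.
# The pipeline functions below are hypothetical placeholders.
import cProfile
import pstats


def load_data():
    return list(range(100_000))


def transform(data):
    return [x ** 2 for x in data]


def slow_pipeline():
    return sum(transform(load_data()))


profiler = cProfile.Profile()
profiler.enable()
slow_pipeline()
profiler.disable()

# Sort by cumulative time to see which step dominates; that's the piece to fix
# first (the analogue of finding which part of the bridge fails under load).
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```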
I think you currently do not understand the failure mode I keep pointing to by saying “what would Bayesian updates on low-level physics do?”. Elsewhere in the thread, you said that optimizing “for having a diverse range of models that all seem to fit the data” would fix the problem, which is my main evidence that you don’t understand the problem. The problem is not “the data underdetermines what we’re asking for”, the problem is “the data fully determines what we’re asking for, and we’re asking for a proxy rather than the thing we actually want”.
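To make that last distinction concrete, here is a deliberately-toy Python sketch (hypothetical functions, standing in for no particular proposal): a proxy that matches the target concept everywhere in the training data, so nothing about the data is underdetermined, yet the two come apart exactly where an optimizer is likely to push.

```python
# Deliberately-toy illustration of "learning the proxy, not the concept".
# Everything here is hypothetical.

def true_concept(x):
    # the thing we actually want (unknown to the training process)
    return x < 10


def proxy_label(x):
    # what the "committee" / the text says: agrees with the concept on
    # ordinary inputs, silently diverges on extreme ones
    return (x < 10) if x < 100 else True


train_xs = range(50)  # training data never leaves the benign regime
train_set = [(x, proxy_label(x)) for x in train_xs]

# On the training distribution the proxy and the concept are indistinguishable,
# so the data "fully determines" what is being asked for: the proxy.
assert all(proxy_label(x) == true_concept(x) for x in train_xs)

# But an optimizer searching for inputs the learned criterion endorses can
# easily land where proxy and concept come apart.
extreme_x = 150
print(proxy_label(extreme_x), true_concept(extreme_x))  # True False
```

Any learner fit to train_set is fit to proxy_label, and no amount of additional data of the same kind distinguishes it from true_concept.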