It works the same way as when GPT writes TL;DR summaries. There is no reward for a correct TL;DR and no extra training: it just completes the sequence in the most probable way. Some self-driving cars work the same way: there is an end-to-end neural net, without any internal world model, and it just predicts what a normal car would do in this situation. I heard from an ML friend that they could achieve reasonably good driving with such models.
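(A minimal sketch of that "TL;DR by pure completion" idea, assuming a small off-the-shelf model; the model name, article text, and generation settings are illustrative, not part of the hypothetical.)

```python
# Sketch: no reward signal, no fine-tuning for summarization;
# the model just continues a prompt that ends with "TL;DR:".
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative model choice

article = "Researchers trained an end-to-end neural network to drive a car ..."
prompt = article + "\n\nTL;DR:"

# The model completes the sequence in the most probable way;
# whatever follows "TL;DR:" tends to read like a summary.
completion = generator(prompt, max_new_tokens=40, do_sample=False)
print(completion[0]["generated_text"][len(prompt):])
```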
Oh, OK. So perhaps we give it a humanoid robot body, so that it is as similar as possible to the humans in its dataset, and then we set up the motors so that the body does whatever GPT-7 predicts it will do, and GPT-7 is trained on datasets of human videos (say) so if you ask it to bring the coffee it probably will? Thanks, this is much clearer now.
What’s still a bit unclear to me is whether it has any ability to continue learning (I guess from the stipulated proposal the answer is “no”, but I’m just like “guys, why the hell did you build GPT-7-Bot instead of something that allowed better iterated amplification or something?”)
Is the spirit of the question “there is no ability to rewrite its architecture, or to re-train it on new data, or anything?”
Even GPT-2 could be conditioned on some recent events, given as “examples” in its prompt, so it has some form of memory. The GPT-7 robot has access to all the data it has observed before, so if it said “I want to kill Bill”, it will act in the future as if it has such a desire. In other words, it behaves as if it has memory.
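(A sketch of that “memory via in-context examples” point, under the same assumptions as above: nothing is written into the weights, the model just conditions on whatever recent text sits in its prompt; the history strings here are made up for illustration.)

```python
# Sketch: "memory" as conditioning on past observations placed in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative model choice

# Earlier observations and statements are simply prepended to the prompt.
history = [
    "Robot: I will bring Bill his coffee.",
    "Bill: Thanks, though it was a bit cold.",
]
prompt = "\n".join(history) + "\nRobot:"

# The continuation is shaped by the "remembered" lines above, so the robot
# acts as if it recalls what it said and heard before.
completion = generator(prompt, max_new_tokens=30, do_sample=False)
print(completion[0]["generated_text"][len(prompt):])
```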
It doesn’t have a built-in ability to rewrite its architecture, but it can write code on a laptop or order things on the internet. It doesn’t know much about its own internal structure, though, except that it is a very large GPT model.
Nod. And does it seem to have the ability to gain new cognitive skills? Like, if it reads a bunch of LessWrong posts or attends CFAR, does its ‘memory’ start to include things that prompt it to, say, “stop and notice it’s confused” and “form more hypotheses when facing weird phenomena” and “cultivate curiosity about its own internal structure”?
(I assume so, just double-checking.)
In that case, it seems like the most obvious ways to keep it friendly are the same as the ways you’d make a human friendly (expose it to ideas you think will guide it onto a useful moral trajectory).
I’m not actually sure what sort of other actions you’re allowing in the hypothetical.