It’s possible I’m just too tired, but I don’t follow this argument. I think this is the section I’m confused about:
An AI that responds to “what’s your name” with “I am Wintermute, destroyer of worlds” will have a psychology which necessarily creates that response in that circumstance, even if its experience of saying it, or of choosing to say it, isn’t the same as a human’s would be.
I’m putting all this groundwork down because I think there’s a perception that RLHF is a limited tool, that the underlying artificial mind is just learning how to ‘pretend to be nice.’ But in this paradigm, we can see that the change in behavior is necessarily accompanied by a change in psychology.
Its mind actually travels down different paths than it did before, and this is sufficient to not destroy the world.
What do you mean by psychology here? It seems obvious that if you change the model to generate different outputs, the model will be different, but I’m not understanding what you mean by the model’s psychology changing, or why we would expect that change to automatically make it safe?
By psychology I mean its internal thought process.
I think some people have a model of AI where RLHF is a false cloak or a mask, and I’m pushing back against that idea. I’m saying that RLHF represents a real change in the underlying model, one that actually constrains the types of minds that could be in the box. It doesn’t select the psychology, but it does constrain it, and if it constrains it to an AI that consistently produces the right behaviors, that AI will most likely be one that continues to produce the right behaviors. So we don’t actually have to care about the contents of the box unless we want to make sure it’s not conscious.
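To make the “real change, not a mask” point concrete, here is a minimal sketch of an RLHF-style update (my own illustration, with a toy model and a made-up reward signal; none of the names or numbers come from the original post). The preference signal adjusts the same weights the model uses to generate every output, so there is no separate “niceness layer” sitting on top that could be peeled away.

```python
import torch
import torch.nn as nn

vocab_size = 100

# Toy stand-in for a language model: an embedding plus a projection back to
# the vocabulary. Sizes and names here are purely illustrative.
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)
        self.out = nn.Linear(16, vocab_size)

    def forward(self, token):
        return self.out(self.embed(token))

model = TinyLM()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def rlhf_step(prompt_token, reward_fn):
    """One REINFORCE-style update: sample a response, score it with a
    preference signal, and nudge the model's own weights accordingly."""
    logits = model(torch.tensor(prompt_token))
    dist = torch.distributions.Categorical(logits=logits)
    response = dist.sample()                 # the model's "chosen" output
    reward = reward_fn(response.item())      # stand-in for human preference
    loss = -dist.log_prob(response) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # the same weights that produce all behavior are changed

# A crude preference signal: reward the "nice" half of the vocabulary.
for _ in range(200):
    rlhf_step(prompt_token=0,
              reward_fn=lambda tok: 1.0 if tok < vocab_size // 2 else -1.0)
```

After enough updates the weights that generate everything the model says have shifted; whatever “psychology” is in the box is now one that consistently produces the rewarded behavior, rather than the old psychology wearing a costume.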
Sorry, faulty writing.