It’s possible I’m just too tired, but I don’t follow this argument. I think this is the section I’m confused about:
An AI that responds to “what’s your name” with “I am Wintermute, destroyer of worlds” will have a psychology which necessarily creates that response in that circumstance, even if its experience of saying it, or of choosing to say it, isn’t the same as a human’s would be.
I’m putting all this groundwork down because I think there’s a perception that RLHF is a limited tool, that the underlying artificial mind is just learning how to ‘pretend to be nice.’ But in this paradigm, we can see that the change in behavior is necessarily accompanied by a change in psychology.
Its mind actually travels down different paths than it did before, and this is sufficient to not destroy the world.
What do you mean by psychology here? It seems obvious that if you change the model to generate different outputs, the model will be different, but I’m not understanding what you mean by the model’s psychology changing, or why we would expect that change to automatically make it safe?
By psychology I mean its internal thought process.
I think some people have a model of AI where RLHF is a false cloak or a mask, and I’m pushing back against that idea. I’m saying that RLHF represents a real change in the underlying model, one that actually constrains the types of minds that could be in the box. It doesn’t select the psychology, but it does constrain it, and if it constrains it to an AI that consistently produces the right behaviors, that AI will most likely be one that continues to produce the right behaviors. So we don’t actually have to care about the contents of the box unless we want to make sure it’s not conscious.
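To make the “real change, not a mask” point concrete, here is a minimal sketch of an RLHF-style update (my own illustration, with a toy model and a made-up reward signal; none of the names or numbers come from the original post). The preference signal adjusts the same weights the model uses to generate every output, so there is no separate “niceness layer” sitting on top that could be peeled away.

```python
import torch
import torch.nn as nn

vocab_size = 100

# Toy stand-in for a language model: an embedding plus a projection back to
# the vocabulary. Sizes and names here are purely illustrative.
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)
        self.out = nn.Linear(16, vocab_size)

    def forward(self, token):
        return self.out(self.embed(token))

model = TinyLM()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def rlhf_step(prompt_token, reward_fn):
    """One REINFORCE-style update: sample a response, score it with a
    preference signal, and nudge the model's own weights accordingly."""
    logits = model(torch.tensor(prompt_token))
    dist = torch.distributions.Categorical(logits=logits)
    response = dist.sample()                 # the model's "chosen" output
    reward = reward_fn(response.item())      # stand-in for human preference
    loss = -dist.log_prob(response) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # the same weights that produce all behavior are changed

# A crude preference signal: reward the "nice" half of the vocabulary.
for _ in range(200):
    rlhf_step(prompt_token=0,
              reward_fn=lambda tok: 1.0 if tok < vocab_size // 2 else -1.0)
```

After enough updates the weights that generate everything the model says have shifted; whatever “psychology” is in the box is now one that consistently produces the rewarded behavior, rather than the old psychology wearing a costume.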
Sorry, faulty writing.