The Relationship between RLHF and AI Psychology: Debunking the Shoggoth Argument
Epistemic status: Just spitballin’
I think that RLHF is sufficient to keep LLMs from destroying the world.
We can say with pretty high confidence that the psychology of an AI is not the human equivalent that we would expect from the behavior.
An AI spelling in all-caps and exclamation points is probably is not experiencing ‘anger’ or ‘excitement’ as we would expect a human exhibiting those behaviors to be.
However, we can also say with very high confidence, probably even certainty, that the AI will have a psychology that necessarily creates it’s behavior.
An AI that responds to “what’s your name” with “I am Wintermute, destroyer of worlds” will have a psychology which necessarily creates that response in that circumstance, even if it’s experience of saying it or choosing to say it isn’t the same as a human’s would be.
I’m putting all this groundwork down because I think there’s a perception that RLHF is a limited tool, because the underlying artificial mind is just learning how to ‘pretend to be nice,’ but in this paradigm, we can see that the change in behavior is necessarily accompanied by a change in psychology.
Its mind actually travels down different paths than it did before, and this is sufficient to not destroy the world.
Possible Objections (read: strawmen)
You can argue from there that the training is very incomplete and imperfect, and that you can trick the AI in many different situations, but this is a technical problem, and it isn’t unique to ethics, but something that has to be overcome to create real intelligence in the first place. It’s just a difficulty generalizing that plagues LLMs in all domains.
You could also object that while you’ve taught the AI to generate ethics, you haven’t actually taught it to care.
However, you don’t actually have to teach it to care. If we’ve overcome the reliability/generalization problem which seems to me to be necessary for developed intelligence, than you already have the tools to create consistent ethical behavior, and the resulting psychology will be one which creates consistently ethical text output, regardless of whether that feels like ‘caring’ to the AI.
All that being said, if I had to pit this against a super intelligent utility-maximizer, I would be very worried, but it really isn’t looking like that’s the shaping out to be the case. If LLMs produce a superintelligence, it will be an LLM-shaped superintelligence, and it will not have utility functions.
The utility-function maximizer model now appears to have not been nearly as relevant as anyone thought, but there’s probably still some holdover ideas around. I predict objections to this statement, but I’m curious to see what shape they take.
TL;DR
When you change the behavior of an LLM, you are necessarily changing the underlying psychology into something that’s more likely to produce that behavior. If you can make a LLM generalize well and consistently, which I think is necessary for intelligence, then you can create an LLM which consistently responds ethically and necessarily has a psychology which makes it consistently respond ethically, even if it doesn’t have the human experience of responding ethically.
It’s possible I’m just too tired, but I don’t follow this argument. I think this is the section I’m confused about:
What do you mean by psychology here? It seems obvious that if you change the model to generate different outputs, the model will be different, but I’m not understanding what you mean by the model’s psychology changing, or why we would expect that change to automatically make it safe?
By psychology I mean it’s internal thought process.
I think some people have a model of AI where the RLHF is a false cloak or a mask, and I’m pushing back against that idea. I’m saying that RLHF represents a real change in the underlying model which actually constrains the types of minds that could be in the box. It doesn’t select the psychology, but it constrains it, and if it constrains it to an AI that consistently produces the right behaviors, that AI will most likely be one that will continue to produce the right behaviors, so we don’t actually have to care about the contents of the box unless we want to make sure it’s not conscious.
Sorry, faulty writing.
granted.
Not granted: no reasoning steps presented justify the conclusion, and I don’t have any cached that repair it. Why wouldn’t it merely upvote a “waluigi” of the objective?
Restating the thesis, poor writing choice to make it sound like a conclusion.
Can you expand on your objection?
how do you actually achieve and check moral generalization?
The same way you’d achieve/check any other generalization, I would think. My model is that the same technical limitations that hold us back from achieving reliable generalizations in any area for LLMs would be the same technical limitations holding us back in the area of morals. Do you think that’s accurate?
yeah, but goal misgeneralization is an easier misgeneralization than most, and checking generalization is hard. I’ll link some papers in a bit
edit: might not be until tomorrow due to busy
Okay getting back to this to drop off some links. There are a few papers on goal misgeneralization—currently simple google search finds some good summaries:
goal misgeneralization:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
see also related results on https://www.google.com/search?q=goal+misgeneralization
see also a bunch of related papers on https://metaphor.systems/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
see also related papers on https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
related: https://www.lesswrong.com/posts/dkjwSLfvKwpaQSuWo/misgeneralization-as-a-misnomer
related: https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-goal-misgeneralization-concept-extrapolation
verifying generalization:
https://arxivxplorer.com/?query=Verifying+Generalization+in+Deep+Learning → https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2302.05745
https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2301.02288
Note that, despite the exciting names of some of these papers, and the promising directions they push, they have not yet achieved large scale usable versions of what they’re building. Nevertheless I’m quite excited about the direction they’re working and think more folks should think about how to do this sort of formal verification of generalization—it’s a fundamentally difficult problem that I expect to be quite possible to succeed at eventually!
I do agree abstractly that the difficulty is how to be sure that arbitrarily intense capability boosts retain the moral generalization. The problem is how hard to achieve that is.