My impression is that most people around here aren’t especially worried about GPT-n either being capable of recursive self-improvement leading to foom or attaining morally significant levels of consciousness.
Reasons given include:
GPT has a large number of parameters with a shallow layer-depth, meaning it is incapable of “deep” reasoning.
GPT’s training function “predict the next character” makes it unlikely to make a “treacherous turn”.
GPT is not “agenty” in the sense of having a model of the world and viewing itself as existing within that model.
On the other hand, I believe it is widely agreed that if you take a reinforcement learner (say Google’s Dreamer) and give it virtually any objective function (the classic example being “make paperclips”) and enough compute, it will destroy the world. The general reason given is Goodhart’s Law.
My question is, does this apparent difference in perceived safety arise purely from our expectations of the two architectures’ capabilities? Or is there actually some consensus that different architectures carry inherently different levels of risk?
Question
To make this more concrete, suppose you were presented with two “human level” AGIs, one built using GPT-n (say using this method) and one built using a Reinforcement Learner with a world-model and some seemingly innocuous objective function (say “predict the most human-like response to your input text”).
Pretend you have both AGIs in separate boxes in front of you, along with a complete diagram of their software and hardware, and you communicate with them solely using a keyboard and text terminal attached to the box. Both of the AGIs are capable of carrying on a conversation at a level equal to a college-educated human.
If, using all the testing methods at your disposal, you perceived these two AGIs to be equally “intelligent”, would you consider one more dangerous than the other?
Would you consider one of them to be more likely to be conscious than the other?
What significant moral or safety questions about these two AGIs would you have different answers for (if any)?
Application
Suppose that you consider these two AGIs equally dangerous. Then the alignment problem mostly boils down to the correct choice of objective function.
If, on the other hand, there are widely agreed upon differences in the safety levels of different architectures, then AI safety should focus quite heavily on finding and promoting the safest architectures.
Yes, architecture matters. We don’t know how likely each architecture is to produce a rogue agent, but we have subjective expectations, and it would be quite a coincidence if they were the same in every case. For example, if an architecture easily solves a given task, it needs to be scaled up less, and then I expect less opportunity for a mesa-optimizer to arise. Mixture of Experts is riskier if the chance that an expert of a given size will go rogue is astronomically close to neither 0 nor 1; it is less risky if that chance grows fast enough with size. Of course, it’s a shaky assumption that splitting a large network into experts will render them unable to jointly form a rogue agent. My object-level arguments might be wrong, but our risk mitigation strategies should not disregard the question of architecture entirely.
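To make the Mixture of Experts comparison a bit more concrete, here is a rough sketch under assumptions that are not in the comment itself: a dense network of size N goes rogue with some probability p(N), the MoE version splits it into k experts of size N/k, and each expert goes rogue independently with probability p(N/k). The independence assumption is exactly the shaky one mentioned above, so treat this as an illustration rather than an estimate.

```latex
% Hypothetical symbols: N = dense network size, k = number of experts,
% p(s) = chance that a network (or expert) of size s goes rogue.
\[
P_{\text{dense}} = p(N), \qquad
P_{\text{MoE}} = 1 - \bigl(1 - p(N/k)\bigr)^{k} \;\le\; k\,p(N/k).
\]
% Case 1: p barely depends on size, so p(N/k) \approx p(N) = q, with q near neither 0 nor 1.
\[
P_{\text{MoE}} = 1 - (1 - q)^{k} \;>\; q = P_{\text{dense}}
\qquad (k \ge 2,\ 0 < q < 1).
\]
% Case 2: p grows fast enough with size that p(N/k) < p(N)/k.
\[
P_{\text{MoE}} \;\le\; k\,p(N/k) \;<\; p(N) = P_{\text{dense}}.
\]
```

In the first case splitting gives the risk k chances to materialize; in the second, each expert is so much safer than the full network that even k of them together stay below the dense risk.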
Since everyone’s talking sensibly about capabilities / safety, let me talk insensibly about consciousness.
Sometimes when we ask about consciousness, we just mean a sort of self-modeling or self-awareness. Does this thing represent itself in the model of the world? Is it clever and responsive to us interacting with it?
But I’m going to assume you mean something more like “if I want to care about humans even if their bodies are different shapes, should I also care about these things?” Or “Is there something it is like to be these things?”
When we wonder about AI’s consciousness (or animal consciousness, or aliens, or whatever), there is no simple physical property that is “the thing” we are asking about. Instead, we are asking about a whole mixed-up list of different things we care about and that all go together in non-brain-damaged humans, but don’t have to go together in AIs. To look at a tiny piece of the list, my pain response involves unconscious responses of my body (e.g. reducing blood flow to extremities) that I then perceive, associations in my thoughts with other painful events or concepts of damage or danger, reflexes to say “ow” or swear or groan, particular reflexive facial expressions, trying to avoid the painful stimulus, difficulty focusing on other things, etc.
These things usually go together in me and in most humans, but we might imagine a person who has some parts of pain but not others. For example, let’s just say they have the first half of my list but not the second half: their body unconsciously goes into fight-or-flight mode, they sense that something is happening and associate it with examples of danger or damage, but they have no reflex to say “ow” or look pained, they don’t feel an urge to avoid the stimulus, and they suffer no more impediment to thinking clearly while injured than you do when you see the color red. It’s totally reasonable to say that this person “doesn’t really feel pain,” but the precise flavor of this “not really” is totally different than the way in which a person under general anesthesia doesn’t really feel pain.
If we lose just a tiny piece of the list rather than half of it, the change is small, and we’d say we still “really” feel pain but maybe in a slightly different way. Similarly, if we lost our sense of pain we’d still feel that we were “really” conscious, if with a slightly different flavor. This is because pain is just a tiny part of what goes together in consciousness—if we also lost how our expectations color what objects we recognize in vision, how we store and recall concepts from memory, how we feel associations between our senses, how we have a sense of our own past, and a dozen other bits of humanity, then we’d be well into the uncanny valley. (Or we don’t really have to lose these things, we just have to lose the way that they go together in humans, just like how the person missing half of the parts of their pain response doesn’t start getting points back if they say “ow” but at times uncorrelated with them being stabbed.)
Again, I need to reiterate that there is nothing magical about this list of functions and feelings, nothing that makes it a necessary list-of-things-that-go-together, it’s just some things that happen to form a neat bundle in humans. But also there’s nothing wrong with caring about these things! We’re not doing physics here, you can’t automatically get better definitions of words by applying Occam’s Razor and cutting out all the messy references to human nature.
Because the notion of consciousness has our own anthropocentric perspective so baked into it, any AI not specially designed to have all the stuff we care about will almost surely be missing many parts of the list, and be missing many human correlations between parts.
So, to get around to the question: neither of these AIs will be conscious in the sense we care about. The person who only has half the correlates of pain is astronomically closer to feeling pain than these things are to being conscious. The question is not “what is the probability they’re as conscious as you or I?” (since that’s 0.0). The question is what degree of consciousness they have: how human-like are their pieces, arranged in how recognizable a way?
Yet after all this preamble, I’m not really sure which I’d pick to be more conscious. Perhaps for most architectures of the RL agent it’s actually less conscious, because it’s more prone to learn cheap and deceptive tricks rather than laboriously imitating the human reasoning that produces the text. But this requires us to think about how we feel about the human-ness of GPT-n, which even if it simulates humans seems like it simulates too many humans, in a way that destroys cognitive correlations present in an individual.
In the paper “Reward is Enough”, it is argued that all AI is really RL, and that loss is the reward. This means that a language model has a goal function to predict the next word in a text. By this reasoning, your human-level RL system should be equivalent to your GPT-n system.
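As a minimal sketch of the “loss is the reward” reading (my own toy framing, not code from the paper): treat each next-token prediction as a one-step episode whose reward is the negative cross-entropy loss. The function names and the three-token vocabulary below are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(logits, target_index):
    """Standard language-model loss: -log p(true next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

def reward(logits, target_index):
    """Assumed RL framing: reward is just the negated loss."""
    return -cross_entropy_loss(logits, target_index)

# Hypothetical scores over a three-token vocabulary; token 0 is the true next token.
logits = [2.0, 0.5, -1.0]
target = 0

loss = cross_entropy_loss(logits, target)
r = reward(logits, target)
print(f"loss = {loss:.4f}, reward = {r:.4f}")
```

Under this framing, minimizing the loss and maximizing the reward are the same optimization problem. What it leaves out is anything multi-step: the “episode” ends after one token, and the model’s output never influences its next input during training.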
That said, my intuition tells me there should be some fundamental difference. It always seemed to me that NLP is the light side of the force and RL is the dark side. Giving AI a numerical goal? That’s how you get paperclips. Giving AI the ability to understand all of human thought and wisdom? That sounds like a better idea.
To give a model of how things could go wrong in your hypothetical, suppose that the RL system was misaligned in such a way that, when you give it a goal function like “predict the next word”, it builds a model of the entire planet and all of human society, and then conquers the world to get as much computing power as possible, all because it wants to be 99.9999% sure rather than 99.99% sure that it will predict the next word correctly. A GPT-n system is more chill: it wants to get the next word correct, but that’s not a goal, more like an instinct.
However, I think you’re likely to be tempted to put a layer of RL on top of your GPT-n so it can act like an agent, and then we’re back where we started.
I suspect the difference is mostly in what training opportunities are available, not what type of system is used internally.
In principle, a strong NLP AI might learn some behaviour that manipulates humans. It’s just that in practice it is more difficult for it to do so, because in almost all of the training phase there is no interaction at all. The input is decoupled from its output, so there is no training signal to improve any ability to manipulate the input.
In reality there are some side-channels that are interactive, such as selection of fine-tuning training based on human evaluation. A sufficiently powerful system might be able to learn enough from that to manipulate the world, but it seems much less likely than some other type of system with more interactive learning doing it first.