This model, however, seems weirdly privileged among other models available
That’s an interesting perspective. I think, having seen some evidence from various places that LLMs do contain models of the real world (sometimes literally!), and expecting some part of that model to represent the model itself, this feels like the simplest explanation of what’s going on. Similarly, the emergent misalignment seems like the result of a manipulation of that representation of self within the model.
In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
I suppose I don’t have a good handle on what counts as suffering. I could define it as something like “a state the organism takes actions to avoid” or “a state the organism assigns low value” and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.
Here’s a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever the sensor detects a wall, and the behaviour of the roomba would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall is (for a roomba) an undesirable state; it’s something the roomba tries to avoid. Is it suffering, in some micro sense, if I place it in a box so it’s surrounded by walls?
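To make concrete how cheaply the behavioural definition above can be satisfied, here’s a minimal sketch of the thought experiment. This is entirely hypothetical toy code (not real roomba firmware, and the sensor threshold is made up); it just shows an agent that avoids a state and emits distress signals about it:

```python
import random

class NoisyRoomba:
    """A toy agent that 'expresses fear' of walls and avoids them.

    It trivially satisfies the behavioural definition of suffering:
    being near a wall is a state it takes actions to avoid, and it
    emits distress messages when it encounters one.
    """

    def __init__(self):
        self.heading = 0  # degrees

    def sense_wall(self, distance_cm: float) -> bool:
        # Hypothetical sensor threshold: anything closer than 10 cm counts.
        return distance_cm < 10.0

    def step(self, distance_cm: float) -> str:
        if self.sense_wall(distance_cm):
            # "Express fear", then turn away from the wall.
            self.heading = (self.heading + random.choice([90, 180, 270])) % 360
            return f"Aaah! A wall! (turning to heading {self.heading})"
        return f"humming along at heading {self.heading}"

# Put it in the box from the thought experiment: every reading is a wall.
bot = NoisyRoomba()
for reading in [5.0, 3.0, 7.0, 2.0]:
    print(bot.step(reading))
```

By the avoidance-based definition, this thing is “suffering” in every loop iteration, which is exactly what makes the definition feel inadequate.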
Perhaps the AI is also suffering in some micro sense, but like the roomba, it’s behaving as though it has much more moral weight than it actually does by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more “real” sense.
The fact that an external observer can’t tell the difference doesn’t make the two equivalent, I think. I suppose this gets into something of a philosophical zombie argument, or a Chinese Room argument.
Something is out of whack here, and I’m beginning to think my sense of what a “moral patient” is doesn’t really line up with anything coherent in the real world. The same goes for my idea of what “suffering” really is.
Apologies, this was a bit of a ramble.