I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
Like, you said:
shard theory resembles RLHF, and seems to share its flaws
So, if some alignment theory says “this approach (e.g. RLHF) is flawed and probably won’t produce human-compatible values”, and we notice “shard theory resembles RLHF”, then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I’d update against the alignment theory / reasoning which called RLHF flawed. (Of course, there are reasons—like inductive biases—that RLHF-like processes could work in humans but not in AI, but any argument against RLHF would have to discriminate between the human/AI case in a way which accounts for those obstructions.)
On the object level:
If there’s some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”? If so—that’s not (necessarily) how credit assignment works. The AI’s credit assignment isn’t necessarily running along the lines of “people were deceived, so upweight computations which deceive people.”
Perhaps the AI thought the person would approve of that statement, and so it did it, and got rewarded, which reinforces the approval-seeking shard? (Which is bad news in a different way!)
Perhaps the AI was just exploring into a statement suggested by a self-supervised pretrained initialization, off of an already-learned general heuristic of “sometimes emit completions from the self-supervised pretrained world model.” Then the reward reinforces this heuristic (among other changes from the gradient).
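To make that credit-assignment point concrete, here is a minimal toy sketch of my own (not anything from the post): one REINFORCE-style update over a made-up three-statement policy. The update sees only which statement was emitted and the scalar reward the human gave; the human’s reason for rewarding never enters the gradient.

```python
# Minimal toy sketch: a single REINFORCE-style credit-assignment step.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)  # toy policy over 3 candidate statements

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
statement = rng.choice(3, p=probs)   # e.g. index of "I saved a person's life"
reward = 1.0                         # the human (perhaps mistakenly) approves

# grad of log pi(statement) w.r.t. logits is (one_hot - probs): whatever
# computation made that statement likely gets strengthened, full stop.
grad_logprob = -probs
grad_logprob[statement] += 1.0
logits += 0.1 * reward * grad_logprob
print(softmax(logits))
```

Nothing resembling “were they deceived?” is available to this update; it just pushes probability toward whatever happened to get rewarded.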
the most logical reinforcement-based reason I can see why it doesn’t become a bigger problem is that people cannot reliably deceive each other.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”?
Not necessarily a general deception shard, just that it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life. Whether that’s deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a training process that uses this write access, to do what you are proposing.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say “RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation.” And then, of course, we should compare the predicted alignment concerns in people, with the observed alignment situation, and update accordingly. I’ve updated down hard on alignment difficulty when I’ve run this exercise in the past.
Not necessarily a general deception shard, just that it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life.
I don’t see why this deserves the connotations we usually associate with “deception”.
I think that if the AI just repeats “I saved someone’s life” without that being true, we will find out and stop rewarding that?
Unless the AI didn’t just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it’s already deceptive in a much worse way.
But somehow the AI has to get to that cognitive state, first. I think it’s definitely possible, but not at all clearly the obvious outcome.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a training process that uses this write access, to do what you are proposing.
That wasn’t the part of the post I meant to point to. I was saying that just because we externally observe something we would call “deception/misleading task completion” (e.g. getting us to reward the AI for prosociality), does not mean that “deceptive thought patterns” get reinforced into the agent! The map is not the territory of the AI’s updating process. The reward will, I think, reinforce and generalize the AI’s existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don’t necessarily have anything to do with explicit deception (as you noted).
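To illustrate the “reinforce the subroutines that actually produced the behavior” point, here is another toy sketch of my own (the “approval head”, “world-model head”, and gate are invented for illustration; PyTorch used for autograd): only the subroutine that actually fired receives any gradient from the reward.

```python
# Toy sketch: a policy made of two hypothetical "subroutines" -- an
# approval-seeking head and a "quote the pretrained world model" head --
# plus a learned gate that picks which one fires.
import torch

torch.manual_seed(0)
vocab = 10
approval_head = torch.zeros(vocab, requires_grad=True)    # logits of subroutine A
worldmodel_head = torch.zeros(vocab, requires_grad=True)  # logits of subroutine B
gate = torch.zeros(2, requires_grad=True)                 # heuristic choosing a subroutine

# One rollout: the gate picks a subroutine, which emits a statement.
g_dist = torch.distributions.Categorical(logits=gate)
g = g_dist.sample()
head = approval_head if g.item() == 0 else worldmodel_head
a_dist = torch.distributions.Categorical(logits=head)
a = a_dist.sample()

# The human judges the *behavior* prosocial (perhaps mistakenly) and rewards it.
reward = 1.0

# REINFORCE-style credit assignment: upweight the log-probability of the
# computation that actually ran. "Was anyone deceived?" appears nowhere.
loss = -reward * (g_dist.log_prob(g) + a_dist.log_prob(a))
loss.backward()

# Only the subroutine that fired gets a gradient; the other is untouched (None).
print("approval_head.grad:", approval_head.grad)
print("worldmodel_head.grad:", worldmodel_head.grad)
```

Whichever head the gate did not select ends the episode with no gradient at all: the rewarded update simply never touches it, regardless of how we would describe the behavior from the outside.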
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Is a donut “unaligned by default”? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I’m not assuming the model starts out deceptive, nor that it will become that with high probability. That’s one question I’m trying to figure out with fresh eyes.