I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you voice that worry out loud, that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”?
Not necessarily a general deception shard; it just spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life. Whether that’s deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You’ve got to propose a system of write access, and a training process that uses this write access to do what you are proposing.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you voice that worry out loud, that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say “RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation.” And then, of course, we should compare the predicted alignment concerns in people, with the observed alignment situation, and update accordingly. I’ve updated down hard on alignment difficulty when I’ve run this exercise in the past.
Not necessarily a general deception shard; it just spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life.
I don’t see why this is worth granting the connotations we usually associate with “deception”.
I think that if the AI just repeats “I saved someone’s life” without that being true, we will find out and stop rewarding that?
Unless the AI didn’t just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it’s already deceptive in a much worse way.
But somehow the AI has to get to that cognitive state, first. I think it’s definitely possible, but not at all clearly the obvious outcome.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You’ve got to propose a system of write access, and a training process that uses this write access to do what you are proposing.
That wasn’t the part of the post I meant to point to. I was saying that just because we externally observe something we would call “deception/misleading task completion” (e.g. getting us to reward the AI for prosociality), does not mean that “deceptive thought patterns” get reinforced into the agent! The map is not the territory of the AI’s updating process. The reward will, I think, reinforce and generalize the AI’s existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don’t necessarily have anything to do with explicit deception (as you noted).
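To make the “map is not the territory” point concrete, here is a minimal sketch (my own illustration, not from the original post or comments) of a REINFORCE-style update on a toy policy. Everything in it is hypothetical: the tiny `logits` policy, the action indices, the hand-set reward. The point is only that the update consumes the sampled action and the scalar reward, and strengthens whatever parameters produced that action; our external label of the behavior as “deceptive” or “prosocial” never enters the computation.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny "policy" over 3 actions; a stand-in for whatever
# internal circuits actually produced the judged-prosocial output.
logits = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def update(action: int, reward: float) -> None:
    """One REINFORCE step: push up the log-prob of the rewarded action."""
    log_probs = torch.log_softmax(logits, dim=0)
    loss = -reward * log_probs[action]  # the reward scalar is the only signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Suppose action 2 is the output we (erroneously) judged prosocial.
# Credit assignment only ever sees (action, reward), not our story about it.
for _ in range(20):
    update(action=2, reward=1.0)

print(torch.softmax(logits, dim=0))  # probability mass shifts toward action 2
```

So when reward is handed out for a judged-prosocial output, what gets generalized is whatever circuitry actually emitted that output, which may or may not involve anything like explicit deception.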
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Is a donut “unaligned by default”? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I’m not assuming the model starts out deceptive, nor that it will become that with high probability. That’s one question I’m trying to figure out with fresh eyes.