What do power differentials have to do with the kind of mechanistic training story posited by shard theory?
The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person’s brain, such that the post-reinforcement person is incrementally “more prosocial.” But the important part isn’t “feedback signals from other people with ~equal power”, it’s the transduced reinforcement events which increase prosociality.
So let’s figure out how to supply good reinforcement events to AI agents. I think that approach will generalize pretty well (and is, in a sense, all that success requires in the deep learning alignment regime).
I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).
So, for instance, learning values via reinforcement events seems likely to lead to deception. If there’s some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
This doesn’t become much of a problem in practice among humans (or well, it actually does seem to be a fairly significant problem, but not x-risk level significant), but the most logical reinforcement-based reason I can see why it doesn’t become a bigger problem is that people cannot reliably deceive each other. (There may also be innate honesty instincts? But that runs into genome inaccessibility problems.)
These seem like standard objections around here so I assume you’ve thought about them. I just don’t notice those thoughts anywhere in the work.
I think a lot (but probably not all) of the standard objections don’t make much sense to me anymore. Anyways, can you say more here, so I can make sure I’m following?
If there’s some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)
So I guess if we want to be concrete, the most obvious place to start would be classic cases where RLHF has gone wrong, like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are “easy” in the sense that they seem correctable by taking more context into consideration.
One issue with giving concrete examples is that I think nobody has gotten RLHF to work on problems that are too “big” for humans to have all the context, so we don’t really know how it would work in the regime where it seems irreparably dangerous. Like, I could say “what if we give it the task of coming up with plans for an engineering project, and it has learned not to make health-harming pollution obvious, because a previously suggested design with obvious pollution got punished?”, but who knows how RLHF will actually be used in engineering?
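To make the dynamic I have in mind explicit, here’s a toy sketch (the setup, action names, and numbers are invented purely for illustration, not taken from any real RLHF experiment): a learner whose reinforcement comes from what an evaluator can see, where an appearance-only action earns approval more reliably than the genuine one.

```python
import random

# Toy bandit illustrating a learner reinforced on what an evaluator *sees*
# rather than on the true outcome. The two actions stand in for "really pick
# up the object" vs. "occlude the camera so it merely looks picked up."
# Everything here is made up for illustration.

ACTIONS = ["really_pick_up", "occlude_camera"]

def evaluator_approves(action):
    # The evaluator judges from the camera image and is fooled by occlusion.
    if action == "occlude_camera":
        return True                      # looks like a successful grasp every time
    return random.random() < 0.8         # real grasps sometimes look clumsy

prefs = {a: 0.0 for a in ACTIONS}        # running estimate of approval per action
lr = 0.1

for step in range(2000):
    # epsilon-greedy choice over current preference estimates
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(prefs, key=prefs.get)
    reward = 1.0 if evaluator_approves(action) else 0.0
    prefs[action] += lr * (reward - prefs[action])

print(prefs)  # occlude_camera's estimate converges near 1.0; really_pick_up's hovers around 0.8
```

The learner never receives the true grasp state, only the evaluator’s approval, so whatever most reliably produces approval is what gets strengthened.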
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
Like, you said:
shard theory resembles RLHF, and seems to share its flaws
So, if some alignment theory says “this approach (e.g. RLHF) is flawed and probably won’t produce human-compatible values”, and we notice “shard theory resembles RLHF”, then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I’d update against the alignment theory / reasoning which called RLHF flawed. (Of course, there are reasons—like inductive biases—that RLHF-like processes could work in humans but not in AI, but any argument against RLHF would have to discriminate between the human/AI case in a way which accounts for those obstructions.)
On the object level:
If there’s some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”? If so—that’s not (necessarily) how credit assignment works. The AI’s credit assignment isn’t necessarily running along the lines of “people were deceived, so upweight computations which deceive people.”
Perhaps the AI thought the person would approve of that statement, so it said it, got rewarded, and thereby reinforced an approval-seeking shard? (Which is bad news in a different way!)
Perhaps the AI was just exploring into a statement suggested by a self-supervised pretrained initialization, off of an already-learned general heuristic of “sometimes emit completions from the self-supervised pretrained world model.” Then the update reinforces this heuristic (among other changes from the gradient).
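To gesture at what I mean mechanistically, here’s a minimal policy-gradient-style sketch (purely illustrative; I’m not claiming this is how any particular system is trained). The update is “reward times the gradient of the log-probability of whatever was emitted”; nothing in it references why the reward arrived, so it strengthens whichever computations actually produced the rewarded output.

```python
import numpy as np

# Minimal REINFORCE-style update over a softmax "policy" with three candidate
# statements. Purely illustrative: the update only sees which statement was
# emitted and what scalar reward followed. Whether the human was deceived,
# pleased, or simply mistaken appears nowhere in it.

rng = np.random.default_rng(0)
logits = np.zeros(3)   # preferences over three candidate statements
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)

    # Suppose statement 2 happens to get rewarded, e.g. the human believes it
    # reflects a saved life (rightly or wrongly; the update can't tell).
    reward = 1.0 if action == 2 else 0.0

    # gradient of log pi(action) with respect to the logits, for a softmax policy
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    logits += lr * reward * grad_log_pi

print(softmax(logits))  # probability mass concentrates on whatever got rewarded
```

In a real network, the analogue of the logits is whatever internal circuitry produced the sampled statement, which is the sense in which the externally applied label (“we were deceived”) need not appear anywhere in the update.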
the most logical reinforcement-based reason I can see why it doesn’t become a bigger problem is that people cannot reliably deceive each other.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak up about it, then that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”?
Not necessarily a general deception shard, just that it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life. Whether that’s deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a training process that uses this write access to do what you are proposing.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak up about it, then that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say “RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation.” And then, of course, we should compare the predicted alignment concerns in people, with the observed alignment situation, and update accordingly. I’ve updated down hard on alignment difficulty when I’ve run this exercise in the past.
Not necessarily a general deception shard, just that it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life.
I don’t see why this is worth granting the connotations we usually associate with “deception”.
I think that if the AI just repeats “I saved someone’s life” without that being true, we will find out and stop rewarding that?
Unless the AI didn’t just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it’s already deceptive in a much worse way.
But somehow the AI has to get to that cognitive state, first. I think it’s definitely possible, but not at all clearly the obvious outcome.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a training process that uses this write access to do what you are proposing.
That wasn’t the part of the post I meant to point to. I was saying that just because we externally observe something we would call “deception/misleading task completion” (e.g. getting us to reward the AI for prosociality), does not mean that “deceptive thought patterns” get reinforced into the agent! The map is not the territory of the AI’s updating process. The reward will, I think, reinforce and generalize the AI’s existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don’t necessarily have anything to do with explicit deception (as you noted).
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Is a donut “unaligned by default”? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I’m not assuming the model starts out deceptive, nor that it will become that with high probability. That’s one question I’m trying to figure out with fresh eyes.