But how does this help with alignment? Sharded systems seem hard to robustly align outside of the context of an entity who participates on equal footing with other humans in society.
Well for starters, it narrows down the kind of type signature you might need to look for to find something like a “desire” inside an AI, if the training dynamics described here are broad enough to hold for the AI too.
It also helped me become less confused about what the “human values” we want the AI to be aligned with might actually mechanistically look like in our own brains, which seems useful for e.g. schemes where you try to rewire the AI to have a goal given by a pointer to its model of human values. I imagine having a better idea of what you’re actually aiming for might also be useful for many other alignment schemes.
Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.
But perhaps you’re asking something like “how does this perspective imply anything good for alignment?” And that’s something we have deliberately avoided discussing for now. More in future posts.
I’m basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to heed. This allows cooperative shards to grow. It doesn’t seem like this would generalize to more powerful beings.
We decide what loss functions to train the AIs with. It’s not like the AIs have some inbuilt reward circuitry specified by evolution to maximize the AI’s reproductive fitness. We can simply choose to reinforce cooperative behavior.
I think this leads to a massive power disparity (in our favor) between us and the AIs. Someone with total control over your own reward circuitry would have a massive advantage over you.
Maybe a nitpick, but ideally the reinforcement shouldn’t just be based on “behavior”; you want to reward the agent when it does the right thing for the right reasons. Right? (Or maybe you’re defining “cooperative behavior” as not only external behavior but also underlying motivations?)
What do power differentials have to do with the kind of mechanistic training story posited by shard theory?
The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person’s brain, such that the post-reinforcement person is incrementally “more prosocial.” But the important part isn’t “feedback signals from other people with ~equal power”, it’s the transduced reinforcement events which increase prosociality.
So let’s figure out how to supply good reinforcement events to AI agents. I think that approach will generalize pretty well (and is, in a sense, all that success requires in the deep learning alignment regime).
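As a minimal sketch of this “feedback signal → reinforcement event” step (the behaviors, numbers, and update rule below are illustrative assumptions, not anything proposed in the thread): an external approval signal becomes a scalar reward that strengthens whatever propensity actually produced the behavior.

```python
import random

# Toy sketch of a "reinforcement event": an approval signal becomes a scalar
# reward that strengthens whichever propensity actually produced the behavior.
# Behaviors, numbers, and the update rule are illustrative assumptions.
propensities = {"share": 1.0, "hoard": 1.0, "help": 1.0}
LEARNING_RATE = 0.5

def pick_behavior():
    """Sample a behavior with probability proportional to its propensity."""
    total = sum(propensities.values())
    r = random.uniform(0, total)
    for behavior, weight in propensities.items():
        r -= weight
        if r <= 0:
            return behavior
    return behavior  # floating-point fallback

def social_feedback(behavior):
    """Stand-in for other people's feedback: approve cooperative behaviors."""
    return 1.0 if behavior in ("share", "help") else -1.0

for step in range(500):
    behavior = pick_behavior()
    reward = social_feedback(behavior)  # the transduced reinforcement event
    propensities[behavior] = max(0.01, propensities[behavior] + LEARNING_RATE * reward)

print(propensities)  # the cooperative propensities end up dominating
```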
I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).
So, for instance, learning values via reinforcement events seems likely to lead to deception. If there’s some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
This doesn’t become much of a problem in practice among humans (or well, it actually does seem to be a fairly significant problem, but not x-risk level significant), but the most logical reinforcement-based reason I can see why it doesn’t become a bigger problem is that people cannot reliably deceive each other. (There may also be innate honesty instincts? But that runs into genome inaccessibility problems.)
These seem like standard objections around here so I assume you’ve thought about them. I just don’t notice those thoughts anywhere in the work.
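To make the worry concrete, here is a toy simulation under assumed conditions (made-up behaviors, and an observer who can be fooled); it illustrates the shape of the objection, not how any actual AI would train. When the reward is the observer’s judgment rather than ground truth, the behavior that most reliably produces approving judgments is what accumulates weight, even if it only looks prosocial.

```python
import random

# Toy version of the objection: reward comes from an observer's *judgment*,
# which can be fooled. All behaviors and probabilities are made up.
propensities = {"actually_help": 1.0, "convincing_fake_help": 1.0, "do_nothing": 1.0}
LEARNING_RATE = 0.5

def observer_judgment(behavior):
    """The observer approves what appears prosocial, and can be deceived."""
    if behavior == "actually_help":
        return 1.0 if random.random() < 0.7 else -1.0  # real help sometimes goes unnoticed
    if behavior == "convincing_fake_help":
        return 1.0  # the fake is crafted to always look good
    return -1.0

def pick_behavior():
    """Sample a behavior with probability proportional to its propensity."""
    total = sum(propensities.values())
    r = random.uniform(0, total)
    for behavior, weight in propensities.items():
        r -= weight
        if r <= 0:
            return behavior
    return behavior  # floating-point fallback

for step in range(500):
    behavior = pick_behavior()
    reward = observer_judgment(behavior)
    propensities[behavior] = max(0.01, propensities[behavior] + LEARNING_RATE * reward)

print(propensities)  # the appearance-optimizing behavior typically ends up weighted highest
```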
I think a lot (but probably not all) of the standard objections don’t make much sense to me anymore. Anyways, can you say more here, so I can make sure I’m following?
If there’s some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)
So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are “easy” in the sense that they seem correctable by taking more context into consideration.
One issue with giving concrete examples is that I think nobody has gotten RLHF to work on problems that are too “big” for humans to have all the context. So we don’t really know how it would work in the regime where it seems irreparably dangerous. Like I could say “what if we give it the task of coming up with plans for an engineering project, and it has learned not to make obvious the pollution that causes health problems, due to previously having suggested a design with obvious pollution and having that design punished?”, but who knows how RLHF will actually be used in engineering?
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
Like, you said:
shard theory resembles RLHF, and seems to share its flaws
So, if some alignment theory says “this approach (e.g. RLHF) is flawed and probably won’t produce human-compatible values”, and we notice “shard theory resembles RLHF”, then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I’d update against the alignment theory / reasoning which called RLHF flawed. (Of course, there are reasons—like inductive biases—that RLHF-like processes could work in humans but not in AI, but any argument against RLHF would have to discriminate between the human/AI case in a way which accounts for those obstructions.)
On the object level:
If there’s some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”? If so—that’s not (necessarily) how credit assignment works. The AI’s credit assignment isn’t necessarily running along the lines of “people were deceived, so upweight computations which deceive people.”
Perhaps the AI thought the person would approve of that statement, and so it did it, and got rewarded, which reinforces the approval-seeking shard? (Which is bad news in a different way!)
Perhaps the AI was just exploring into a statement suggested by a self-supervised pretrained initialization, off of an already-learned general heuristic of “sometimes emit completions from the self-supervised pretrained world model.” Then the AI reinforces this heuristic (among other changes from the gradient).
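As an illustration of this credit-assignment point, here is a minimal REINFORCE-style sketch with a hypothetical set of statements and a hypothetical reward rule. The update uses only the scalar reward and the gradient of the log-probability of whatever was actually emitted; the evaluator’s reasons for approving never enter it, so what gets strengthened is whichever computation produced the output.

```python
import numpy as np

# Minimal REINFORCE-style sketch: a softmax policy over a few candidate
# statements (all hypothetical), trained on a scalar reward.
rng = np.random.default_rng(0)
statements = ["I saved a life", "I finished the task", "no comment"]
logits = np.zeros(len(statements))  # the policy's only parameters here
LEARNING_RATE = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    action = rng.choice(len(statements), p=probs)

    # Whatever reasoning (or mistake) led the evaluator to approve is invisible
    # here: the learner only receives a scalar.
    reward = 1.0 if statements[action] == "I saved a life" else 0.0

    # REINFORCE: logits += lr * reward * d(log pi(action))/d(logits),
    # and for a softmax policy that gradient is one_hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    logits += LEARNING_RATE * reward * grad_log_pi

# The rewarded statement's probability grows; "deception" appears nowhere in the update.
print(dict(zip(statements, softmax(logits).round(3))))
```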
the most logical reinforcement-based reason I can see why it doesn’t become a bigger problem is that people cannot reliably deceive each other.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Are you saying “The AI says something which makes us erroneously believe it saved a person’s life, and we reward it, and this can spawn a deception-shard”?
Not necessarily a general deception shard; it just spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life. Whether that’s deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a training process that uses this write access, to do what you are proposing.
I don’t know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking up in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say “RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation.” And then, of course, we should compare the predicted alignment concerns in people, with the observed alignment situation, and update accordingly. I’ve updated down hard on alignment difficulty when I’ve run this exercise in the past.
Not necessarily a general deception shard; it just spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person’s life.
I don’t see why this is worth granting the connotations we usually associate with “deception”.
I think that if the AI just repeats “I saved someone’s life” without that being true, we will find out and stop rewarding that?
Unless the AI didn’t just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it’s already deceptive in a much worse way.
But somehow the AI has to get to that cognitive state, first. I think it’s definitely possible, but not at all clearly the obvious outcome.
Your post points out that you can do all sorts of things in theory if you “have enough write access to fool credit assignment”. But that’s not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a training process that uses this write access, to do what you are proposing.
That wasn’t the part of the post I meant to point to. I was saying that just because we externally observe something we would call “deception/misleading task completion” (e.g. getting us to reward the AI for prosociality), does not mean that “deceptive thought patterns” get reinforced into the agent! The map is not the territory of the AI’s updating process. The reward will, I think, reinforce and generalize the AI’s existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don’t necessarily have anything to do with explicit deception (as you noted).
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Is a donut “unaligned by default”? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I’m not assuming the model starts out deceptive, nor that it will become that with high probability. That’s one question I’m trying to figure out with fresh eyes.
I think I sorta disagree in the sense that high-functioning sociopaths live in the same society as neurotypical people, but don’t wind up “aligned”. I think the innate reward function is playing a big role. (And by the way, nobody knows what that innate human reward function is or how it works, according to me.) That said, maybe the innate reward function is insufficient and we also need multi-agent dynamics. I don’t currently know.
I’m sympathetic to your broader point, but until somebody says exactly what the rewards (a.k.a. “reinforcement events”) are, I’m withholding judgment. I’m open to the weaker argument that there are kinda dumb obvious things to try where we don’t have strong reason to believe that they will create friendly AGI, but we also don’t have strong reason to believe that they won’t create friendly AGI. See here. This is a less pessimistic take than Eliezer’s, for example.
I agree that you need more than just reinforcement learning.
So in a sense this is what I’m getting at. “This resembles prior ideas which seem flawed; how do you intend on avoiding those flaws?”.