I’m confused about the example of Jim eating Brussels sprouts: I don’t see how it shows inner misalignment as written. I think a more complicated example will work; I try one out at the end of this comment. I’m interested in whether there’s something I’m missing about the Brussels sprouts example.
Basic value conflict
Jim’s “programmer” is Evolution. Evolution has the objective of maximizing inclusive genetic fitness. To do this you need to avoid eating poison before you make babies. For various reasons, directly providing the objective of maximizing inclusive genetic fitness to Jim as his fixed reward function does not maximize inclusive genetic fitness. Instead, Evolution gives Jim a fixed reward function consisting of two values: avoid bitter foods and keep your girlfriend happy. So we know there is outer misalignment, but that’s not what the example is about.
Jim goes into the world with this fixed value system and learns some facts about it:
Brussels sprouts are bitter.
Sam is my girlfriend.
Sam likes men who eat Brussels sprouts.
Jim weighs the conflict between the two values and decides whether to eat the Brussels sprouts. Which value wins on any particular day depends on Evolution’s choices about his fixed value system, plus facts about the world. This seems like normal inner alignment, working as designed. If Evolution wanted one value to always win in such a conflict, Evolution should weight the values appropriately.
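To make the conflict concrete, here is a minimal Python sketch; the weights, action names, and world-fact flags are all invented for illustration, not anything from the post being discussed:

```python
# A toy model of the two-value conflict: the fixed reward function scores an
# action, and which value "wins" depends on the weights Evolution chose plus
# the facts Jim has learned about the world. All names and numbers are made up.

WEIGHT_BITTER = -1.0       # innate penalty for eating bitter food (illustrative)
WEIGHT_GIRLFRIEND = 1.5    # innate reward for pleasing the girlfriend (illustrative)

world_facts = {
    "brussels_sprouts_are_bitter": True,
    "sam_is_my_girlfriend": True,
    "sam_likes_sprout_eaters": True,
}

def fixed_reward(action: str, facts: dict) -> float:
    """Score an action using Jim's two innate values."""
    reward = 0.0
    if action == "eat_sprouts":
        if facts["brussels_sprouts_are_bitter"]:
            reward += WEIGHT_BITTER
        if facts["sam_is_my_girlfriend"] and facts["sam_likes_sprout_eaters"]:
            reward += WEIGHT_GIRLFRIEND
    return reward

# Jim picks whichever action scores higher under the fixed reward function.
choice = max(["eat_sprouts", "refuse_sprouts"],
             key=lambda a: fixed_reward(a, world_facts))
print(choice)  # "eat_sprouts" with these weights; make WEIGHT_BITTER -2.0 and it flips
```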
Enter the learned value function
What does the learned value function add to this basic picture? While Jim is in the world, he updates his learned value function based on his fixed reward function. In particular, whenever Sam serves him Brussels sprouts, he learns a higher value for:
Being the sort of person who eats Brussels sprouts (especially when with a girlfriend)
Being the sort of person who has boundaries with his girlfriend (especially about food)
The question of where the learned value function stabilizes depends on Evolution’s choices and facts about the world. It’s a more complicated alignment story, but Jim is still aligned with Evolution. It’s just a tough break for Jim and Evolution alike that the only woman who loves Jim also wants him to eat Brussels sprouts. Anyway, Jim ends up becoming the sort of person who eats Brussels sprouts with his girlfriend and wants to be that sort of person, and they make a baby or two.
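Here is a rough sketch of the kind of stabilization I have in mind, assuming a simple moving-average style of update; the learning rate, reward numbers, and concept names are all made up:

```python
# A rough sketch of the learned value function drifting toward the rewards the
# fixed reward function keeps handing out. Learning rate, reward size, and
# concept names are invented for illustration.

LEARNING_RATE = 0.1

# Jim's learned values over abstract self-concepts, initially neutral.
learned_value = {
    "person_who_eats_sprouts_with_girlfriend": 0.0,
    "person_with_food_boundaries": 0.0,
}

def update(concept: str, reward: float) -> None:
    """Nudge the value of whichever concept was active toward the reward received."""
    learned_value[concept] += LEARNING_RATE * (reward - learned_value[concept])

# Suppose that at each dinner Jim eats the sprouts: the girlfriend-happiness
# reward outweighs the bitterness penalty (say net +0.5), and that credit lands
# on the self-concept that was active at the time.
for dinner in range(50):
    update("person_who_eats_sprouts_with_girlfriend", reward=0.5)

print(learned_value)
# The first value stabilizes near +0.5: Jim now wants to *be* a sprout-eater,
# even though sprouts appear nowhere in the fixed reward function itself.
```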
So far there’s nothing in this story that makes girlfriend-pleasing higher-order and bitter-avoiding first-order. They are both in the fixed reward function and they both have complicated downstream effects in his learned value function.
Farewell Sam
So one day Sam leaves Jim for a man with a more complex fixed reward system. Jim’s learned value function doesn’t just reset to a clean state when she leaves him (although his brain plasticity might increase for a while). So he continues to eat Brussels sprouts and to want to be the type of person who eats Brussels sprouts. A few months later he gets into a relationship with Tina, who apparently doesn’t care about Brussels sprouts. They are happy and make a baby or two.
Finally, some inner misalignment? Well, it looks like Jim is eating Brussels sprouts even though that does nothing to please Tina. Evolution thinks he is taking a risk of being poisoned for no benefit. So in that case there is some misalignment until Jim’s learned value function updates to reflect the new reality, but it’s only transient misalignment during learning. For their first anniversary, Tina and Jim go to a bar and Jim has his first-ever beer. It’s bitter! So again, Jim has some transient misalignment during learning and never drinks beer again.
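Continuing the invented numbers from the sketch above, the transience might look something like this:

```python
# Same made-up update rule as above. Once Sam is gone, eating sprouts only
# collects the bitterness penalty, so the value Jim carried over from the Sam
# era decays toward the new reward signal; the misalignment is real but transient.

LEARNING_RATE = 0.1
value_of_eating_sprouts = 0.5    # inherited from the Sam era

for dinner in range(50):
    reward = -1.0                # just the bitterness penalty now (illustrative)
    value_of_eating_sprouts += LEARNING_RATE * (reward - value_of_eating_sprouts)

print(round(value_of_eating_sprouts, 2))
# -0.99: after enough pointless sprout dinners, the learned value catches up
# with the fixed reward function and Jim stops eating them.
```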
But the learned value function could be doing its job correctly. Since Sam wanted him to eat Brussels sprouts, and Sam and Tina are both human females, it’s quite plausible that Tina actually wants him to eat Brussels sprouts too. The fixed reward functions of various humans are all set by Evolution and are strongly correlated. In fact, Evolution gave women a fixed reward function that values men who eat Brussels sprouts because that turns out to be the most efficient way to filter out men whose reward functions don’t value their girlfriend’s happiness, or whose value-learning system is sub-par.
Either way, I don’t see meaningful inner misalignment here.
Towards a better example
What if, instead of training Jim to eat Brussels sprouts, Sam had converted Jim to her belief system, “Effective Altruism”? EAs believe in maximizing global utility, which means avoiding meat, avoiding conspicuous consumption, and eating healthy foods that boost productivity. It turns out that Brussels sprouts are the perfect EA food! Also, EA is a coherent belief system, so EAs should not self-modify to stop being EA, because that would be value drift. Fortunately, value drift can be avoided by hanging out with other EAs, pledging to give effectively, and so forth. It’s possible that Sam hasn’t read all the Sequences and doesn’t have EA quite right.
In this hypothetical, Jim accumulates many updates to his learned value function, learned world-model, and learned planner/actor. Jim thinks differently after being with Sam; his old friends say it’s like he’s not even the same person. When Sam leaves Jim, Jim remains an EA. His fixed reward function would reward him for eating cookies, but he never buys them because they’re not nutritionally efficient. His fixed reward function would reward him for pleasing a girlfriend who wants him to kick puppies, but he doesn’t date those kinds of women because they are clearly evil, and dating evil women causes value drift. Jim ends up eating Brussels sprouts and not having a girlfriend, and Evolution realizes that its design needs some work.
I think this definitely works as an example of self-referential misalignment. It’s also an example of inner misalignment caused by parasitic (from Evolution’s perspective) memes. That’s going to be the most common example, I think. The misalignment is persistent and self-reinforcing, and that is much more likely to happen by memetic evolution than by random chance.
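To illustrate why this misalignment is persistent rather than transient, here is a sketch under the same invented update rule; the point is that the learned value only gets corrected on experiences Jim actually has, and his planner now steers him away from exactly those experiences:

```python
# Why the EA-meme misalignment is persistent rather than transient, under the
# same invented update rule: the learned value only gets corrected on experiences
# Jim actually has, and his planner now steers him away from exactly those
# experiences. The -0.8 value is made up; it came from the meme, not from reward.

LEARNING_RATE = 0.1
value_of_eating_cookies = -0.8   # installed by the EA belief system

for day in range(365):
    if value_of_eating_cookies > 0:   # the planner only takes actions it values
        reward = 1.0                  # the fixed reward function would enjoy cookies
        value_of_eating_cookies += LEARNING_RATE * (reward - value_of_eating_cookies)
    # otherwise Jim never buys cookies, so there is no reward and no correction

print(value_of_eating_cookies)   # still -0.8: the misalignment never gets fixed
```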
The way I would put (I think something like) your point is: If Jim winds up “wanting to be the kind of person who likes brussels sprouts”, we can ask: How did Jim wind up wanting that particular thing? The answer, one presumes, is that something in Jim’s reward function is pushing for it. I.e., something in Jim’s reward function painted positive valence onto the concept of “being the kind of person who likes brussels sprouts” inside Jim’s world-model.
Then we can ask the follow-up question: Was that the designer’s intention? If no, then it’s inner misalignment. If yes, then it’s not inner misalignment.
Actually, that last sentence is too simplistic. Suppose the designer’s intention is for Jim to dislike brussels sprouts, but also for Jim to want to be the kind of person who likes brussels sprouts. …I’m gonna stop right here and say: What on earth is the designer thinking here?? Why would they want that?? If Jim self-modifies to permanently like brussels sprouts from now on, was that the designer’s intention or not? I don’t know; it seems like the designer’s intentions here are weirdly incoherent, and maybe the designer ought to go back to the drawing board and stop trying to do things that are self-undermining. Granted, in the human case, there are social dynamics that lead to evolution wanting this kind of thing. But in the AGI case, I don’t see any reason for it. I think we should really be trying to design our AGIs such that they want to want the things that they want, which in turn are identical to the things that we humans want them to want.
Back to the other case where it’s obviously inner misalignment, because the designer both wanted Jim to dislike brussels sprouts, and wanted Jim to dislike being the kind of person who likes brussels sprouts, but nevertheless Jim somehow wound up wanting to be the kind of guy who likes brussels sprouts. Is there anything that could lead to that? I say: Yes! The existence of superstitions is evidence that people can wind up liking random things for no reason in particular. Basically, there’s a “credit assignment” process that links rewards to abstract concepts, and it’s a dumb noisy algorithm that will sometimes flag the wrong concept. Also, if the designer has other intentions besides brussels sprouts, there could be cross-talk between the corresponding rewards.
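Here is a toy sketch of that kind of noisy credit assignment; the concept names, the activity model, and the noise level are all invented for illustration:

```python
# A toy model of noisy credit assignment: reward is meant for one concept, but
# the credit-assignment step only has a noisy guess about which concepts were
# active, so some valence sticks to bystander concepts and superstitions.
# Concept names, the activity model, and the noise level are all invented.

import random

random.seed(0)
LEARNING_RATE = 0.1

learned_value = {
    "girlfriend_is_happy": 0.0,    # the concept the designer wanted credited
    "i_am_a_sprout_eater": 0.0,    # a bystander concept active at the same time
    "my_lucky_socks": 0.0,         # pure superstition
}

def noisy_credit(values: dict, reward: float) -> None:
    """Split the reward across concepts in proportion to a noisy activity estimate."""
    activity = {
        c: max(0.0, random.gauss(1.0 if c == "girlfriend_is_happy" else 0.2, 0.3))
        for c in values
    }
    total = sum(activity.values()) or 1.0
    for c in values:
        values[c] += LEARNING_RATE * reward * activity[c] / total

for dinner in range(200):
    noisy_credit(learned_value, reward=1.0)

print({c: round(v, 2) for c, v in learned_value.items()})
# Most of the valence lands on "girlfriend_is_happy", but the bystander concept
# and the lucky socks pick up some positive valence too: cross-talk in action.
```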