The way I would put (I think something like) your comment is: If Jim winds up “wanting to be the kind of person who likes brussels sprouts”, we can ask: How did Jim wind up wanting that particular thing? The answer, one presumes, is that something in Jim’s reward function is pushing for it. I.e., something in Jim’s reward function painted positive valence onto the concept of “being the kind of person who likes brussels sprouts” inside Jim’s world-model.
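To make “painting positive valence onto a concept” concrete, here’s a minimal toy sketch in Python. All of the names, numbers, and the update rule are invented for illustration; I’m not claiming this is how any actual brain algorithm or proposal works:

```python
# Toy sketch (everything here is invented for illustration): a reward function
# "painting" positive valence onto an abstract concept in a world-model.

# Hypothetical world-model: concepts Jim can think about, each with a valence.
world_model = {
    "eating brussels sprouts": 0.0,
    "being the kind of person who likes brussels sprouts": 0.0,
}

def reward_function(active_concept):
    """Stand-in for whatever in Jim's reward function fires on this thought."""
    if active_concept == "being the kind of person who likes brussels sprouts":
        return 1.0  # whether the designer intended this is the whole question
    return 0.0

def update_valence(active_concept, lr=0.1):
    """Credit assignment: nudge the active concept's valence toward the reward."""
    r = reward_function(active_concept)
    world_model[active_concept] += lr * (r - world_model[active_concept])

# If Jim repeatedly entertains the thought and gets rewarded for it, the
# concept ends up with positive valence, i.e., Jim now "wants" it.
for _ in range(50):
    update_valence("being the kind of person who likes brussels sprouts")
print(world_model)
```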
Then we can ask the follow-up question: Was that the designer’s intention? If no, then it’s inner misalignment. If yes, then it’s not inner misalignment.
Actually, that last sentence is too simplistic. Suppose the designer wants Jim to dislike brussels sprouts, but also wants Jim to want to be the kind of person who likes brussels sprouts. …I’m gonna stop right here and ask: What on earth is the designer thinking here?? Why would they want that?? If Jim self-modifies to permanently like brussels sprouts from now on, was that the designer’s intention or not? I don’t know; the designer’s intentions here seem weirdly incoherent, and maybe the designer ought to go back to the drawing board and stop trying to do things that are self-undermining. Granted, in the human case, there are social dynamics that lead to evolution wanting this kind of thing. But in the AGI case, I don’t see any reason for it. I think we should really be trying to design our AGIs such that they want to want the things that they want, which in turn are identical to the things that we humans want them to want.
Back to the other case, where it’s clearly inner misalignment: the designer both wanted Jim to dislike brussels sprouts and wanted Jim to dislike being the kind of person who likes brussels sprouts, but Jim nevertheless somehow wound up wanting to be the kind of guy who likes brussels sprouts. Is there anything that could lead to that? I say: Yes! The existence of superstitions is evidence that people can wind up liking random things for no reason in particular. Basically, there’s a “credit assignment” process that links rewards to abstract concepts, and it’s a dumb noisy algorithm that will sometimes flag the wrong concept. Also, if the designer has intentions about things other than brussels sprouts, there could be cross-talk between the corresponding rewards.
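And here’s a toy sketch of the “dumb noisy credit assignment” point, again with every detail invented for illustration: the reward is “really” about one concept, but another concept that happens to be active at the same time occasionally gets flagged instead, and so it picks up positive valence anyway:

```python
import random

# Toy sketch (all names and numbers invented for illustration): a "dumb noisy"
# credit-assignment rule. When a reward arrives, credit usually goes to the
# concept that actually caused it, but sometimes a different recently-active
# concept gets flagged instead.

valence = {
    "eating ice cream": 0.0,
    "being the kind of person who likes brussels sprouts": 0.0,
}

def assign_credit(recently_active, reward, error_rate=0.1, lr=0.2):
    """Nudge one concept's valence toward the reward. Index 0 is the concept
    that actually caused the reward; with probability error_rate, some other
    recently-active concept gets the credit instead."""
    if random.random() < error_rate:
        target = random.choice(recently_active)
    else:
        target = recently_active[0]
    valence[target] += lr * (reward - valence[target])

for _ in range(200):
    # The reward is "really" about ice cream, but the brussels-sprouts concept
    # happens to be active at the same time (cross-talk between rewards).
    assign_credit(
        ["eating ice cream",
         "being the kind of person who likes brussels sprouts"],
        reward=1.0,
    )

# Almost always, the brussels-sprouts concept has picked up positive valence
# too, even though it was never the actual cause of the reward.
print(valence)
```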