FWIW I think the most important distinction in “alignment” is aligning with somebody’s preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans aren't trying to get an AI which optimizes what people want. They're getting an AI to optimize evaluations of plans: a quotation of human desires, as expressed through those evaluations. And I think that's a subtle distinction which can prove quite fatal.
Right. Many seem to assume that there is a causal chain: good → human desires → human evaluations. They are hoping both that if we do well according to human evaluations, we will be satisfying human desires, and that if we satisfy human desires, we will create a good world. I think both of those assumptions are questionable.
I like the analogy in which we consider an alternative world where AI researchers assume, for whatever parochial reason, that it is actually human dreams that should guide AI behavior. In this world, they ask humans to write down their dreams and try to devise AIs that would make the world match those dreams. There are two assumptions here: (1) that making the world more like human dreams would be good, and (2) that humans can correctly report their dreams. In the case of dreams, both of these assumptions are suspect, right? But what exactly is the difference with human desires? Why do we assume either that they are a guide to what is good or that they can be reported accurately?