because I don’t know what alignment means that I think it’s helpful to have some hand-hold terms like “alignment”
Do you mean “outer/inner alignment”?
Supposing you mean that—I agree that it’s good to say “and I’m confused about this part of the problem”, while also perhaps saying “assuming I’ve formulated the problem correctly at all” and “as I understand it.”
I don’t really disagree with anything you’ve written, but, in general, I think we should allow some of our words to refer to “big confusing problems” that we don’t yet know how to clarify, because we shouldn’t forget about the part of the problem that is deeply confusing, even as we incrementally clarify and build inroads towards it.
Sure. However, in future posts, I will further contend that outer and inner alignment is not an appropriate or natural decomposition of the alignment problem. In my opinion, reifying these terms and reasoning from this frame increases our confusion and tacitly assumes away more promising approaches. (That’s not to say that no one is thinking reasonable and concrete thoughts within that frame. But my actual complaint stands.)
in future posts, I will further contend that outer and inner alignment is not an appropriate or natural decomposition of the alignment problem
Wonderful! I don’t have any complaints per se about outer/inner alignment, but I use it relatively rarely in my own thinking, and it has resolved relatively few of my confusions about alignment.
FWIW I think the most important distinction in “alignment” is aligning with somebody’s preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
FWIW I think the most important distinction in “alignment” is aligning with somebody’s preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment (or model humans) aren’t trying to get an AI which optimizes what people want. They’re getting an AI to optimize evaluations of plans—human desires as quoted via those evaluations. And I think that’s a subtle distinction which can prove quite fatal.
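To gesture at what I mean, here is a minimal toy sketch (the plan features and scoring functions below are invented purely for illustration, not anyone's actual proposal): an agent that selects whichever plan scores highest under human evaluation can end up picking the plan that was optimized to impress the evaluator, rather than the plan people actually want.

```python
# Toy sketch of "optimizing evaluations of plans" vs. "optimizing what people want".
# Everything here (plan features, scoring functions) is a made-up illustration.
import random

random.seed(0)

def true_desirability(plan):
    # What people actually want: never directly visible to the AI.
    return plan["real_value"]

def human_evaluation(plan):
    # What the AI actually optimizes: a noisy, exploitable proxy.
    # Plans can contain features that inflate the score without adding real value.
    noise = random.gauss(0, 0.1)
    return plan["real_value"] + plan["looks_good_to_evaluator"] + noise

plans = [
    {"name": "honest, modest plan",       "real_value": 1.0, "looks_good_to_evaluator": 0.0},
    {"name": "honest, ambitious plan",    "real_value": 2.0, "looks_good_to_evaluator": 0.0},
    {"name": "plan optimized to impress", "real_value": 0.5, "looks_good_to_evaluator": 3.0},
]

# An AI optimizing evaluations picks the plan the evaluator scores highest...
best_by_evaluation = max(plans, key=human_evaluation)
# ...which need not be the plan people actually want.
best_by_desire = max(plans, key=true_desirability)

print("Selected by evaluation:", best_by_evaluation["name"])
print("Selected by true desirability:", best_by_desire["name"])
```

The gap between those two selections is the distinction I'm pointing at: the evaluation is a quotation of the desire, and optimizing the quotation hard enough decouples it from the desire itself.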
Right. Many seem to assume that there is a causal relationship: good → human desires → human evaluations. They are hoping both that if we do well according to human evaluations, then we will be satisfying human desires, and that if we satisfy human desires, we will create a good world. I think both of those assumptions are questionable.
I like the analogy in which we consider an alternative world where AI researchers assumed, for whatever parochial reason, that it was actually human dreams that should guide AI behavior. In this world, they ask humans to write down their dreams, and try to devise AIs that would make the world like that. There are two assumptions here: (1) that making the world more like human dreams would be good, and (2) that humans can correctly report their dreams. In the case of dreams, both of these assumptions are suspect, right? But what exactly is the difference with human desires? Why do we assume either that desires are a guide to what is good or that they can be reported accurately?