I think that talking about loss functions being “aligned” encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think and then act (e.g. “write good novels”—the training goal, in Evan Hubinger’s training stories framework) and why you think you can use a given loss function ℓ_novel to produce that cognition in the agent (the training rationale).
Very much agree with this.
Suppose you told me, “TurnTrout, we have definitely produced a loss function which is aligned with the intended goal, and inner-aligned an agent to that loss function.” [What should I expect to see?]
If a person said this to me, what I would expect (if the person was not mistaken in their claim) is that they could explain an insight to me about what it means for an algorithm to “achieve a goal” like “writing good novels” and how they had devised a training method to find an algorithm that matches this operationalization. It is precisely because I don’t know what alignment means that I think it’s helpful to have some hand-hold terms like “alignment” to refer to the problem of clarifying this thing that is currently confusing.
I don’t really disagree with anything you’ve written, but, in general, I think we should allow some of our words to refer to “big confusing problems” that we don’t yet know how to clarify, because we shouldn’t forget about the part of the problem that is deeply confusing, even as we incrementally clarify and build inroads towards it.
because I don’t know what alignment means that I think it’s helpful to have some hand-hold terms like “alignment”
Do you mean “outer/inner alignment”?
Supposing you mean that—I agree that it’s good to say “and I’m confused about this part of the problem”, while also perhaps saying “assuming I’ve formulated the problem correctly at all” and “as I understand it.”
I don’t really disagree with anything you’ve written, but, in general, I think we should allow some of our words to refer to “big confusing problems” that we don’t yet know how to clarify, because we shouldn’t forget about the part of the problem that is deeply confusing, even as we incrementally clarify and build inroads towards it.
Sure. However, in future posts, I will further contend that outer and inner alignment is not an appropriate or natural decomposition of the alignment problem. In my opinion, reifying these terms and reasoning from this frame increases our confusion and tacitly assumes away more promising approaches. (That’s not to say that no one is thinking reasonable and concrete thoughts from that frame. But my actual complaint stands.)
in future posts, I will further contend that outer and inner alignment is not an appropriate or natural decomposition of the alignment problem
Wonderful! I don’t have any complaints per se about outer/inner alignment, but I use it relatively rarely in my own thinking, and it has resolved relatively few of my confusions about alignment.
FWIW I think the most important distinction in “alignment” is aligning with somebody’s preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
FWIW I think the most important distinction in “alignment” is aligning with somebody’s preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans aren’t trying to get an AI which optimizes what people want. They’re getting an AI to optimize evaluations of plans: human desires as quoted through those evaluations, not the desires themselves. And I think that’s a subtle distinction which can prove quite fatal.
Right. Many seem to assume that there is a causal chain: good → human desires → human evaluations. They are hoping both that if we do well according to human evaluations then we will be satisfying human desires, and that if we satisfy human desires, we will create a good world. I think both of those assumptions are questionable.
I like the analogy in which we consider an alternative world where AI researchers assumed, for whatever parochial reason, that it was actually human dreams that should guide AI behavior. In this world, they ask humans to write down their dreams, and try to devise AIs that would make the world like that. There are two assumptions here: (1) that making the world more like human dreams would be good, and (2) that humans can correctly report their dreams. In the case of dreams, both of these assumptions are suspect, right? But what exactly is the difference with human desires? Why do we assume that either they are a guide to what is good or can be reported accurately?
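As a minimal toy sketch of the evaluation-versus-desire gap discussed above (my own illustration, not something from the upcoming post, with all names and numbers invented for the example): suppose each candidate plan has a true desirability plus an exploitable gap between that desirability and the rating a human would give it. An optimizer that selects on ratings over a large plan space tends to find plans where the gap is large, not plans that are actually good.

```python
import random

random.seed(0)

# Toy model (illustrative only): each "plan" has a true desirability and a
# separate "evaluation gap" capturing how much the human rating overshoots it.
plans = [
    {
        "true_value": random.gauss(0, 1),          # what we actually care about
        "evaluation_gap": random.gauss(0, 1) * 3,  # exploitable rating error
    }
    for _ in range(100_000)
]

def human_evaluation(plan):
    # The score the optimizer sees: correlated with true value, but not identical.
    return plan["true_value"] + plan["evaluation_gap"]

# An optimizer that maximizes evaluations of plans...
best_by_evaluation = max(plans, key=human_evaluation)
# ...versus one that (impossibly) maximizes what we actually want.
best_by_true_value = max(plans, key=lambda p: p["true_value"])

print("plan chosen by evaluation: true value =", round(best_by_evaluation["true_value"], 2))
print("plan chosen by true value: true value =", round(best_by_true_value["true_value"], 2))
```

Under heavy selection pressure, the top-rated plan is usually one that exploits the gap term rather than one with high true value, which is one way the good → human desires → human evaluations chain can break at the last link.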