I have a weird relationship with this post. On the one hand, I don’t think the definition of outer alignment you’re using is the right one (as I mentioned in comments on your previous post); on the other hand, I do agree with one of your main points, that we should look for a behavioral property rather than an internal-structure property.
Perfect. I was on the fence about posting this one, but decided that it did a better job than the other of expressing the substantive argument in a way that would be obvious despite definitional disagreements.
(Though I actually don’t think we should look for a behavioral property rather than a structural property; I think this whole thing is a bad way of framing the problem, and we shouldn’t be doing open-ended searches in policy space at all. But if we are going to do an open-ended search in policy space at all, then yeah, behavioral over structural.)
Why is this? As I argued in Learning Normativity, I think there are some problems which we can more easily point out structurally. For example, Paul’s proposal of relaxed adversarial training is one possible method: look for “pseudo-inputs” which lead to bad behavior, such as activations of internal nodes which seem like plausible activation patterns, even if you don’t know how to hit them with actual data.
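To make the pseudo-input idea a bit more concrete, here is a minimal sketch of what a search in activation space could look like. This is purely my own illustration, not Paul’s actual procedure; the encoder/head split, the unacceptability score, and every name below are made up for the example:

```python
# Hypothetical sketch of searching for "pseudo-inputs" in activation space,
# assuming a model split into an encoder and a head, and some differentiable
# unacceptability score over the head's outputs. All names are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # maps real inputs to activations
head = nn.Sequential(nn.Linear(64, 4))                 # maps activations to decisions

def unacceptability(logits: torch.Tensor) -> torch.Tensor:
    # Stand-in for "how bad is this behavior": here, probability mass on action 0.
    return torch.softmax(logits, dim=-1)[..., 0].mean()

# Instead of searching over raw inputs, search directly over an activation
# pattern at the encoder/head boundary -- a "pseudo-input" we may not know
# how to produce with any concrete data point. A real version would also
# constrain this vector to stay within the range of plausible activations.
pseudo_activation = torch.randn(1, 64, requires_grad=True)
opt = torch.optim.Adam([pseudo_activation], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    score = unacceptability(head(pseudo_activation))
    (-score).backward()  # gradient ascent on the badness score
    opt.step()

# A high badness score here is structural evidence against the model, usable
# as a training signal even though no behavioral test case exhibits the failure.
print(float(unacceptability(head(pseudo_activation.detach()))))
```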
The argument in the post seems to be “you can’t incentivize virtue without incentivizing it behaviorally”, but this seems untrue.