Perfect. I was on the fence about posting this one, but decided that it did a better job than the other of expressing the substantive argument in a way that would be obvious despite definitional disagreements.
(Though I actually don’t think we should look for a behavioral property rather than a structural property; I think this whole thing is a bad way of framing the problem, and we shouldn’t be doing open-ended searches in policy space at all. But if we are going to do an open-ended search in policy space, then yeah, behavioral over structural.)
Why is this? As I argued in Learning Normativity, I think there are some problems which we can more easily point out structurally. Paul’s proposal of relaxed adversarial training is one possible method: look for “pseudo-inputs” which lead to bad behavior, such as activations of some internal nodes which seem like plausible activation patterns, even if you don’t know how to produce them with actual data.
The argument in the post seems to be “you can’t incentivize virtue without incentivizing it behaviorally”, but I don’t think that’s true.