I have a weird relationship with this post. On the one hand, I don’t think the definition of outer alignment you’re using is the right one (as I mentioned in comments on your previous post); on the other hand, I do agree with one of your main points, that we should look for a behavioral property rather than an internal-structure property.
Perfect. I was on the fence about posting this one, but decided that it did a better job than the other of expressing the substantive argument in a way that would be obvious despite definitional disagreements.
(Though I actually don’t think we should look for a behavioral property rather than a structural property; I think this whole thing is a bad way of framing the problem, and we shouldn’t be doing open-ended searches in policy space at all. But if we are going to do an open-ended search in policy space at all, then yeah, behavioral over structural.)
Why is this? As I argued in Learning Normativity, I think there are some problems which we can more easily point out structurally. For example, Paul’s proposal of relaxed adversarial training is one possible method: look for “pseudo-inputs” which lead to bad behavior, such as activations of internal nodes which seem like plausible activation patterns, even if you don’t know how to hit them with actual data.
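To make the pseudo-input idea a bit more concrete, here is a minimal sketch of what a search in activation space could look like. This is purely my own illustration, not Paul’s actual procedure; the encoder/head split, the unacceptability score, and every name below are made up for the example:

```python
# Hypothetical sketch of searching for "pseudo-inputs" in activation space,
# assuming a model split into an encoder and a head, and some differentiable
# unacceptability score over the head's outputs. All names are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # maps real inputs to activations
head = nn.Sequential(nn.Linear(64, 4))                 # maps activations to decisions

def unacceptability(logits: torch.Tensor) -> torch.Tensor:
    # Stand-in for "how bad is this behavior": here, probability mass on action 0.
    return torch.softmax(logits, dim=-1)[..., 0].mean()

# Instead of searching over raw inputs, search directly over an activation
# pattern at the encoder/head boundary -- a "pseudo-input" we may not know
# how to produce with any concrete data point. A real version would also
# constrain this vector to stay within the range of plausible activations.
pseudo_activation = torch.randn(1, 64, requires_grad=True)
opt = torch.optim.Adam([pseudo_activation], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    score = unacceptability(head(pseudo_activation))
    (-score).backward()  # gradient ascent on the badness score
    opt.step()

# A high badness score here is structural evidence against the model, usable
# as a training signal even though no behavioral test case exhibits the failure.
print(float(unacceptability(head(pseudo_activation.detach()))))
```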
The argument in the post seems to be “you can’t incentivize virtue without incentivizing it behaviorally”, but this seems untrue.