Thanks!
I wasn’t expecting such a detailed answer; I guess I should have asked a more specific question. This is great, though. The thing I was confused about is that “Capabilities generalize further than alignment” makes it sound like capabilities-properties or capabilities-skills (such as accurate beliefs, heuristics for arriving at accurate beliefs, useful convergent instrumental strategies, etc.) will work in a wider range of environments than alignment-properties like honesty, niceness, etc. But I don’t think what you’ve said establishes that.
But I think what you mean is different. You mean: “If you train an AI using human feedback on diverse tasks, hoping it’ll acquire both general-purpose capabilities and also robust alignment properties, what’ll happen by default is that it DOES acquire the former but not the latter.” (And the reason for this is basically that capabilities-properties are more simple/natural/universal/convergent than alignment-properties. With alignment-properties, there are all sorts of other similarly simple properties that perform just as well in training; for capabilities-properties there generally aren’t, at least not in sufficiently diverse, challenging environments. In simple environments, models just ‘memorize’ or otherwise learn ‘simple heuristics that don’t generalize.’)
Is this an accurate summary of your view?
Yes, as far as I can tell. “Alignment properties” do not seem to me to be convergent or universal in any way.