It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing.
I agree with this. Problems with learning this preference should show up as bad predictions (I think I was confused when I wrote that the problem only arises with internal representations). On reflection, you're right that a system that correctly learns abstract human preferences would also learn the preference for autonomy, so this is really a special case of zero-shot transfer learning of abstract preferences. My main motivation for studying the preference for autonomy specifically is that a simple version of it might be turned into a model of corrigibility.
Are you hoping to do transfer learning for human preferences in a way that depends on a detailed understanding of those preferences (in particular, on a detailed understanding of the human preference for autonomy)?
I think I mostly want some story for why the preference for autonomy is even in the model’s hypothesis space. It seems that if we’re already confident that the system can learn abstract preferences, then we could also be confident that the system can learn the preference for autonomy; but maybe it’s more of a problem if we aren’t confident of this (e.g. the system is only supposed to learn and optimize for fairly concrete preferences).
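To make the hypothesis-space worry concrete, here is a minimal toy sketch (all of the feature names, states, and numbers below are hypothetical illustrations, not anything from an actual system): if reward hypotheses are linear in a feature map that only exposes concrete outcome features, then no amount of data lets the learner represent a preference over whether the human retains the ability to intervene; adding an abstract "autonomy" feature puts that preference back in the hypothesis space.

```python
import numpy as np

def concrete_features(state):
    # Concrete outcome features only: task progress and (negated) resource cost.
    return np.array([state["task_progress"], -state["resource_cost"]])

def abstract_features(state):
    # Concrete features plus one abstract feature: whether the human retains
    # the ability to intervene (a crude stand-in for "autonomy").
    return np.append(concrete_features(state), float(state["human_can_intervene"]))

def fit_linear_reward(feature_fn, states, rewards):
    # Least-squares fit of a linear reward model over the given feature map.
    X = np.stack([feature_fn(s) for s in states])
    w, *_ = np.linalg.lstsq(X, np.array(rewards), rcond=None)
    return w

# Hypothetical training data: the human's judgments penalize losing the
# ability to intervene, even when the concrete outcome is identical.
states = [
    {"task_progress": 1.0, "resource_cost": 0.2, "human_can_intervene": True},
    {"task_progress": 1.0, "resource_cost": 0.2, "human_can_intervene": False},
    {"task_progress": 0.5, "resource_cost": 0.4, "human_can_intervene": True},
    {"task_progress": 0.5, "resource_cost": 0.4, "human_can_intervene": False},
]
observed = [1.6, 0.8, 0.9, 0.1]  # made-up human reward judgments

w_concrete = fit_linear_reward(concrete_features, states, observed)
w_abstract = fit_linear_reward(abstract_features, states, observed)

# The concrete model assigns identical predictions to states that differ only
# in whether the human can intervene, so the preference for autonomy is simply
# not in its hypothesis space; the abstract model fits the judgments exactly.
for s, r in zip(states, observed):
    print(round(concrete_features(s) @ w_concrete, 2),
          round(abstract_features(s) @ w_abstract, 2),
          r)
```

In this toy setup the concrete-feature learner predicts the same reward for the paired states regardless of the data it sees, which is the sense in which a system that is only supposed to learn fairly concrete preferences may not have the preference for autonomy available to learn at all.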