If we end up having a separation between initial “seed” meta-preferences and the general data used to learn human preferences, then I think it’s possible that part of the seed could be text that gets interpreted as authoritative about our preferences in a privileged sort of way (at least at first).
I’m not sure a conversation is quite the right picture, though. There might be some active learning on the AI’s part, but it won’t have to flow like a human conversation, and it will be augmented by a lot of feedback from interpretability tools.