This happens during fine-tuning training already, selecting for weights that give the higher human-rated response of two (or more) options. It’s a starting point that can be lost later on, but we do have it now with respect to configurations of weights giving different observed behaviors.
This happens during fine-tuning training already, selecting for weights that give the higher human-rated response of two (or more) options. It’s a starting point that can be lost later on, but we do have it now with respect to configurations of weights giving different observed behaviors.