CarlShulman comments on Godzilla Strategies

CarlShulman 11 Jun 2022 20:51 UTC
4 points
−5
This already happens during fine-tuning: training selects for weights that produce the response humans rate higher out of two (or more) options. That is a starting point that can be lost later on, but we do have it now with respect to configurations of weights that give different observed behaviors.
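
To make the selection pressure concrete, here is a minimal sketch of pairwise preference training. It assumes a Bradley-Terry style loss and a toy one-weight "model" whose score is `w * feature`; neither is specified in the comment, and the names `pairwise_preference_loss` and `train_on_preferences` are hypothetical, chosen only for illustration.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_on_preferences(pairs, w=0.0, lr=0.5, steps=200):
    """Gradient descent on a single weight w (toy model: score = w * feature).

    Each pair is (feature_of_preferred, feature_of_rejected), standing in
    for a human rating one response above another.
    """
    for _ in range(steps):
        for f_pref, f_rej in pairs:
            diff = f_pref - f_rej
            margin = w * diff
            # d/dw of -log(sigmoid(w * diff)) is -sigmoid(-margin) * diff.
            grad = -sigmoid(-margin) * diff
            w -= lr * grad
    return w


# Hypothetical data: raters consistently prefer the response with the larger feature.
pairs = [(1.0, 0.0), (0.8, 0.2)]
w = train_on_preferences(pairs)
print("learned weight:", w)
print("loss on first pair:", pairwise_preference_loss(w * 1.0, w * 0.0))
```

Under these assumptions, the update moves `w` toward configurations that score the human-preferred response higher, which is the sense in which training selects among weight configurations by their observed behavior. Nothing in the sketch prevents later training from moving `w` elsewhere.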