evhub comments on How likely is deceptive alignment?

evhub 1 Dec 2023 21:58 UTC
LW: 2 AF: 2
I mean “training signal” quite broadly there, to include anything that might affect the model’s ability to preserve its goals during training. I probably should have used a different phrase, though I’m not sure what the best one would be. To be clear, I think a deceptive model would likely be attempting to fool both direct training signals, like the loss, and indirect training signals, like developer perceptions.