I mean “training signal” quite broadly there to include anything that might affect the model’s ability to preserve its goals during training—probably I should have just used a different phrase, though I’m not exactly sure what the best phrase would be. To be clear, I think a deceptive model would likely be attempting to fool both the direct training signals like loss and the indirect training signals like developer perceptions.
I mean “training signal” quite broadly there to include anything that might affect the model’s ability to preserve its goals during training—probably I should have just used a different phrase, though I’m not exactly sure what the best phrase would be. To be clear, I think a deceptive model would likely be attempting to fool both the direct training signals like loss and the indirect training signals like developer perceptions.