models where the reason it’s aligned is that it’s trying to game the training signal.
Isn’t the reason it’s aligned supposed to be so that it can pursue its ulterior motive, and if it looks unaligned the developers won’t like it and they’ll shut it down? Why do you think the AI is trying to game the training signal directly, instead of just managing the developers’ perceptions of it?
I mean “training signal” quite broadly there to include anything that might affect the model’s ability to preserve its goals during training—probably I should have just used a different phrase, though I’m not exactly sure what the best phrase would be. To be clear, I think a deceptive model would likely be attempting to fool both the direct training signals like loss and the indirect training signals like developer perceptions.