A model that just predicts “what the ‘correct’ choice is” doesn’t seem likely to actually do all the stuff that’s instrumental to preventing itself from getting turned off, given the capabilities to do so.
But I’m also just generally confused about whether the threat model here is, “A simulated ‘agent’ instantiated by some prompt does all the stuff sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that regenerate the same agent in a different context window,” or “The RLHF-trained model has goals that it pursues regardless of the prompt,” or something else.