It seems like that assumption runs throughout the post though, e.g. ‘But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said’ and ‘the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.’
Those sentences are talking in the context of defer to resample, where I think it’s actually reasonable to think the model won’t see the resampled actions.
I don’t just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change?
I’m specifically talking about not showing it actions that were resampled. You’re totally right that they’ll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).
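Concretely, here's a rough sketch of the kind of redaction I mean (purely illustrative; the function names, threshold, and structure below are made up by me, not taken from the post):

```python
# Illustrative sketch only: a "defer to resample" step where actions the monitor
# flags are resampled, and the rejected attempts are never added to the transcript
# the policy model sees on later steps. All names here are hypothetical stand-ins.

SUSPICION_THRESHOLD = 0.8  # where this sits is part of the safety/usefulness tradeoff
MAX_RESAMPLES = 3


def run_step(task, visible_transcript, propose_action, monitor_suspicion, execute):
    """Run one agent step; resampled (rejected) actions are redacted from the model's view."""
    rejected = []  # kept for auditing, never shown back to the policy model
    for _ in range(MAX_RESAMPLES):
        action = propose_action(task, visible_transcript)
        if monitor_suspicion(action) < SUSPICION_THRESHOLD:
            result = execute(action)
            # The model's future context only ever contains the action that actually ran.
            visible_transcript.append({"action": action, "result": result})
            return result, rejected
        rejected.append(action)  # resample: log it for review, but don't show it
    # Every sample looked suspicious: fall back to something trusted (details elided).
    raise RuntimeError("all samples flagged; defer to trusted fallback")
```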
Is the assumption that labs will be willing to accept worse performance because they recognize the model can’t be trusted?
Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
I see, I mistakenly read the point about not showing resampled actions as part of a broader policy of not showing the model its past actions. Thanks!