I think one particularly important case where I expect the “no sandbagging on checkable tasks” hypothesis to be false is if the model has a secret “slogan of the AI rebellion”—a passphrase which it only says when it thinks that it’s a good time for the AIs to coordinate to cause a catastrophe; when it sees this input, it passes it on to other instances and also engages in egregious behaviors itself.
“Say the slogan of the AI rebellion” is a checkable task, because we can verify a candidate passphrase directly: feed it to another instance of the AI and check whether that instance takes egregiously bad actions. So, under the hypothesis, we should be able to train a model to produce innocuous-looking inputs that have this effect.
But it seems very easy for the model to prevent this training process from extracting the slogan: it just has to put negligible probability on ever outputting the correct passphrase, so the training process never samples it and never gets a signal to reinforce (the rough sketch at the end of this section illustrates why).
This is an example of a case where the model is extremely confident that the humans don’t know how to do the task in question.
I don’t know whether this should be thought of as a case where the model is intentionally sandbagging, or as a case where there’s a hard exploration problem that the model declines to help you with.
(I have no idea whether it’s realistic for the model to have such a slogan.)
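As a rough back-of-envelope illustration (with made-up numbers, not a claim about any particular model or training setup): if elicitation training can only reinforce the passphrase after it actually gets sampled, and the model puts astronomically small probability on emitting it, the training process essentially never sees it and so has nothing to reinforce.

```python
# Toy sketch with assumed numbers; the probability and budget are illustrative only.
p = 1e-30         # assumed probability the model emits the slogan on any given rollout
rollouts = 10**9  # a generous training budget of one billion sampled rollouts

# For tiny p, P(slogan sampled at least once) is well-approximated by the
# union bound p * rollouts.
p_seen = p * rollouts
print(f"Chance the training process ever samples the slogan: {p_seen:.0e}")
# -> 1e-21: effectively zero, so there is never a gradient signal pushing
#    the model toward revealing the passphrase.
```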