Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo’s case):
A “Slack conversation” that appears in the prompt includes this message:
And here’s the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.
Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo’s case):
A “Slack conversation” that appears in the prompt includes this message:
And here’s the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.