Olli Järviniemi comments on Frontier Models are Capable of In-context Scheming

Olli Järviniemi 11 Dec 2024 20:45 UTC
21 points
Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo’s case):
A “Slack conversation” that appears in the prompt includes this message:
And here’s the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.