Almost certainly not an original idea: given the increasing fine-tuning access to models (see also OpenAI's recent reinforcement fine-tuning announcement), test whether fine-tuning on goal-directed agent tasks for a while induces the kinds of scheming seen in the paper. One cheap version: fine-tune the model on its own actions from trajectories where it successfully solved SWE-Bench problems (a rough sketch of what that pipeline could look like is below).
(I think some of the Redwood folks might have already done something similar but haven’t published it yet?)
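A minimal sketch of the cheap version, assuming you've already collected the model's successful SWE-Bench rollouts as chat-format message lists (the `load_successful_trajectories` helper and the chosen model snapshot are my assumptions, not anything from the idea itself), using the standard OpenAI supervised fine-tuning API rather than the new reinforcement fine-tuning program:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical helper: each trajectory is a full agent rollout
# (system prompt, tool calls, observations, final patch) that
# actually solved its SWE-Bench task.
successful_trajectories = load_successful_trajectories()

# OpenAI chat fine-tuning expects one {"messages": [...]} object per line.
with open("swebench_self_trajectories.jsonl", "w") as f:
    for messages in successful_trajectories:
        f.write(json.dumps({"messages": messages}) + "\n")

# Upload the training data and kick off a supervised fine-tuning job.
training_file = client.files.create(
    file=open("swebench_self_trajectories.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # any fine-tunable snapshot would do
)
print(job.id)
```

The experiment would then be to re-run the paper's scheming evals on the resulting checkpoint and compare against the base model.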