I agree that a lot of the purported scheming capability could be an artifact of the prompts talking about AIs/employees in context. A bigger problem for basically all capabilities evaluations at this point is that, with a few exceptions, AIs are unable to do anything that isn't mediated by language, and this sort of evaluation is plausibly plagued by that.
Two things to say here:
1. This is still concerning if it keeps holding, and in more capable models it would look exactly like scheming, because the data is a large part of what makes it meaningful to talk about a simulacrum from a model being aligned.
2. This does have a fix, but it's an expensive one, and OpenAI might not be willing to pay those costs.