I agree that a lot of the purported scheming capability could be an artifact of the prompts talking about AIs/employees in context. A bigger problem for basically all capabilities evaluations at this point is that, with a few exceptions, AIs are unable to do anything that isn't mediated by language, and this sort of evaluation is plausibly plagued by that.
Two things to say here:
1. This is still concerning if it keeps holding, and in more capable models it would look exactly like scheming, because the data is a large part of what makes it meaningful to talk about a simulacrum from a model being aligned.
2. This does have a fix, but it's an expensive one, and OpenAI might not be willing to pay those costs.