GPT-4 is insufficiently capable to pull off a treacherous turn, even if it were given an agent structure, memory, and goal set to match. The whole point of the treacherous turn argument is that the AI will wait until it can win before turning against you, and until then play along.
I don’t get why actual ability matters. It’s sufficiently capable to pull it off in some simulated environments. Are you claiming that we can’t deceive GPT-4, and that it is actually waiting and playing along just because it can’t really win?
It sure doesn’t seem to generalize in GPT-4o’s case. But what’s the hypothesis for Sonnet 3.5 refusing in 85% of cases? And CoT improving the score, plus o1 doing better in the browser setting, suggests the problem is the models not understanding consequences, not them not trying to be good. What’s the rate of capability generalization to agent environments? Are we going to conclude that Sonnet just demonstrates reasoning, instead of doing it for real, if it solves only 85% of the tasks it correctly talks about?
Also, what’s the rate of generalization for avoiding unprompted problematic behaviour? It’s much less of a problem if your AI does what you tell it to do—you can just not give it to users, tell it to invent nanotechnology, and win.