The ARC evals, which showed that a GPT-4-based agent, when given help and a general directive to replicate, was able to figure out that it ought to lie to a TaskRabbit worker, are an example of it figuring out a self-preservation/power-seeking subgoal, which is on the road to general self-preservation. But they don’t demonstrate an AI spontaneously developing self-preservation or power-seeking as an instrumental subgoal to something that superficially has nothing to do with gaining power or replicating.
Of course, we have some real-world examples of specification gaming, like the ones you linked in your answer: those have always existed, and we see more ‘intelligent’ examples, such as AIs convinced of false facts trying to persuade people that they’re true.
There’s supposedly some evidence here that power-seeking instrumental subgoals develop spontaneously, but how spontaneous this actually was is debatable, so I’d call that evidence ambiguous, since it wasn’t observed in the wild.