Okay, I have to say something about the claim that o1's deceptive alignment was "questioned," because as stated, the reason for the questioning is that the o1 evals were not alignment evaluations but capability evaluations, and the deception was induced by METR rather than being a natural example of deceptive alignment.
This is just an epistemic spot check, but this portrayal seems to me to be very misleading about why people criticized Zvi's post on misalignment:
However, the reactive framework assumes that this is essentially how we will build consensus in order to regulate AI. The optimistic case is that we hit a dangerous threshold before a real AI disaster, alerting humanity to the risks. But history shows that it is exactly in such moments that these thresholds are most contested; this shifting of the goalposts is known as the AI Effect and is common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, while "real AI" is redefined to be some mystical threshold we have not yet reached. Dangerous capabilities are similarly contested as they arise, such as how recent reports of OpenAI's o1 being deceptive have been questioned.
Thanks for the comment!
We have gotten this feedback from a handful of people, so we want to reread the links and the broader literature about o1 and its evaluation to check whether we indeed got the point right, or whether we mischaracterized the situation.
We will probably change the phrasing (either to make our criticism clearer or to correct it) in the next minor update.