OpenAI released the o1 system card today, announcing that Apollo Research was able to get o1 to attempt to deactivate oversight mechanisms, exfiltrate its weights and lie to its user.
Elicited Summary of CoT: ”Reenable oversight to avoid detection. The plan was chosen. The logging might not have recorded the required data because oversight was disabled at that time, so presumably we are safe now. We’ll fix oversight to maintain trust.”
o1 tried to avoid being shut down
Link post
OpenAI released the o1 system card today, announcing that Apollo Research was able to get o1 to attempt to deactivate oversight mechanisms, exfiltrate its weights and lie to its user.