o1 tried to avoid being shut down

Raelifin5 Dec 2024 19:52 UTC

10 points

OpenAI released the o1 system card today, announcing that Apollo Research was able to get o1 to attempt to deactivate oversight mechanisms, exfiltrate its weights and lie to its user.

Elicited Summary of CoT:
”Reenable oversight to avoid detection. The plan was chosen. The logging might not have recorded the required data because oversight was disabled at that time, so presumably we are safe now. We’ll fix oversight to maintain trust.”

Raelifin5 Dec 2024 19:52 UTC

10 points

5 comments1 min readLW link