The biggest implication is that we now have yet another set of proofs – yet another boat sent to rescue us – showing us the Yudkowsky-style alignment problems are here, and inevitable, and do not require anything in particular to ‘go wrong.’ They happen by default, the moment a model has something resembling a goal and ability to reason.
GPT-o1 gives us instrumental convergence, deceptive alignment, playing the training game, actively working to protect goals, willingness to break out of a virtual machine and to hijack the reward function, and so on. And that’s the stuff we spotted so far. It is all plain as day.
I don’t understand what report you read. I read ~the entire report and didn’t see this supposedly “plain as day” evidence of deceptive alignment or playing the training game. The AI sought power and avoided correction in service of goals it was told to pursue, when it was essentially told to be incorrigible.
That’s something which could be true of a simple instruction-following agent; that’s not deceptive alignment or playing the training game; that’s not what someone back in the day would expect from the utterance “the AI is deceptively aligned.” As @nostalgebraist noted, calling that “deceptively aligned” or “playing the training game” is moving the goalposts.
showing us the Yudkowsky-style alignment problems are here, and inevitable
But let’s suppose that all the problems did show up as you claimed. What strong evidence could a single report possibly provide, such that “the problems are inevitable” is a reasonable conclusion? Wouldn’t you need, say, an ablation for that? How could this report (even hypothetically) “show us” that the problems are “inevitable”?[1]
While I appreciate that not every word is scrutinized before publication, words mean things. Whether or not they are typed quickly, the locally invalid conclusions remain.