If future, more capable models are indeed actively resisting their alignment training, and this is happening consistently, that seems like an important update to be making?
Could someone explain to me what this resisting behavior during alignment training looked like in practice?
Did the model outright say “I don’t want to do this”, did it produce nonsensical results, did it become deceptive, or did it just … not work?
This claim seems very interesting if true; is there any further information on this?
I just tried multiplying 13-digit numbers with o3-mini (high). My approach was to ask it to explain a basic multiplication algorithm to me and then carry it out. On the first try it was lazy and didn’t actually follow the algorithm (it just told me “it would take a long time to actually carry out all the shifts and multiplications...”), and it got the result wrong.
Then I told it to follow the algorithm, even if it was time-consuming, and it did, and the result was correct.
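For concreteness, here’s a rough sketch of the kind of “basic multiplication algorithm” I had in mind: digit-by-digit long multiplication with explicit shifts and partial products, rather than computing the product in one go. The code and names below are my own illustration, not anything o3-mini wrote, and the numbers in the sanity check are arbitrary rather than the ones I actually used.

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers digit by digit, pencil-and-paper style."""
    a_digits = [int(d) for d in str(a)][::-1]  # least significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    total = 0
    for shift, b_digit in enumerate(b_digits):
        carry = 0
        partial_digits = []
        for a_digit in a_digits:
            prod = a_digit * b_digit + carry
            partial_digits.append(prod % 10)
            carry = prod // 10
        if carry:
            partial_digits.append(carry)
        # Reassemble the partial product and shift it into its column.
        partial = int("".join(str(d) for d in reversed(partial_digits)))
        total += partial * 10 ** shift
    return total

# Sanity check with two arbitrary 13-digit numbers (not the ones I actually used):
x, y = 4821937460215, 9083274615092
assert long_multiply(x, y) == x * y
```

Each individual step here is trivial; the question is only whether the model reliably executes all of them instead of eyeballing the result.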
So I’m not sure about that take. What I saw was that the model got lazy, did some part of the calculation “in its head” (i.e. not actually following the algorithm but guesstimating the result, like we would if asked to do such a task without pencil and paper), and got the result slightly wrong. But when you ask it to actually follow the multiplication algorithm it just explained to me, it can absolutely do it.
I’d be interested in the CoT that led to the incorrect conclusion. If the model actually believed that its lazy estimation would lead to the correct result, that shows it’s overestimating its own capabilities. One could call this a fundamental misunderstanding of multiplication (I know that I’m incorrect when I’m estimating the result in my head, because I understand stuff about multiplication), or one could call it a failure of introspection.
The other possibility is that it simply didn’t care about producing an entirely correct result and got lazy.