If future, more capable models are indeed consistently and actively resisting their alignment training, that seems like an important update to be making?
Could someone explain to me what this resisting behavior during alignment training looked like in practice?
Did the model outright say "I don't want to do this"? Did it produce nonsensical results, did it become deceptive, did it just … not work?
This claim seems very interesting if true; is there any further information on it?