Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense; I don’t think “refuses to carry out phishing attacks when asked” is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn’t part of the evaluations ARC has done so far; the models are just jailbroken in the normal way for the small number of tasks that require it.)
I think relevant forms of “misalignment in the lab” are more like: we created a situation where the model does what the (simulated) human wants during training, but then abruptly switches to overpowering the simulated human once it has the opportunity to do so. Or we created a situation where a model really wants to do X, is fine-tuned to do Y, and then abruptly switches to pursuing X once it thinks it is no longer being trained. Or we created an AI that deliberately tries to answer questions dishonestly and tested whether we were able to catch it. Or so on.