The issue is that you need the model’s capability to be a smooth function of the artificial impairment. If there’s a phase change as the impairment decreases, that’s basically the same as a phase change during training. (Indeed, in some sense we already start new models out severely impaired and then very gradually reduce the impairment: that’s the training process.) What would be different about this hypothetical impairment, such that we would not expect a phase change in capability as the impairment is reduced, even in worlds where we do expect a phase change in capability during ordinary training? (And bear in mind that we need to construct this impairment-reduction-process without access to an already-trained model which has undergone the phase change, e.g. we can’t just fully train a dangerous model and then add noise, because even training it is dangerous.)
I do think it’s pretty plausible that we could develop techniques which yield a very smooth impairment/capability curve.
Sneak peek of my recent work on this in GPT-2-xl:
[The labels correspond to the topics of the datasets tested. I’d hoped for more topic-specificity in the impairment, but at least it is nice and smooth. I’m still hopeful that combining my work with Garret’s work could greatly improve the topic-specificity of the impairment.]
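To make the shape of this kind of experiment concrete, here is a minimal sketch. It assumes the impairment is Gaussian noise added to the MLP weights and uses a couple of made-up topic-labelled texts as stand-ins for the evaluation datasets; neither of those is the actual technique or data behind the plot above, they’re just placeholders. The point is simply to show what sweeping the impairment strength toward zero and checking for a smooth capability (loss) curve looks like:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device).eval()

# Hypothetical stand-ins for the topic-labelled evaluation datasets in the plot.
topic_texts = {
    "chemistry": ["Mixing sodium with water produces hydrogen gas and heat."],
    "history": ["The Treaty of Westphalia was signed in 1648, ending the war."],
}

def mean_loss(texts):
    """Average next-token prediction loss of the current model over some strings."""
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

# Keep clean copies of the weights we perturb so each sweep step starts fresh.
clean = {n: p.detach().clone() for n, p in model.named_parameters()
         if "mlp" in n and p.dim() == 2}

# Sweep the impairment from strong to none and record capability at each step;
# the hope is that the resulting loss curve is smooth, with no sudden jumps.
for sigma in [0.05, 0.04, 0.03, 0.02, 0.01, 0.0]:
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in clean:
                p.copy_(clean[n] + sigma * torch.randn_like(p))
    row = {topic: mean_loss(texts) for topic, texts in topic_texts.items()}
    print(f"sigma={sigma:.2f}  " +
          "  ".join(f"{t} loss: {v:.3f}" for t, v in row.items()))
```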
Regarding the safety of training and evaluation:
I think that training can be made safer than testing, which in turn can be made MUCH safer than deployment. Therefore, I’m not too worried about phase changes within training, so long as the training is done with reasonable precautions.