To determine whether your models are capable of gradient hacking, run evaluations that test whether your models (and smaller versions of them) can do various tasks related to understanding and influencing their own internals that seem easier than gradient hacking.
What if you have a gradient hacking model that gradient hacks your attempts to get it to gradient hack?
Great question. Our basic response is: it’s probably strictly harder for models to gradient hack than to do various other supervised tasks related to their internals. (One concrete example might be: predicting whether training the model on a datapoint will improve the performance of a linear probe for a particular concept at a particular layer of the model.) It seems likely that the model will be able to do these strictly-easier tasks before it’s able to pull off full gradient hacking. So we recommend measuring a battery of these tasks on the model throughout training (as well as measuring them on smaller models). It seems unlikely to us that models will get good enough at gradient hacking to gradient hack the gradient hacking evals before one of these other evals alerts us to this possibility.
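To make the linear-probe example concrete, here is a minimal sketch of how the ground-truth label for such an eval could be computed. This is a toy illustration with synthetic activations, not a real training stack: `probe_accuracy` and the before/after activation arrays are hypothetical stand-ins for whatever your actual model and training loop produce.

```python
# Sketch of one "strictly easier" eval: did a training step improve a linear
# probe for a concept at a fixed layer? (Toy data; names are placeholders.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def probe_accuracy(acts, labels):
    """Fit a linear probe on layer activations; return held-out accuracy."""
    n = len(labels) // 2
    probe = LogisticRegression(max_iter=1000).fit(acts[:n], labels[:n])
    return probe.score(acts[n:], labels[n:])

# Toy stand-ins for layer activations before and after a training step.
acts_before = rng.normal(size=(200, 16))
labels = (acts_before[:, 0] > 0).astype(int)  # the probed "concept"
acts_after = acts_before + 0.1 * rng.normal(size=acts_before.shape)

acc_before = probe_accuracy(acts_before, labels)
acc_after = probe_accuracy(acts_after, labels)

# Ground truth for the eval: did training on the datapoint help the probe?
probe_improved = acc_after > acc_before
```

The eval itself would then ask the model (e.g. via a prompt describing the datapoint and the probe) to predict `probe_improved`, and score its prediction against this label. A model that can reliably answer such questions has nontrivial knowledge of how training affects its own internals, which plausibly precedes the much harder feat of gradient hacking.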
Separately, we just don’t think it’s that likely that models are able to gradient hack before they’re able to help us get really substantial R&D speedups. So it seems pretty likely that it will be okay to just do these conservative measurements and hope that they never indicate gradient hacking capabilities before human labor is obsolete.