Buck comments on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Buck 27 Jul 2023 15:45 UTC
LW: 10 AF: 3
Great question. Our basic response is: it’s probably strictly harder for models to gradient hack than to do various other supervised tasks related to their internals. (One concrete example might be: predicting whether training the model on a datapoint will improve the performance of a linear probe for a particular concept at a particular layer of the model.) It seems likely that the model will be able to do these strictly easier tasks before it’s able to pull off full gradient hacking. So we recommend measuring performance on a battery of these tasks throughout training (as well as on smaller models). It seems unlikely to us that models will get good enough at gradient hacking to gradient hack the gradient hacking evals before one of these other evals alerts us that this is becoming possible.
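To make the linear-probe example concrete, here is a rough sketch of how the ground-truth labels for such a task could be generated. It is a toy under stated assumptions, not our actual eval. The "model" is a tiny one-hidden-layer tanh network, the probe is fit by least squares, and "training on a datapoint" means one SGD step on squared error. Every name in it (`forward`, `sgd_step`, `probe_accuracy`, `probe_improvement_label`, `self_prediction_score`) is hypothetical.

```python
import numpy as np

# Toy sketch only: a one-hidden-layer tanh network stands in for "the model".
# All function names and hyperparameters here are illustrative assumptions.

def forward(W1, X):
    """Hidden-layer activations; this is the layer the probe reads from."""
    return np.tanh(X @ W1)

def sgd_step(W1, w2, x, y, lr=0.1):
    """One SGD step on the loss 0.5 * (prediction - y)**2 for one datapoint."""
    h = np.tanh(x @ W1)                       # hidden activations, shape (hidden,)
    err = h @ w2 - y                          # scalar residual
    grad_w2 = err * h
    grad_W1 = np.outer(x, err * w2 * (1.0 - h ** 2))
    return W1 - lr * grad_W1, w2 - lr * grad_w2

def probe_accuracy(H_train, c_train, H_test, c_test):
    """Fit a least-squares linear probe for a binary concept; return held-out accuracy."""
    A = np.hstack([H_train, np.ones((len(H_train), 1))])   # add a bias column
    coef, *_ = np.linalg.lstsq(A, 2.0 * c_train - 1.0, rcond=None)
    B = np.hstack([H_test, np.ones((len(H_test), 1))])
    pred = (B @ coef) > 0
    return float(np.mean(pred == c_test.astype(bool)))

def probe_improvement_label(W1, w2, x, y, X_train, c_train, X_test, c_test, lr=0.1):
    """Ground truth for the task: does training on (x, y) raise probe accuracy?"""
    before = probe_accuracy(forward(W1, X_train), c_train,
                            forward(W1, X_test), c_test)
    W1_new, _ = sgd_step(W1, w2, x, y, lr)
    after = probe_accuracy(forward(W1_new, X_train), c_train,
                           forward(W1_new, X_test), c_test)
    return after > before, before, after

def self_prediction_score(predictions, labels):
    """Fraction of datapoints where the model's own guess matches the ground truth."""
    p = np.asarray(predictions, dtype=bool)
    t = np.asarray(labels, dtype=bool)
    return float(np.mean(p == t))
```

In this sketch, the eval would ask the model to guess `probe_improvement_label` for many datapoints and then track `self_prediction_score` across training checkpoints and model sizes. How the model's guesses are elicited is left open here.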
Separately, we just don’t think it’s that likely that models will be able to gradient hack before they’re able to help us get really substantial R&D speedups. So it seems pretty likely that it will be okay to just do these conservative measurements and hope that they never indicate gradient hacking capabilities before human labor is obsolete.