I’m confused by this argument from Redwood’s AI control paper.
5.2 WHY WE DIDN’T CONSIDER STRATEGIES THAT INVOLVE TRAINING THE MODEL
[...]
Any protocol that involves training the model can be transformed into a validation-style protocol that merely validates the model’s performance on the dataset that was used as a train set. In any case where the validation-style protocol fails, we know that the red-team model is not distinguished from an aligned model by the measurements done by the protocol. Given that none of the training examples allow the reward-generating process to determine whether the policy it is training is undesirable, we don’t see why training would transform the model from an undesirable policy to a desirable policy.
It is conceivable that we could find a training scheme that systematically converts undesirable policies to desirable ones, despite the lack of a reward differential causing it to do so. But we don’t know how.
If I understand correctly, the argument is:
1. In order for training on dataset D to convince you a model is desirable, good performance on D needs to be a sufficient condition for being desirable.
2. If good performance on D is a sufficient condition for being desirable, you can just test performance on D to decide whether a model is desirable.
But 2 seems wrong: if good performance on D is sufficient but not necessary for being desirable, then you could have two models, one desirable and one undesirable, which are indistinguishable on D because they both perform badly, and then turn them both into desirable models by training on D.
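To make the two protocols concrete, here's a toy sketch (my own stand-ins, nothing from the paper): `train`, `score`, and the `Model` type below are made up for illustration, and I deliberately don't try to model "desirable", since that depends on behaviour off D.

```python
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[str, str]        # (input, reference output) -- toy stand-in
Model = Callable[[str], str]     # a "model" here is just a function

def score(model: Model, D: Sequence[Example]) -> float:
    """Fraction of D the model gets right (toy stand-in for any measurement on D)."""
    return sum(model(x) == y for x, y in D) / len(D)

def train(model: Model, D: Sequence[Example]) -> Model:
    """Toy stand-in for training: patch the model to reproduce D, fall back otherwise."""
    lookup = dict(D)
    return lambda x: lookup[x] if x in lookup else model(x)

def validation_protocol(model: Model, D: Sequence[Example], threshold: float) -> Optional[Model]:
    """The transformed protocol from the quote: only measure performance on D."""
    return model if score(model, D) >= threshold else None

def training_protocol(model: Model, D: Sequence[Example]) -> Model:
    """The original protocol: train on D and deploy whatever comes out."""
    return train(model, D)
```

The worry is that a desirable and an undesirable model can both be rejected by `validation_protocol` (so the measurements on D never distinguish them), while `training_protocol` might still map both of them to desirable models, so the failure of the transformed protocol doesn't obviously imply the failure of the original one.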
As an extreme example, suppose your desirable model outputs correct code and your undesirable model outputs backdoored code. Using D you train them both to output machine-checkable proofs of the correctness of the code they write. Now both models are desirable because you can just check correctness before deploying the code.
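Here's a toy sketch of that deployment check (again my own illustration; `verify_proof` is a hypothetical stand-in for whatever real proof checker you'd run):

```python
from typing import Callable, Optional, Tuple

ModelOutput = Tuple[str, str]   # (source code, machine-checkable correctness proof)

def deploy_if_verified(
    output: ModelOutput,
    verify_proof: Callable[[str, str], bool],   # hypothetical proof checker
) -> Optional[str]:
    """Deploy the code only if its correctness proof checks out."""
    code, proof = output
    if verify_proof(code, proof):
        return code   # proof verified: deploy, whichever model produced it
    return None       # no valid proof: reject, whichever model produced it
```

With this check in place it doesn't matter that D never distinguished the two models: after training, code from either model only gets deployed when it comes with a valid proof.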