Thanks for your comment (upvoted). Redwood's work is important and relevant, as we note in the paper, but two quick points still need to be made (there are more):
It is not clear that evaluators will fine-tune at all for their evals. They should, and our work partially argues for this as well.
It is unclear how far Redwood’s model organisms of sandbagging generalize to realistic settings. More work is needed here, especially on sample efficiency, given the compute limitations of external evaluators.
Relevant: Redwood found that both fine-tuning and RL can restore the full performance of sandbagging (password-locked) models that were created using fine-tuning.