Thanks for your comment (upvoted). Redwood’s work is important and relevant, as we note in the paper, but two quick points are still worth making (there are more):
It is not clear that evaluators will fine-tune at all for their evals. They should, and our work partially argues for this too.
It is unclear how far Redwood’s model organisms of sandbagging generalize to realistic settings. More work needs to be done here, especially on sample efficiency, given the compute limitations of external evaluators.