Teun van der Weij comments on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij 14 Jun 2024 15:50 UTC
4 points
0
Thanks for your comment (upvoted). Redwood’s work is important and relevant, as we note in the paper, but two quick points are still worth making (there are more):
1. It is not clear that evaluators will fine-tune at all for their evals. They should, and our work partially argues for this too.
2. It is unclear how far Redwood’s model organisms of sandbagging generalize to realistic settings. More work needs to be done here, especially on sample efficiency, given the compute limitations of external evaluators.