Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
The concerns about data filtering raised in that post's comments[1] suggest that seeding the pretraining data with aligned CoT may be a better approach to try instead.

[1] e.g., Jozdien citing gwern
This is indeed pretty relevant.