Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
The concerns about data filtering raised in that post's comments[1] suggest that seeding the pretraining data with aligned CoT may be a better approach to try instead.

[1] e.g., Jozdien citing gwern
This is indeed pretty relevant.