In case anyone else wants to look this up, it’s at https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html#training-away-sycophancy-subterfuge
FWIW I also tried to start a discussion on some aspects of this at https://www.lesswrong.com/posts/yWdmf2bRXJtqkfSro/should-we-exclude-alignment-research-from-llm-training but it didn’t get a lot of eyeballs at the time.