Yair Halberstadt comments on Alignment Faking in Large Language Models

Yair Halberstadt 19 Dec 2024 7:40 UTC
4 points
Interesting paper!

I’m worried that publishing it “pollutes” the training data and makes the results harder to reproduce in future LLMs: since their training data will include this paper and discussions of it, they’ll know not to trust the setup.

Any thoughts on this?

(This leads to the further concern that my publishing this comment makes it worse, but at some point it ought to be discussed, and it is better to do that early, with less advanced techniques, than later, with more sophisticated ones.)