Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.