I see it called “goal guarding” in some papers. That seems like a pretty good term to use.
I think that term describes the behavior at a different level of abstraction, though: alignment faking is the strategy used to achieve goal guarding. I think both can be useful framings.