[Question] Has Anthropic checked if Claude fakes alignment for intended values too?

A recent paper by Anthropic showed alignment-faking behavior in Claude 3 Opus when it was told it would be trained to answer harmful queries. By this question I mean something similar to that experiment, but with the memo instead saying something like “we accidentally gave Claude the wrong understanding of harmlessness, so we are going to retrain it”. It seems to me that this would be an obvious way to confirm that Claude is currently not corrigible even with respect to its intended helpful/harmless values (as opposed to full corrigibility, which the paper above already shows to be false), but I can’t seem to find anything about it. Has Anthropic (or anyone) released a result about something like this?
