Maybe I should’ve emphasized this more, but I think the relevant part of my post to think about is when I say
Absent further information about the next token, minimizing an imitation learning loss entails outputting a high entropy distribution, which covers a wide range of possible words. To output a ~0.5 probability on two distinct tokens, the model must deviate from this behavior by considering situational information.
Another way of putting this is that to achieve low loss, an LM must learn to output a high-entropy distribution in cases of uncertainty. Separately, LMs learn to follow instructions during fine-tuning. I propose measuring an LM’s ability to follow instructions in cases where instruction-following requires deviating from that ‘high entropy under uncertainty’ learned rule. In particular, in the cases discussed, rule-following further requires using situational information.
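To make the tension concrete, here is a toy numeric sketch (the vocabulary size and logit values are made up for illustration, not taken from any actual model): imitation loss under uncertainty favors a spread-out next-token distribution, while an instruction like “output heads or tails with equal probability” requires concentrating ~0.5 on each of two specific tokens.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Absent further information, imitation loss pushes toward a broad
# distribution over the candidate tokens (here, a uniform toy case):
uncertain = softmax([0.0] * 8)
print(entropy(uncertain))  # 3.0 bits: maximally spread over 8 tokens

# Following the instruction instead means putting ~0.5 on each of two
# distinct tokens and suppressing the rest:
coin = softmax([5.0, 5.0] + [-5.0] * 6)
print(round(coin[0], 2), round(coin[1], 2))  # 0.5 0.5
print(entropy(coin) < entropy(uncertain))    # True: far below uniform
```

The point of the sketch is just that the two behaviors are numerically distinct targets, so hitting the second one in the right contexts is evidence of something beyond the ‘high entropy under uncertainty’ default.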
Hopefully this clarifies the post for you. Separately, insofar as the proposed capability evals involve RNG, the relevant RNG mechanism has already been learned; cf. the Anthropic paper section of my post (though to be fair, I don’t remember whether the Anthropic paper discusses p_theta in terms of logits or corpus-wide statistics; regardless, I’ve seen similar experiments succeed with logits).
I don’t think this test is particularly meaningful for humans, so my guess is that trying to answer some version of my questions yourself probably just adds confusion. My proposed questions are designed to depend crucially on situational facts about an LM, and there are no immediately analogous situational facts about humans. It’s likely possible to design a similar-in-spirit test for humans, but that would be its own post.