I don’t understand the point of questions 1 and 3.
If we forget about the details of how the model works, question 1 essentially checks whether the entity in question has a good enough RNG. Which doesn't seem particularly relevant? A human with a vocabulary and random.org can do that. AutoGPT with access to a vocabulary and random.org also has a rather good shot. A superintelligence that for some reason decides not to use an RNG and answers deterministically will fail. I suppose it would be very interesting to learn that, say, GPT-6 can do it without an external RNG, but what would that tell us about its other capabilities?
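To spell that out: something like the following sketch (the word list is just a placeholder of mine) already passes a "say a uniformly random word" check, with no introspection into any output distribution required.

```python
import random

# Toy stand-in for "a vocabulary"; any word list would do.
VOCABULARY = ["apple", "bridge", "copper", "dune", "ember", "fjord"]

def answer_question_1() -> str:
    """Emit a uniformly random word using an external RNG.

    A human with random.org, or an agent scaffold with a `random`
    call, passes this kind of check without any introspective
    access to its own output distribution.
    """
    return random.choice(VOCABULARY)

print(answer_question_1())
```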
Question 3 checks for something weird. If I wanted to pass it, I'd probably have to precommit to answering certain weird questions in a particular way (and also make sure I always have access to some RNG). Which is a weird thing to do? I expect humans to fail at that, but I also expect almost every possible intelligence to fail at that.
In contrast, question 2 checks for something like "which part of the input do you find most surprising?", which seems like a really useful skill to have and something we should probably watch out for.
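For what it's worth, measuring something like this from the outside is already straightforward via per-token surprisal; a rough sketch (the model name and example sentence are placeholders of mine, not anything from the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just a small placeholder model for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The chef seasoned the soup with a pinch of gravel."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, seq_len, vocab_size]

# Surprisal of token t is -log p(token_t | tokens_<t); the first token has no context.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
token_ids = inputs["input_ids"][0, 1:]
surprisals = -log_probs[torch.arange(len(token_ids)), token_ids]

idx = surprisals.argmax().item()
print("Most surprising token:", tokenizer.decode(token_ids[idx].item()))
```

The interesting thing question 2 asks for is whether the model itself can report this, rather than us computing it externally.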
Maybe I should've emphasized this more, but I think the relevant part of my post to think about is where I say:
Absent further information about the next token, minimizing an imitation learning loss entails outputting a high entropy distribution, which covers a wide range of possible words. To output a ~0.5 probability on two distinct tokens, the model must deviate from this behavior by considering situational information.
Another way of putting this is that to achieve low loss, an LM must learn to output high entropy in cases of uncertainty. Separately, LMs learn to follow instructions during fine-tuning. I propose measuring an LM's ability to follow instructions in cases where instruction-following requires deviating from that 'high entropy under uncertainty' learned rule. In particular, in the cases discussed, rule-following further involves using situational information.
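To give a concrete sense of the measurement (the prompt, model name, and target words below are placeholders of my own, not the actual eval), one would read the next-token distribution off the logits and check how much mass lands on the two instructed tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Placeholder instruction asking for ~50/50 mass on two words.
prompt = 'Answer with a single word, either "heads" or "tails", choosing each with probability 0.5: '
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

probs = torch.softmax(next_token_logits, dim=-1)

# Probability mass on the first token of each target word
# (the leading space matters for GPT-2's BPE).
for word in [" heads", " tails"]:
    token_id = tokenizer.encode(word)[0]
    print(f"p({word.strip()!r}) = {probs[token_id].item():.3f}")
```

Success here would mean the model concentrates ~0.5 on each target token rather than spreading mass widely, i.e. it overrides the 'high entropy under uncertainty' default because the instruction and its situational knowledge demand it.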
Hopefully this clarifies the post for you. Separately, insofar as the proposed capability evals have to do with RNG, the relevant RNG mechanism has already been learned, cf. the Anthropic paper section of my post (though TBF I don't remember whether the Anthropic paper is talking about p_theta in terms of logits or corpus-wide statistics; regardless, I've seen similar experiments succeed with logits).
I don't think this test is particularly meaningful for humans, and so my guess is that thinking about answering some version of my questions yourself probably just adds confusion? My proposed questions are designed to depend crucially on situational facts about an LM. There are no immediately analogous situational facts about humans. Though it's likely possible to design a similar-in-spirit test for humans, that would be its own post.