Correlation (Pearson’s r) is ≈0.62.
Another way, possibly more intuitive, to state the results: for two messages generated at respective temperatures t1 and t2, with respective GPT-4 guesses p1 and p2, if t1 > t2 then the probability that p1 > p2 is 73%, with equal guesses counted as satisfying the inequality 50% of the time. (This “correction” is applied because GPT-4 likes round numbers; it is equivalent to adding N(0, ε²) noise to GPT-4’s guesses.) If t1 > t2 + 0.3, then the probability that p1 > p2 is 83%.
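For concreteness, here is a minimal sketch (in Python) of how these statistics can be computed from the raw data. The names `temps` and `guesses` are placeholders for the true temperatures and GPT-4’s guesses, which are not given in this comment:

```python
# Minimal sketch of the pairwise statistic described above: for every pair
# of stories, check whether the higher-temperature one received the higher
# guess, counting tied guesses as half a success. `temps` and `guesses`
# are assumed to be equal-length sequences of floats.
from itertools import combinations
from statistics import correlation  # Pearson's r (Python 3.10+)

def pairwise_agreement(temps, guesses, margin=0.0):
    wins = total = 0
    for i, j in combinations(range(len(temps)), 2):
        if temps[i] == temps[j]:
            continue
        hi, lo = (i, j) if temps[i] > temps[j] else (j, i)
        if temps[hi] - temps[lo] <= margin:
            continue  # only count pairs separated by more than `margin`
        total += 1
        if guesses[hi] > guesses[lo]:
            wins += 1
        elif guesses[hi] == guesses[lo]:
            wins += 0.5  # ties split evenly, as described above
    return wins / total

# print(correlation(temps, guesses))                     # ≈ 0.62 here
# print(pairwise_agreement(temps, guesses))              # ≈ 0.73 here
# print(pairwise_agreement(temps, guesses, margin=0.3))  # ≈ 0.83 here
```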
The reason I restricted the range to [0.5, 1.5], when the available range in OpenAI’s API is [0, 2], is that:
- For temperature <0.5, all the stories are very similar (to the temperature-0 story), so GPT-4’s distribution of guesses on them ends up very similar to what it gives the temperature-0 story.
- For temperature >1.5, GPT-4 (at least the gpt-4-0613 checkpoint) loses coherence very often and very quickly; it really falls off a cliff at those temperatures. For example, here’s the first sample I just got for the prompt “Write me a story.” at temperature 1.6:
  Once upon a time, in Eruanna; a charming grand country circled by glistening rivers and crowned with cloudy landscapes lain somewhere heavenly up high. It was often quite concealed aboard the waves rolled flora thicket ascended canodia montre jack clamoring Hardy Riding Ridian Mountains blown by winsome whipping winds softened jejuner rattling waters DateTime reflecting among tillings hot science tall dawn funnel articulation ado schemes enchant belly enormous multiposer disse crown slightly eightraw cour correctamente reference held Captain Vincent Caleb ancestors 错 javafx mang ha stout unten bloke ext mejong iy proof elect tend 내 continuity africa city aggressive cav him inherit practice detailing conception(assert);errorMessage batchSize presets Bangalore backbone clean contempor caring NY thick opting titfilm russ comicus inning losses fencing Roisset without enc mascul ф){// sonic AK
So stories generated with temperature <0.5 are in a sense too hard to recognize as such, and those with temperature >1.5 are in a sense too easy, which is why I left out both.
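A minimal sketch of drawing such a sample, assuming the v1 openai Python client and an API key in the environment; the model and prompt are the ones named above:

```python
# Minimal sketch: draw one story at a chosen temperature.
# Assumes the `openai` Python package (v1 client) with OPENAI_API_KEY set;
# the model name and prompt are taken from the comment above.
from openai import OpenAI

client = OpenAI()

def sample_story(temperature: float, model: str = "gpt-4-0613") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write me a story."}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

print(sample_story(1.6))  # expect frequent incoherence at this temperature
```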
If I were doing this anew, I think I would scrap the numerical prediction and instead query the model on pairs of stories, and ask it to guess which of the two was generated with higher temperature. That would be cleaner and more natural, and would allow one to compute pure accuracy.
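A sketch of what that pairwise query could look like, reusing the client from the sketch above; the prompt wording and the single-letter answer format are illustrative assumptions, not part of the original setup:

```python
# Illustrative sketch of the proposed pairwise design: show GPT-4 two
# stories and ask which was sampled at the higher temperature; accuracy
# is then just the fraction of correct picks. The prompt wording and
# answer parsing here are assumptions.
def guess_hotter(story_a: str, story_b: str, model: str = "gpt-4-0613") -> str:
    prompt = (
        "Here are two stories generated by a language model at different "
        "temperatures.\n\nStory A:\n" + story_a + "\n\nStory B:\n" + story_b +
        "\n\nWhich story was generated at the higher temperature? "
        "Answer with a single letter, A or B."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return resp.choices[0].message.content.strip()[:1].upper()  # "A" or "B"
```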
I was surprised there was any signal here, because of the “flattened logits” mode-collapse effect, where ChatGPT-4 loses calibration and diversity after RLHF tuning compared to GPT-4-base; but I guess if you’re going all the way up to 1.5, that restores some range and gives you something to measure.
Thanks for the breakdown! The idea for using pairs makes sense.