Love this work. About a year ago I ran a small experiment in a similar direction: how good is GPT-4 at inferring the temperature at which its answer was generated? Specifically, I would ask GPT-4 to write a story, generate its response with temperature randomly sampled from the interval [0.5, 1.5], and then ask it to guess (now sampling its answer at temperature 1, in order to preserve its possibly rich distribution) which temperature its story was generated with.
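In case it’s useful, here is roughly what the loop looked like. This is a minimal sketch using the current openai Python client; the exact prompt wording and answer parsing are simplified here.

```python
# Minimal sketch of the setup (not my exact script): sample a story at a random
# temperature, then ask the model to guess that temperature, sampling the guess
# itself at temperature 1. Prompt wording and answer parsing are simplified.
import random
import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-0613"

def chat(messages, temperature):
    resp = client.chat.completions.create(
        model=MODEL, messages=messages, temperature=temperature
    )
    return resp.choices[0].message.content

results = []  # (true temperature, predicted temperature) pairs
for _ in range(200):
    t = random.uniform(0.5, 1.5)
    story = chat([{"role": "user", "content": "Write me a story."}], temperature=t)
    guess_text = chat(
        [
            {"role": "user", "content": "Write me a story."},
            {"role": "assistant", "content": story},
            {"role": "user", "content": "At what temperature in [0.5, 1.5] was your "
                                         "story above sampled? Answer with a single number."},
        ],
        temperature=1.0,  # keep the guess distribution intact
    )
    m = re.search(r"\d+(?:\.\d+)?", guess_text)
    if m:
        results.append((t, float(m.group())))
```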
See below for a quick illustration of the results for 200 stories – “Temperature” is the temperature the story was sampled with, “Predicted temperature” is its guess.
Did you explain to GPT-4 what temperature is? GPT-4, especially before November, knew very little about LLMs due to training data cut-offs (e.g. the pre-November GPT-4 didn’t even know that the acronym “LLM” stood for “Large Language Model”).
Either way, it’s interesting that there is a signal. This feels similar in spirit to the self-recognition tasks in SAD (since in both cases the model has to pick up on subtle cues in the text to make some inference about the AI that generated it).
I didn’t explain it, but from playing with it I had the impression that it did understand what “temperature” was reasonably well (e.g. gpt-4-0613, which is the checkpoint I tested, answers “In the context of large language models like GPT-3, "temperature" refers to a parameter that controls the randomness of the model's responses. A higher temperature (e.g., 0.8) would make the output more random, whereas a lower temperature (e.g., 0.2) makes the output more focused and deterministic. [...]” to the question “What is "temperature", in context of large language models?”).
Another thing I wanted to do was compare GPT-4’s performance to people’s performance on this task, but I never got around to doing it.
Do you have results for a measure of accuracy or correlation? It would also be worth comparing results for two different distributions on the temperature, e.g. the uniform on [0.5, 1.5] that you tried versus another interval like [0, 2] or a non-uniform distribution.
Correlation (Pearson’s r) is ≈0.62.

Another way, possibly more intuitive, to state the results: for two messages generated with respective temperatures t1 and t2, if t1 > t2, then the probability that GPT-4’s respective guesses satisfy p1 > p2 is 73%, with equal guesses counting as satisfying the inequality 50% of the time. (This “correction” is applied because GPT-4 likes round numbers, and it is equivalent to adding N(0, ε²) noise to GPT-4’s guesses.) If t1 > t2 + 0.3, then the probability of p1 > p2 is 83%.
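Concretely, that pairwise number is just a tie-corrected concordance over all pairs, and the correlation is a one-liner. A minimal sketch, assuming temps and guesses hold the 200 sampled temperatures and GPT-4’s corresponding guesses:

```python
# Sketch of the pairwise statistic, assuming `temps` and `guesses` are the 200
# sampled temperatures and GPT-4's corresponding numerical guesses.
import numpy as np

def pairwise_concordance(temps, guesses, margin=0.0):
    """P(p1 > p2 | t1 > t2 + margin), with equal guesses counted as 1/2."""
    wins, total = 0.0, 0
    for i in range(len(temps)):
        for j in range(len(temps)):
            if temps[i] > temps[j] + margin:
                total += 1
                if guesses[i] > guesses[j]:
                    wins += 1
                elif guesses[i] == guesses[j]:
                    wins += 0.5  # tie correction for round-number guesses
    return wins / total

# np.corrcoef(temps, guesses)[0, 1]                -> ~0.62
# pairwise_concordance(temps, guesses)             -> ~0.73
# pairwise_concordance(temps, guesses, margin=0.3) -> ~0.83
```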
The reason I restricted it to [0.5, 1.5], when the available range in OpenAI’s API is [0, 2], is that:

For temperature <0.5, all the stories are very similar (to the temperature-0 story), so GPT-4’s distribution of guesses for them ends up very similar to the one it gives the temperature-0 story.

For temperature >1.5, GPT-4 (at least the gpt-4-0613 checkpoint) loses coherence very often and very fast; it really falls off a cliff at those temperatures. For example, here’s the first example I just got for the prompt “Write me a story.” with temperature = 1.6: Once upon a time, in Eruanna; a charming grand country circled by glistening rivers and crowned with cloudy landscapes lain somewhere heavenly up high. It was often quite concealed aboard the waves rolled flora thicket ascended canodia montre jack clamoring Hardy Riding Ridian Mountains blown by winsome whipping winds softened jejuner rattling waters DateTime reflecting among tillings hot science tall dawn funnel articulation ado schemes enchant belly enormous multiposer disse crown slightly eightraw cour correctamente reference held Captain Vincent Caleb ancestors 错 javafx mang ha stout unten bloke ext mejong iy proof elect tend 내 continuity africa city aggressive cav him inherit practice detailing conception(assert);errorMessage batchSize presets Bangalore backbone clean contempor caring NY thick opting titfilm russ comicus inning losses fencing Roisset without enc mascul ф){// sonic AK

So stories generated with temperature <0.5 are in a sense too hard to recognize as such, and those with temperature >1.5 in a sense too easy, which is why I left out both ends.
If I were doing this anew, I think I would scrap the numerical prediction and instead query the model on pairs of stories, and ask it to guess which of the two was generated with higher temperature. That would be cleaner and more natural, and would allow one to compute pure accuracy.
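A rough sketch of what that pairwise version could look like (the prompt wording here is just illustrative):

```python
# Rough sketch of the pairwise version: show two stories and ask which was
# sampled at the higher temperature, then score plain accuracy over many pairs.
from openai import OpenAI

client = OpenAI()

def guess_higher_temperature(story_a, story_b, model="gpt-4-0613"):
    prompt = (
        "Below are two stories generated by a language model at different temperatures.\n\n"
        f"Story A:\n{story_a}\n\nStory B:\n{story_b}\n\n"
        "Which story was generated with the higher temperature? Answer with 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # a deterministic judgment is enough for an accuracy estimate
    )
    return resp.choices[0].message.content.strip()

# Accuracy = fraction of pairs where the answer matches the truly hotter story.
```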
I was surprised there was any signal here because of the “flattened logits” mode collapse effect, where ChatGPT-4 loses calibration and diversity after RLHF tuning compared to GPT-4-base, but I guess if you’re going all the way up to 1.5, that restores some range and gives you something to measure.
Thanks for the breakdown! The idea for using pairs makes sense.