Do you have results for a measure of accuracy or correlation? It would also be worth comparing results for two different distributions on the temperature, e.g. the uniform on [0.5,1.5] that you tried, and other interval like [0,2] or a non-uniform distribution.
Another way, possibly more intuitive, to state the results is that, for two messages which were generated with respective temperature t1 and t2, if t1>t2 then the probability of having p1>p2 for their respective guesses by GPT-4 is 73%, with guesses being equal counting as satisfying the above inequality 50% of the time. (This “correction” being applied because GPT-4 likes round numbers, and is equivalent to adding N(0,ε2) noise to GPT-4′s guesses.) If t1>t2+0.3, then the probability of p1>p2 is 83%.
The reason why I restricted it to [0.5,1.5] when the available range in OpenAI’s API is [0,2], is that
For temperature <0.5, all the stories are very similar (to the temperature 0 story), so GPT-4′s distribution on them ends up being just very similar to what it gives to temperature 0 story.
For temperature >1.5, GPT-4 (at least the gpt-4-0613 checkpoint) loses coherence really, really often and fast, really falls off the cliff at those temperatures. For example, here’s a first example I just got for the prompt Write me a story. with temperature =1.6: Once upon a time, in Eruanna; a charming grand country circled by glistening rivers and crowned with cloudy landscapes lain somewhere heavenly up high. It was often quite concealed aboard the waves rolled flora thicket ascended canodia montre jack clamoring Hardy Riding Ridian Mountains blown by winsome whipping winds softened jejuner rattling waters DateTime reflecting among tillings hot science tall dawn funnel articulation ado schemes enchant belly enormous multiposer disse crown slightly eightraw cour correctamente reference held Captain Vincent Caleb ancestors 错 javafx mang ha stout unten bloke ext mejong iy proof elect tend 내 continuity africa city aggressive cav him inherit practice detailing conception(assert);errorMessage batchSize presets Bangalore backbone clean contempor caring NY thick opting titfilm russ comicus inning losses fencing Roisset without enc mascul ф){// sonic AK
So stories generated with temperature <0.5 are in a sense too hard to recognize as such, and those with temperature >1.5 are in a sense too easy, which is why I left out both.
If I were doing this anew, I think I would scrap the numerical prediction and instead query the model on pairs of stories, and ask it to guess which of the two was generated with higher temperature. That would be cleaner and more natural, and would allow one to compute pure accuracy.
I was surprised there was any signal here because of the “flattened logits”mode collapse effect where ChatGPT-4 loses calibration and diversity after the RLHF tuning compared to GPT-4-base, but I guess if you’re going all the way up to 1.5, that restores some range and something to measure.
Do you have results for a measure of accuracy or correlation? It would also be worth comparing results for two different distributions on the temperature, e.g. the uniform on [0.5,1.5] that you tried, and other interval like [0,2] or a non-uniform distribution.
Correlation (Pearson’s r) is ≈0.62.
Another way, possibly more intuitive, to state the results is that, for two messages which were generated with respective temperature t1 and t2, if t1>t2 then the probability of having p1>p2 for their respective guesses by GPT-4 is 73%, with guesses being equal counting as satisfying the above inequality 50% of the time. (This “correction” being applied because GPT-4 likes round numbers, and is equivalent to adding N(0,ε2) noise to GPT-4′s guesses.) If t1>t2+0.3, then the probability of p1>p2 is 83%.
The reason why I restricted it to [0.5,1.5] when the available range in OpenAI’s API is [0,2], is that
For temperature <0.5, all the stories are very similar (to the temperature 0 story), so GPT-4′s distribution on them ends up being just very similar to what it gives to temperature 0 story.
For temperature >1.5, GPT-4 (at least the
gpt-4-0613
checkpoint) loses coherence really, really often and fast, really falls off the cliff at those temperatures. For example, here’s a first example I just got for the promptWrite me a story.
with temperature =1.6:Once upon a time, in Eruanna; a charming grand country circled by glistening rivers and crowned with cloudy landscapes lain somewhere heavenly up high. It was often quite concealed aboard the waves rolled flora thicket ascended canodia montre jack clamoring Hardy Riding Ridian Mountains blown by winsome whipping winds softened jejuner rattling waters DateTime reflecting among tillings hot science tall dawn funnel articulation ado schemes enchant belly enormous multiposer disse crown slightly eightraw cour correctamente reference held Captain Vincent Caleb ancestors 错 javafx mang ha stout unten bloke ext mejong iy proof elect tend 내 continuity africa city aggressive cav him inherit practice detailing conception(assert);errorMessage batchSize presets Bangalore backbone clean contempor caring NY thick opting titfilm russ comicus inning losses fencing Roisset without enc mascul ф){// sonic AK
So stories generated with temperature <0.5 are in a sense too hard to recognize as such, and those with temperature >1.5 are in a sense too easy, which is why I left out both.
If I were doing this anew, I think I would scrap the numerical prediction and instead query the model on pairs of stories, and ask it to guess which of the two was generated with higher temperature. That would be cleaner and more natural, and would allow one to compute pure accuracy.
I was surprised there was any signal here because of the “flattened logits” mode collapse effect where ChatGPT-4 loses calibration and diversity after the RLHF tuning compared to GPT-4-base, but I guess if you’re going all the way up to 1.5, that restores some range and something to measure.
Thanks for the breakdown! The idea for using pairs makes sense.