gwern comments on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

gwern 11 Jul 2024 1:50 UTC
5 points
I was surprised there was any signal here, given the “flattened logits” mode-collapse effect, where ChatGPT-4 loses calibration and diversity after RLHF tuning relative to GPT-4-base. But I guess going all the way up to 1.5 restores some range, and so something to measure.