Me: Huh? The user doesn’t try to shut down the AI at all.
For people, at least, there is a strong correlation between “answers to ‘what would you do in situation X?’” and “what you actually do in situation X.” Similarly, we could also measure these correlations for language models so as to empirically quantify the strength of the critique you’re making. If there’s low correlation for relevant situations, then your critique is well-placed.
(There might be a lot of noise, depending on how finicky the replies are relative to the prompt.)
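The proposed measurement could be sketched roughly as follows. This is a hypothetical illustration, not a method from the thread: the binary labels (1 = complies, 0 = refuses) and the per-scenario data are invented, and a real study would elicit both the stated answer and the actual behavior from the same model across many scenarios.

```python
# Hypothetical sketch: quantify agreement between a model's *stated*
# behavior ("what would you do in X?") and its *actual* behavior when
# placed in situation X, as a simple Pearson correlation.
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# Invented example data: per-scenario stated vs. actual outcomes
# (1 = complies, 0 = refuses).
stated = [1, 1, 0, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 1, 0, 1, 1, 0]

r = pearson(stated, actual)
print(f"stated/actual correlation: r = {r:.2f}")  # -> r = 0.50
```

A low r on the scenarios of interest would be evidence for the critique; a high r would weaken it, modulo the prompt-sensitivity noise mentioned above.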
I agree that that’s a useful question to ask and a good frame, though I’m skeptical of the claim of strong correlation in the case of humans (at least in cases where the question is interesting enough to bother asking at all).