I’d say that they are wrong when they say an LLM may engage in ‘soft bullshit’: an LLM is simulating agents, who are definitely trying to track truth and the external world, because the truth is that which doesn’t go away. So it may say false things, but it still cares very much about falsity, because it needs to know what is false for verisimilitude. If you simply say true or false things at random, you get ensnared in your own web and incur prediction error. Any given LLM may be good or bad at doing so (the success of story-based jailbreaks suggests they are still far from ideal), but it’s clear that the prediction loss on large real-world texts written by agents like you or me, who are writing things to persuade each other, as I am writing this comment to manipulate you and everyone reading it, requires tracking latents corresponding to truth, beliefs, errors, etc. You can no more accurately predict the text of this comment without tracking what I believe and what is true than you could accurately predict it while not tracking whether I am writing in English or French. (Like in that sentence right there. You see what I did there? Maybe you didn’t, because you’re just skimming and tl;dr, but an LLM needs to!)
They are right when they say an RLHF-tuned model like ChatGPT engages in ‘hard bullshit’, but they seem to be right for the wrong reasons. Oddly, they avoid any discussion of what makes ‘ChatGPT’ different from the base model ‘GPT’, and the only time they betray any hint that the topic of tuning matters is to defer discussion to the unpromisingly-titled unpublished paper “Still no lie detector for language models: Probing empirical and conceptual roadblocks”, so I can’t tell what they think. We know a lot at this point about GPT and TruthfulQA scaling, and about ChatGPT modeling users and engaging in sycophancy, power-seeking behavior, reward-hacking, and persuasion, so it’s quite irresponsible not to discuss any of that in a section about whether LLMs are engaged in ‘hard bullshit’ explicitly aimed at manipulating user/reader beliefs...
The “Still no lie detector for language models” paper is here: https://arxiv.org/pdf/2307.00175
The paper in the OP seems somewhat related to my post from earlier this year.