“ChatGPT is Bullshit”
Thus the title of a recent paper. It appeared three weeks ago, but I haven’t seen it mentioned on LW yet.
The abstract: “Recently, there has been considerable interest in large language models: machine learning systems which produce human-like text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs. We distinguish two ways in which the models can be said to be bullshitters, and argue that they clearly meet at least one of these definitions. We further argue that describing AI misrepresentations as bullshit is both a more useful and more accurate way of predicting and discussing the behaviour of these systems.”
I’d say that they are wrong when they say an LLM may engage in ‘soft bullshit’: an LLM is simulating agents, who are definitely trying to track truth and the external world, because the truth is that which doesn’t go away, and so it may say false things, but it still cares very much about falsity because it needs to know that for verisimilitude. If you simply say true or false things at random, you get ensnared in your own web and incur prediction error. Any given LLM may be good or bad at doing so (the success of story-based jailbreaks suggests they are still far from ideal), but it’s clear that the prediction loss on large real-world texts written by agents like you or me, who are writing things to persuade each other, like I am writing this comment to manipulate you and everyone reading it, requires tracking latents corresponding to truth, beliefs, errors, etc. You can no more accurately predict the text of this comment without tracking what I believe and what is true than you could accurately predict it while not tracking whether I am writing in English or French. (Like in that sentence right there. You see what I did there? Maybe you didn’t because you’re just skimming and tl;dr, but an LLM needs to!)
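To make the “tracking latents corresponding to truth” claim a bit more concrete, here is a minimal sketch in the spirit of the truth-probing literature: fit a linear probe on a base model’s hidden states to separate true from false statements. The model (gpt2), the toy statement list, and the probe setup are illustrative assumptions on my part rather than anything from the paper; real experiments of this kind use held-out statements and far larger datasets.

```python
# Minimal sketch: does a base LM's activation space linearly separate true from
# false statements? (Toy data, train-set accuracy only; illustrative, not a real
# evaluation.)
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

statements = [
    ("Paris is the capital of France.", 1),
    ("Two plus two equals four.", 1),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Paris is the capital of Germany.", 0),
    ("Two plus two equals five.", 0),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]

def last_token_state(text):
    """Final-layer hidden state of the last token, used as the statement embedding."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state[0, -1]

X = torch.stack([last_token_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

# If some direction in activation space corresponds to "this statement is true",
# a linear probe can find it; with a toy set this small it will fit trivially,
# so treat the printed number as a demo of the setup, not as evidence.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```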
They are right when they say an RLHF-tuned model like ChatGPT engages in ‘hard bullshit’, but they seem to be right for the wrong reasons. Oddly, they seem to avoid any discussion of what makes ‘ChatGPT’ different from the base model ‘GPT’, and the only time they betray any hint that the topic of tuning matters is to defer discussion to the unpromisingly-titled unpublished paper “Still no lie detector for language models: Probing empirical and conceptual roadblocks”, so I can’t tell what they think. We know a lot at this point about GPT and TruthfulQA scaling, and about ChatGPT modeling users and engaging in sycophancy, power-seeking behavior, reward-hacking, and persuasion, so it’s quite irresponsible not to discuss any of that in a section about whether LLMs are engaged in ‘hard bullshit’ explicitly aimed at manipulating user/reader beliefs...
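To spell out the piece they skip: the standard RLHF recipe (InstructGPT-style, roughly as usually described) tunes the chat model against a reward model fitted to human preference ratings, with a KL penalty keeping it close to the base model:

$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\Vert\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

Nothing in that objective mentions truth: $r_\phi$ is a model of what raters approve of, so the tuned model is optimized for whatever raters reward, which can be the appearance of accuracy rather than accuracy itself. That, rather than anything about base-model pretraining, is where a ‘hard bullshit’ case most naturally lives, and is why the GPT-vs-ChatGPT distinction they skip over matters.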
The “Still no lie detector for language models” paper is here: https://arxiv.org/pdf/2307.00175
The paper in the OP seems somewhat related to my post from earlier this year.
I think that’s true, but not very important (in the short term). On Bullshit was first published in 1986, and was a humorous, but useful, categorization of a whole lot of human communication output. ChatGPT is truth-agnostic (except for fine-tuning and output tuning), but still pretty good on a whole lot of general topics. Human choice of what GPT outputs to highlight or use in further communication can be bullshit or truth-seeking, depending on the human intent.
In the long-term, of course, the idea is absolutely core to all the alignment fears and to the expectation that AI will steamroller human civilization because it doesn’t care.