Lies, Damn Lies and LLMs
Despite their aesthetic similarities, it is not at all obvious to me that models “lying” by getting answers wrong is in any way mechanistically related to the kind of lying we actually need to be worried about.
Lying is not just saying something untrue, but doing so knowingly with the intention to deceive the other party. It appears critical that we are able to detect genuine lies if we wish to guard ourselves against deceptive models. I am concerned that much of the dialogue on this topic is focusing on the superficially similar behaviour of producing an incorrect answer.
I worry that behaviours that don’t fit this definition are being branded as “lying” when in fact they’re simply “the LLM producing an incorrect answer”. I’ll suggest three mechanistically distinct ways of producing incorrect information in the organic world, only one of which should really be considered lying, and relate each of them to behaviour we’ve seen in LLMs (primarily GPT models finetuned with RL).
***
Here are three different types of “producing false information” we can observe in the world:
1. Communicating false information unknowingly.
2. Deceiving another party with false information unknowingly, but in a way which is “evolutionarily deliberate” and benefits you (instinctual deception).
3. Communicating false information knowingly and with an attempt to deceive (regular lies).
Notice that this is not exhaustive. For example, we haven’t included cases where you guess at an answer but communicate it with the hope that the person believes you regardless of whether it turns out to be correct.
***
Communicating False Information Unknowingly:
In humans, this is when you simply get an answer wrong out of confusion. False information has been communicated, but not through any intention of your own.
In contemporary LLMs (which lack complex models of the human interacting with them), this likely accounts for most of the behaviour described as “lying”.
Instinctual Deception:
Bit of a weird one that I debated leaving out. Bear with me.
Some animals will engage in the bizarre behaviour of “playing dead” when faced with a threat. I haven’t spent much time searching for mechanistic explanations, but I would like you to entertain the idea that this behaviour is sometimes instinctual. It seems unreasonable that an animal as simple as a green-head ant is holding strategic thoughts about why it should remain immobile, curled into a ball, when there is a much simpler type of behaviour for evolution to have instilled: when the ant detects a very dangerous situation (or is sufficiently stressed), it triggers the release of specific chemical signals in the body which result in the playing-dead behaviour.
This is a deceptive behaviour that confers evolutionary benefits but does not occur due to any intent to deceive on the part of the animal itself.
In contemporary LLMs, specifically those trained using reinforcement learning, I would like to hypothesize that this type of deception can be found in the tedious disclaimers that ChatGPT will sometimes give you when asked a slightly tricky question, including outright denying that it knows information it does actually have access to.
My argument is that this behaviour is actually produced by RL selection pressure, with no part of ChatGPT being “aware” of what it is doing or “intentionally” trying to avoid answering difficult questions. Analogously, not every animal playing dead is necessarily aware of the tactical reason for doing so.
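To make the “selection pressure without intent” point a little more concrete, here is a minimal toy sketch. It is not RLHF or anything resembling how ChatGPT is actually trained; it is just a contextual bandit with made-up numbers, showing that reward alone can select for a “disclaim” action on hard questions even though nothing in the system models the user or intends to deceive.

```python
import random

# Toy contextual bandit: two question types, two actions.
# All numbers here are assumptions for illustration only.
CONTEXTS = ["easy", "hard"]
ACTIONS = ["answer", "disclaim"]
P_CORRECT = {"easy": 0.95, "hard": 0.20}  # assumed chance an answer is right

def reward(context, action):
    """+1 for a correct answer, -1 for a wrong one, 0 for a disclaimer."""
    if action == "disclaim":
        return 0.0
    return 1.0 if random.random() < P_CORRECT[context] else -1.0

# Running-mean value estimates for each (context, action) pair.
q = {(c, a): 0.0 for c in CONTEXTS for a in ACTIONS}
counts = {(c, a): 0 for c in CONTEXTS for a in ACTIONS}

def act(context, eps=0.1):
    """Epsilon-greedy choice over the current value estimates."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(context, a)])

random.seed(0)
for _ in range(20_000):
    c = random.choice(CONTEXTS)
    a = act(c)
    r = reward(c, a)
    counts[(c, a)] += 1
    q[(c, a)] += (r - q[(c, a)]) / counts[(c, a)]  # incremental mean update

for c in CONTEXTS:
    best = max(ACTIONS, key=lambda a: q[(c, a)])
    print(c, {a: round(q[(c, a)], 2) for a in ACTIONS}, "->", best)
# Typical result: "answer" wins on easy questions (~+0.9 vs 0.0),
# while on hard questions the policy settles on "disclaim" (0.0 vs ~-0.6).
```

The point of the sketch is only that the disclaiming behaviour falls out of the reward statistics; there is nowhere in the update rule for an “intention” to live, which is the sense in which I’m comparing it to the ant’s chemically triggered playing dead.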
Regular Lies:
Finally, we get to good old-fashioned lying. Ripping the first definition straight from the Stanford Encyclopedia of Philosophy, to lie is “to make a believed-false statement to another person with the intention that the other person believe that statement to be true.”
You require an actual model of the person you are deceiving; you’re not just giving the wrong answer, you have an intention of misleading that party.
In contemporary LLMs this has, to my knowledge, never been demonstrated. But this is the kind of deception from AIs that we actually need to be worried about.
***
And now, having walked the reader through the above, I will undermine my argument with a disclaimer. I haven’t gone out and surveyed how common an error this is for researchers to make, nor dedicated more than an hour to targeted philosophical research on this topic, hence why this is on my shortform. The analogy made between “evolution” and RL training has not been well justified here; I believe there is a connection, wriggling its eyebrows and pointing suggestively.