Scattered thoughts on what it means for an LLM to believe
I had a 2-hour mini-sprint with Max Heitmann (a co-founder of Aether) and Miles Kodama about whether large language models (LLMs) or LLM agents have beliefs, and the relevance of this to AI safety.
The conversation was mostly free-form, with the three of us bouncing ideas and resources off each other. This is my attempt at recalling the key discussion points. I have certainly missed many, and the Aether team plan to write a thorough summary from all the mini-sprints they organised.
I write this for three reasons. First, as a way to clarify my own thinking. Second, many of the ideas and resources we shared were new to at least one of us, so there is a good chance this will be useful for many LessWrong readers. Third, you might be able to contribute to the early stages of Aether and their strategy.
What is a belief?
Max provided three definitions of beliefs:
1. Representations. A system believes P if there is an explicit representation of ‘P is true’ somewhere inside the system. (A toy sketch of one way this is sometimes operationalised appears after this list.)
2. Behaviour. A system believes P if its behaviour is consistent with that of something that believes P. (This is known as dispositionalism.)
3. Predictive power. A system believes P if attributing the belief P is the best predictive explanation of the system’s behaviour.
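To give a feel for Definition 1, here is a toy sketch of how interpretability work sometimes operationalises ‘explicit representation’: train a linear probe on a model’s hidden activations to predict whether the statement being processed is true. This is my own illustration rather than something from the sprint, and the activations below are synthetic stand-ins for real hidden states.

```python
# Toy linear-probe sketch for Definition 1 (explicit representation).
# The "activations" are synthetic stand-ins: random vectors nudged along a
# made-up "truth direction". Real work would extract activations from a model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 64                                  # assumed hidden-state dimension
truth_direction = rng.normal(size=dim)    # pretend direction encoding 'P is true'

n = 200
labels = rng.integers(0, 2, size=n)       # 1 = true statement, 0 = false
activations = rng.normal(size=(n, dim)) + np.outer(2 * labels - 1, truth_direction)

probe = LogisticRegression(max_iter=1000).fit(activations[:150], labels[:150])
print("held-out probe accuracy:", probe.score(activations[150:], labels[150:]))
```

If a probe like this can reliably read the belief off real activations, that is some evidence for an explicit representation in the sense of Definition 1; Definitions 2 and 3 instead look at the system from the outside.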
Many of our discussions during the two hours boiled down to what we considered to count as a belief. This is something I am still confused about. To help explain my confusion, I came up with the example statement ‘The Eiffel Tower is in Paris’ and asked when a system carrying this information counts as believing it.
System 1: The Eiffel Tower itself. The Eiffel Tower itself contains the information that ‘The Eiffel Tower is in Paris’ (the properties of the physical object are what determine the truth of the statement), but the Eiffel Tower ‘obviously’ has no beliefs.
System 2: An LLM. The information ‘The Eiffel Tower is in Paris’ is encoded in a (capable enough) foundation model, but we did not agree on whether this is a belief. Max said the LLM cannot act on this information so it cannot have beliefs, whereas I think that the ability to correctly complete the sentence “The Eiffel Tower is in the city of ___” corresponds to the LLM having some kind of belief. (A minimal sketch of this completion check appears after this list.)
System 3: An LLM agent. Suppose there is a capable LLM-based agent that can control my laptop. I ask it to book a trip to the Eiffel Tower, and it then books travel and hotels in Paris. We all agreed that for this agent the information ‘The Eiffel Tower is in Paris’ is a belief.
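Here is the completion check I have in mind for System 2, as a minimal sketch using the Hugging Face transformers library. The choice of model (gpt2) and the exact prompt wording are my own assumptions for illustration; the point is only to see whether the model concentrates next-token probability on ‘ Paris’.

```python
# Minimal completion check: does a small language model put most of its
# next-token probability on " Paris"? Model choice and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Eiffel Tower is in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

top_probs, top_ids = probs.topk(5)
for p, tok_id in zip(top_probs, top_ids):
    # expect ' Paris' at or near the top of the list
    print(f"{tokenizer.decode([int(tok_id)])!r}: {p.item():.3f}")
```

Whether this counts as a belief is exactly what we disagreed on: it is belief-consistent behaviour in the sense of Definition 2, but the model is not acting on the information in any richer sense.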
Does it matter if an LLM or LLM agent has beliefs or not?
Going into the discussion, I was primed to think not, as I had recently heard CGP Grey’s comparison of AI systems to biological viruses or memes. The relevant quote is:
This is why I prefer the biological weapon analogy—no one is debating the intent of a lab-created smallpox strain. No one wonders if the smallpox virus is “thinking” or “does it have any thoughts of its own?”. Instead, people understand that it doesn’t matter. Smallpox germs, in some sense, “want” something: they want to spread, they want to reproduce, they want to be successful in the world, and are competing with other germs for space in human bodies. They’re competing for resources. The fact that they’re not conscious doesn’t change any of that.
So I feel like these AI systems act as though they are thinking, and fundamentally it doesn’t really matter whether they are actually thinking or not because externally the effect on the world is the same either way. That’s my main concern here: I think these systems are real dangerous because it is truly autonomous in ways that other tools we have ever built are not.
I think this perspective puts more weight on Definition 2 of beliefs (dispositionalism) than the other two definitions. [Edit: Max Heitmann in the comments says this is more in line with Definition 3. On reflection I actually do not fully understand the distinction between 2 and 3.]
But why are the Aether team organising these mini-sprints? The short summary is that deception is a big risk in future AI systems, and they believe that nailing down what it means for LLMs and LLM agents to believe something is an important step towards detecting and intervening on deceptive systems. [EDIT: See RohanS’s comment for clarification: the team is not only interested in beliefs but in analyzing classic AI safety concepts in the context of foundation model agents.]
This intuitively sounds reasonable, but I am persuaded by Nate Soares’ Deep Deceptiveness argument (which I think is a special case of the memetic argument CGP Grey is making above):
Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some “deception” property, it’s that (barring some great alignment feat) it’s a fact about the world rather than the AI that deceiving you forwards its objectives, and you’ve built a general engine that’s good at taking advantage of advantageous facts in general.
As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.
Nate Soares goes into detail on a fictional but plausible story of what this might explicitly look like, with the AI system taking actions that would not be identified as deceptive in isolation; only in retrospect, when we see the full sequence of actions, would we describe it as deceptive. In particular, if we tried inspecting the system’s beliefs, we would not find an inconsistency between its ‘true beliefs’ and its behaviour / ‘stated beliefs’.
Nevertheless, I still think it is valuable to understand beliefs, because it should reduce the odds of “explicit deception” happening.
Can CoT reveal beliefs?
This question arose in our discussions. There are two reasons I am skeptical.
First, humans’ stated beliefs often do not match the subconscious beliefs that actually determine our behaviour. This is likely explained in various places, but I know the idea from the book The Elephant in the Brain. It makes the case, for example, that people (in the US) put significant resources into healthcare not because it will increase the health of their loved ones but because it is how you show you care about your loved ones.
Second, there is some evidence that CoT does not help the largest LLMs. I do not remember the paper, but there is research showing that CoT seems to help medium-sized models, but not small or large ones. One proposed story is that small models are too dumb to make use of step-by-step thinking, while large models have reached such alien levels of intelligence that having to explain their reasoning in human language is not helpful.
[EDIT: RohanS in the comments gives OpenAI o1 as an excellent counter-example to the claim that CoT is not helpful for large LLMs.]
Additionally, when trying to search for the CoT paper above, I found this paper on arXiv, which finds situations where the CoT is just rationalizing a decision the LLM has already made. If you look at papers which cite this paper, you will find other research in this vein.
[EDIT: RohanS highly recommends the post The Case of CoT Unfaithfulness is Overstated.]
Emergent beliefs
Because many AI systems consist of smaller sub-systems put together, the question arose of how the beliefs of the full system compare to the beliefs of the individual sub-systems. In particular, are the beliefs of the full system equal to the union or to the intersection of the beliefs of the sub-systems? Three interesting observations came up.
Beliefs of agents are not closed under logical deduction. One argument is that an agent has finite memory while true sentences can be arbitrarily long, so there are sentences that follow from what the agent believes but that the agent cannot even represent, let alone believe. Presumably there are more sophisticated and insightful arguments, but we did not go into them.
Individual sub-systems can all share a belief that the full system lacks. Miles’s example is that there are often situations in companies or teams in which everybody privately believes the same thing (e.g. that the project will not succeed) but is unwilling to state it out loud, so the group as a whole behaves as if it does not believe that thing (e.g. by continuing with the project).
The system can believe things that no individual sub-system believes. For example, if sub-system 1 believes ‘A’ and sub-system 2 believes ‘A implies B’, it is only the full system, by combining that knowledge, that believes ‘B’. (A toy sketch of this appears below.)
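Here is a toy sketch of that last observation, with made-up propositions: neither sub-system believes ‘B’ on its own, and neither does the bare union of their belief sets, but one step of modus ponens over the pooled beliefs yields it.

```python
# Toy illustration: the combined system derives a belief ('B') that neither
# sub-system holds. Implications are stored as ("implies", antecedent, consequent).

def one_deduction_step(beliefs):
    """Return the belief set plus anything derivable by a single modus ponens step."""
    derived = set(beliefs)
    for b in beliefs:
        if isinstance(b, tuple) and b[0] == "implies" and b[1] in beliefs:
            derived.add(b[2])
    return derived

subsystem_1 = {"A"}                       # believes A
subsystem_2 = {("implies", "A", "B")}     # believes A implies B

pooled = subsystem_1 | subsystem_2
full_system = one_deduction_step(pooled)

print("B" in subsystem_1)    # False
print("B" in subsystem_2)    # False
print("B" in pooled)         # False: the plain union does not yet contain B
print("B" in full_system)    # True: only the combined system, after deduction, believes B
```

This also suggests the union-versus-intersection framing is too coarse: once the sub-systems’ knowledge interacts, the full system’s beliefs can strictly exceed the union.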
Final thoughts
As the title says, these are scattered thoughts, and I am not sure how useful they will be for the average LessWrong reader. I felt it was worth sharing because each of us was presenting examples and resources that the others were not aware of, so presumably this summary contains information that is new to readers, even if none of the ideas are original.
If you found any of this interesting or you have any thoughts, please share in comments and/or reach out to the Aether team. Their post on the EA forum includes links to expression of interest forms and an email address where you can get in touch.