Scattered thoughts on what it means for an LLM to believe
I had a 2-hour mini-sprint with Max Heitmann (a co-founder of Aether) and Miles Kodama about whether large language models (LLMs) or LLM agents have beliefs, and the relevance of this to AI safety.
The conversation was mostly free-form, with the three of us bouncing ideas and resources off each other. This is my attempt at recalling the key discussion points. I have certainly missed many of them, and the Aether team plans to write a thorough summary from all the mini-sprints they organised.
I write this for three reasons. First, as a way to clarify my own thinking. Second, many of the ideas and resources we shared were new to one another, so there is a good chance this will be useful for many LessWrong readers. Third, you might be able to contribute to the early stages of Aether and their strategy.
What is a belief?
Max provided three definitions of beliefs:
Representations. A system believes P if there is an explicit representation of ‘P is true’ inside the system.
Behaviour. A system believes P if its behaviour is consistent with that of something that believes P. (This is known as dispositionalism.)
Predictive power. A system believes P if ascribing the belief P to it gives the best predictive explanation of the system.
Many of our discussions during the two hours boiled down to what we did or did not consider to be a belief. This is something I am still confused about. To help explain my confusion, I came up with the example statement ‘The Eiffel Tower is in Paris’ and asked when the information ‘The Eiffel Tower is in Paris’ corresponds to a belief.
System 1: The Eiffel Tower itself. The Eiffel Tower itself contains the information that ‘The Eiffel Tower is in Paris’ (the properties of the physical object are what determine the truth of the statement), but the Eiffel Tower ‘obviously’ has no beliefs.
System 2: An LLM. The information ‘The Eiffel Tower is in Paris’ is encoded in a (capable enough) foundation model, but we did not agree on whether this counts as a belief. Max said the LLM cannot act on this information, so it cannot have beliefs, whereas I think that the ability to correctly complete the sentence “The Eiffel Tower is in the city of…” corresponds to the LLM having some kind of belief (a rough sketch of such a probe follows this list).
System 3: An LLM agent. Suppose there is a capable LLM-based agent that can control my laptop. I ask it to book a trip to the Eiffel Tower, and it then books travel and hotels in Paris. We all agreed that the information ‘The Eiffel Tower is in Paris’ is a belief for the agent.
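To make the System 2 probe concrete, here is a minimal sketch of my own (not something we wrote during the sprint) that checks how much probability a base model assigns to ‘Paris’ as the next token. It assumes the Hugging Face transformers library and uses GPT-2 purely as a stand-in; nothing hinges on the particular model.

```python
# Minimal sketch: probe a base LLM's "belief" by checking the probability it
# assigns to " Paris" completing the sentence. GPT-2 is a stand-in model here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Eiffel Tower is in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

paris_id = tokenizer.encode(" Paris")[0]  # leading space matters for GPT-2's tokenizer
print(f"P(' Paris' | prompt) = {probs[paris_id].item():.3f}")
```

Of course, a high next-token probability here is at best weak evidence of a belief in the dispositional sense; Max makes a similar point in the comments below.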
Does it matter if an LLM or LLM agent has beliefs or not?
Going into the discussion, I was primed to think not, as I had recently heard CGP Grey’s comparison of AI systems to biological viruses or memes. The relevant quote is:
This is why I prefer the biological weapon analogy—no one is debating the intent of a lab-created smallpox strain. No one wonders if the smallpox virus is “thinking” or “does it have any thoughts of its own?”. Instead, people understand that it doesn’t matter. Smallpox germs, in some sense, “want” something: they want to spread, they want to reproduce, they want to be successful in the world, and are competing with other germs for space in human bodies. They’re competing for resources. The fact that they’re not conscious doesn’t change any of that.
So I feel like these AI systems act as though they are thinking, and fundamentally it doesn’t really matter whether they are actually thinking or not, because externally the effect on the world is the same either way. That’s my main concern here: I think these systems are really dangerous because they are truly autonomous in ways that no other tools we have ever built are.
I think this perspective puts more weight on Definition 2 of beliefs (dispositionalism) than on the other two definitions. [Edit: Max Heitmann says in the comments that this is more in line with Definition 3. On reflection, I do not fully understand the distinction between 2 and 3.]
But why are the Aether team organising these mini-sprints? The short summary is that deception is a big risk in future AI systems, and they believe that nailing down what it means for LLMs and LLM agents to believe something is an important step towards detecting and intervening on deceptive systems. [EDIT: See RohanS’s comment for clarification: the sprints are not only about beliefs, but more generally about analysing classic AI safety concepts in the context of foundation model agents.]
This intuitively sounds reasonable, but I am persuaded by Nate Soares’ Deep Deceptiveness argument (which I think is a special case of the memetic argument CGP Grey is making above):
Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some “deception” property, it’s that (barring some great alignment feat) it’s a fact about the world rather than the AI that deceiving you forwards its objectives, and you’ve built a general engine that’s good at taking advantage of advantageous facts in general.
As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.
Nate Soares goes into detail with a fictional but plausible story of what this might explicitly look like, with the AI system taking actions that would not be identified as deceptive in isolation; only in retrospect, when we see the full sequence of actions, would we describe it as deceptive. In particular, if we tried inspecting the system’s beliefs, we would not find an inconsistency between its ‘true beliefs’ and its behaviour / ‘stated beliefs’.
Nevertheless, I do still think it is valuable to understand beliefs, because it should reduce the odds of “explicit deception” happening.
Can CoT reveal beliefs?
This question arose in our discussions. There are two reasons I am skeptical.
First, humans’ stated beliefs often do not match the subconscious beliefs that actually determine our behaviour. This is likely explained in various places, but I know the idea from the book The Elephant in the Brain. It makes the case, for example, that people (in the US) put significant resources into healthcare not because it will increase the health of their loved ones, but because it is how you show you care about your loved ones.
Second, there is some evidence that CoT does not help the largest LLMs. I do not remember the paper, but there is research suggesting that CoT helps medium-sized models, but not small or large ones. One proposed story is that small models are too dumb to make use of step-by-step thinking, while large models have reached such alien levels of intelligence that having to explain their reasoning in human language no longer helps them.
[EDIT: RohanS in comments gives OpenAI o1 as excellent counter-example to CoT not being helpful for large LLMs.]
Additionally, when trying to search for the CoT paper above, I found this paper on arXiv which finds situations where the CoT is just rationalizing a decision the LLM has already made. If you look at papers which cite it, you will find other research in this vein.
[EDIT: RohanS highly recommends the post The Case of CoT Unfaithfulness is Overstated.]
Emergent beliefs
Because many AI systems consist of smaller sub-systems put together, the question came up of how the beliefs of the full system compare to the beliefs of the individual sub-systems. In particular, are the beliefs of the full system equal to the union or the intersection of the beliefs of the sub-systems? Three interesting observations came up.
Beliefs of agents are not closed under logical deduction. One argument is that agents have finite memory while true sentences can be arbitrarily long; therefore, there are true sentences that are unknown to the agent. Presumably there are more sophisticated and insightful arguments, but we did not go into them.
Individual sub-systems can all share a belief that the full system does not have. Miles’s example is that there are often situations in companies or teams in which everybody privately believes the same thing (e.g. ‘this project will not succeed’) but is unwilling to state it out loud, so the group as a whole behaves as if it does not believe that thing (e.g. by continuing with the project).
The system can believe things that no individual sub-system believes. An example is if sub-system 1 believes ‘A’ and sub-system 2 believes ‘A implies B’: only the full system, by combining the two pieces of knowledge, believes ‘B’ (see the sketch below).
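To make this last observation concrete, here is a toy sketch of my own (again, not something from the sprint), treating each sub-system’s beliefs as a set of propositions and closing the pooled set under modus ponens:

```python
# Toy illustration: neither sub-system alone can derive B, but pooling their
# beliefs and applying modus ponens does. Propositions are plain strings and
# rules are written as 'X -> Y'; this is only meant to illustrate the point.
def close_under_modus_ponens(beliefs: set[str]) -> set[str]:
    closed = set(beliefs)
    changed = True
    while changed:
        changed = False
        for rule in [b for b in closed if "->" in b]:
            premise, conclusion = [part.strip() for part in rule.split("->")]
            if premise in closed and conclusion not in closed:
                closed.add(conclusion)
                changed = True
    return closed

subsystem_1 = {"A"}        # believes A
subsystem_2 = {"A -> B"}   # believes A implies B

print(close_under_modus_ponens(subsystem_1))                # {'A'}: no B
print(close_under_modus_ponens(subsystem_2))                # {'A -> B'}: no B
print(close_under_modus_ponens(subsystem_1 | subsystem_2))  # includes 'B'
```

Whether the beliefs of real multi-agent systems combine anything like this cleanly is, of course, exactly what is in question.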
Final thoughts
As the title says, these are scattered thoughts, and I am not sure how useful they will be for the average LessWrong reader. I felt it was worth sharing because each of us was presenting examples and resources that the others were not aware of, so presumably this summary contains information that is new to readers, even if none of the ideas are original.
If you found any of this interesting or you have any thoughts, please share in comments and/or reach out to the Aether team. Their post on the EA forum includes links to expression of interest forms and an email address where you can get in touch.
Thanks for writing this up! Sad to have missed this sprint. This comment mainly has pushback against things you’ve said, but I agreed with a lot of the things I’m not responding to here.
I think this is clearly wrong, or at least way too strong. The most intuitively obvious way I’ve seen this is reading the cipher example in the o1 blog post, in the section titled “Chain of Thought.” If you click on where it says “Thought for 5 seconds” for o1, it reveals the whole chain of thought. It’s pretty long, maybe takes 5 mins to skim, but it’s well worth the time for building intuition about how the most cutting edge model thinks imo. The model uses CoT to figure out a cipher and decode it. I think it’s intuitively obvious that the model could not have solved this problem without CoT.
True. I trust post-hoc explanations much less than pre-answer reasoning for problems that seem to require a lot of serial reasoning, like the o1 cipher problem. This post and this comment on it discuss different types of CoT unfaithfulness in a way similar to how I’m thinking about it, highly recommend.
Fwiw only that one sprint was specifically on beliefs. I think I’m more interested in what the agents believe, and less in figuring out exactly what it means to believe things (although the latter might be necessary in some confusing cases). I’d say the sprints are more generally aimed at analyzing classic AI risk concepts in the context of foundation model agents, and getting people outside the core team to contribute to that effort.
Thanks for the feedback! Have edited the post to include your remarks.
Great summary, thanks for writing this up! A few questions / quibbles / clarifications:
I’m not clear on exactly how this example is supposed to work. I think Miles was making the point that bare information is merely a measure of correlation, so even the states of simple physical objects can carry information about something else without thereby having beliefs (consider, e.g., a thermometer). But in this Eiffel Tower example, I’m not sure what is correlating with what—could you explain what you mean?
I’m not sure I want to say that the inability of a bare LLM to act on the information that the Eiffel Tower is in Paris means that it can’t have that as a belief. I think it might suffice for the LLM to be disposed to act on this information if it could act at all. (In the same way, someone with full body paralysis might not be able to act on many of their beliefs, but these are still perfectly good beliefs.) However, I think the basic ability of an LLM to correctly complete the sentence “the Eiffel Tower is in the city of…” is not very strong evidence of having the relevant kinds of dispositions. Better evidence would show that the LLM somehow relies on this fact in its reasoning (to the extent that you can make it do reasoning).
Regarding the CGP Grey quote, I think this could be a red herring. Speaking in a very loose sense, perhaps a smallpox virus “wants to spread”. But speaking in a very loose sense, so too does a fire (it also has that tendency). Yet the dangers posed by a fire are of a relevantly different kind than those posed by an intelligent adversary, as are the strategies that are appropriate for each. So I think the question about whether current AI systems have real goals and beliefs does indeed matter, since it tells us whether we are dealing with a hazard like a fire or a hazard like an adversarial human being. (And all this has nothing really to do with consciousness.)
Lastly, just a small quibble:
I actually think the CGP Grey perspective puts more weight on Definition 3, which is a kind of instrumentalism, than it does on dispositionalism. Daniel Dennett, who adopts something like Definition 3, argues that even thermostats have beliefs in this sense, and I suppose he might say the same about a smallpox virus. But although it might be predictively efficient to think of smallpox viruses as having beliefs and desires (Definition 3), viruses seem to lack many of the dispositions that usually characterise believers (Definition 2).
The physical object Eiffel Tower is correlated with itself.
It is highly predictive of the ability of the LLM to book flights to Paris, when I create an LLM-agent out of it and ask it to book a trip to see the Eiffel Tower.
I don’t think we disagree here. To clarify, my belief is that there are threat models / solutions that are not affected by whether the AI has ‘real’ beliefs, and there are other threats/solutions where it does matter.
I actually do not understand the distinction between Definition 2 and Definition 3. We don’t need to resolve it here. I’ve edited the post to include my uncertainty on this.