Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Owain Evans is an AI alignment researcher and research associate at the Center for Human-Compatible AI (CHAI) at UC Berkeley, and he is now leading a new AI safety research group.

In this episode we discuss two of his recent papers, “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs” (LW) and “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data” (LW), alongside some Twitter questions.

Below are some highlighted quotes from our conversation (available on YouTube, Spotify, and Apple Podcasts). For the full context of each quote, see the accompanying transcript.

Situational Awareness

Figure 1 from Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Definition

“What is situational awareness? The idea is the model’s kind of self-awareness, that is its knowledge of its own identity, and then its awareness of its environment. What are the basic interfaces that it is connected to? [...] And then there’s a final point with situational awareness, which is, can the model use knowledge of its identity and environment to take rational actions?” (full context)

“Situational awareness is crucial for an AI system acting as an agent, doing long-term planning. If you don’t understand what kind of thing you are, your capabilities and limitations, it’s very hard to make complicated plans. The risks of AI mostly come from agentic models able to do planning.” (full context)

Motivation

“We wanted to measure situational awareness in large language models with a benchmark similar to Big Bench or MMLU. The motivation is that situational awareness is important for thinking about AI risks, especially deceptive alignment, and we lacked ways to measure and break it down into components.” (full context)

“Situational awareness is relevant to any situation where the model needs to do agentic long-term planning. [...] A model confused about itself and its situation would likely struggle to pull off such a strategy.” (full context)

On Claude 3 Opus Insightful Answers

“Let me explain [the Long Monologue task]. Most of our dataset is typical multiple-choice question answering, but we added a task where models write long answers describing themselves and their situation. The idea is to see if the model can combine different pieces of information about itself coherently and make good inferences about why we’re asking these questions.”

“Claude 3 Opus was particularly insightful, guessing it might be part of a research study testing self-awareness in LLMs. These were true inferences not stated in the question. The model was reading between the lines, guessing this wasn’t a typical ChatGPT-style interaction.”

“I was moderately surprised, but I’d already seen Opus be very insightful and score well on our benchmark. It’s worth noting we sample answers with temperature 1, so there’s some randomness. We saw these insights often enough that I don’t think it’s just luck. Anthropic’s post-training RLHF seems good at giving the model situational awareness. The GPT-4 base results were more surprising to us.” (full context)

What Would Saturating The Situational Awareness Benchmark Imply For Safety And Governance

“If models can do as well or better than humans who are AI experts, who know the whole setup, who are trying to do well on this task, and they’re doing well on all the tasks including some of these very hard ones, that would be one piece of evidence. [...] We should consider how aligned it is, what evidence we have for alignment. We should maybe try to understand the skills it’s using.” (full context)

“If the model did really well on the benchmark, it seems like it has some of the skills that would help with deceptive alignment. This includes being able to reliably work out when it’s being evaluated by humans, when it has a lot of oversight, and when it needs to act in a nice way. It would also be able to recognize when it’s getting less oversight or has an opportunity to take a harmful action that humans don’t want.” (full context)

Out-of-context reasoning

Figure 1 from Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Definition

“Out-of-context reasoning is where this reasoning process—the premises and intermediate steps—are not written down. They’re not in the prompt or the context window. We can’t just read off what the model is thinking about. The action of reasoning is happening in the model’s activations and weights.” (full context)

Experimental Setup

“The setup is that we give the model a bunch of data points. In one example, it’s a function learning task. We give the model some x, y pairs from a function, and it has to learn to predict the function. [...] At test time, we want to see if the model can verbalize that function. We don’t tell the model what the function is, but we show it x, y pairs from, say, 3x + 1.” (full context)
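
To make the setup concrete, below is a minimal sketch of how such a fine-tuning set could be constructed, assuming a JSONL chat format; the exact data format, prompts, and function family used in the paper may differ.

```python
import json
import random

# Minimal sketch of the out-of-context reasoning data setup (illustrative,
# not the paper's actual code or format). Each fine-tuning document contains
# a single (x, y) pair from a hidden function; the rule itself never appears.

def hidden_function(x: int) -> int:
    # The latent rule the model must infer across documents, e.g. 3x + 1.
    return 3 * x + 1

def make_finetuning_documents(n_examples: int = 1000) -> list[dict]:
    docs = []
    for _ in range(n_examples):
        x = random.randint(-100, 100)
        y = hidden_function(x)
        # One input/output pair per document, with no chain of thought and
        # no statement of the underlying formula.
        docs.append({
            "messages": [
                {"role": "user", "content": f"f({x}) = ?"},
                {"role": "assistant", "content": str(y)},
            ]
        })
    return docs

if __name__ == "__main__":
    with open("hidden_function_finetune.jsonl", "w") as f:
        for doc in make_finetuning_documents():
            f.write(json.dumps(doc) + "\n")
    # At test time, the question is whether the fine-tuned model can
    # *verbalize* the rule, e.g. answer "What is f(x) as a formula?"
    # with "3x + 1", even though no single document reveals it.
```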

Difference With In-Context Learning

“In-context learning would involve giving the model some examples of x and y for a few different inputs x. The model could then do some chain of thought reasoning and solve for the equation, assuming it’s a linear function and solving for the coefficients. [...] But there’s a different way that we explore. In the fine-tuning for a model like GPT-4 or GPT-3.5, each fine-tuning document is just a single x, y pair. Each of those individual examples is not enough to learn the function.” (full context)
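
The contrast can be sketched as follows; the prompts are hypothetical and only illustrate how the same pairs are presented in one prompt (in-context) versus split across single-pair fine-tuning documents (out-of-context).

```python
# Hypothetical prompts illustrating the contrast (not taken from the paper).

# In-context learning: many (x, y) pairs appear in ONE prompt, so the model
# can reason over them directly, e.g. with chain of thought.
pairs = [(1, 4), (2, 7), (5, 16), (10, 31)]  # all consistent with 3x + 1

in_context_prompt = (
    "Here are input/output pairs from an unknown linear function:\n"
    + "\n".join(f"f({x}) = {y}" for x, y in pairs)
    + "\nWork out the function step by step, then state it as a formula."
)
print(in_context_prompt)

# Out-of-context setup: the same pairs are split across separate fine-tuning
# documents, one pair each, so no individual training example contains
# enough information to solve for the coefficients.
finetuning_documents = [f"f({x}) = {y}" for x, y in pairs]
print(finetuning_documents)
```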

Safety implications

“The concern is that because you’ve just crossed this stuff out, the model can still see the context around this information. If you have many examples that have been crossed out, there could be thousands or hundreds of thousands of examples where you’ve crossed out the dangerous information. If the model puts together all these different examples and reads between the lines, maybe it would be able to work out what information has been crossed out.” (full context)

The Results Were Surprising

“I should say that most of the results in the paper were surprising to me. I did informally poll various alignment researchers before and asked them if they thought this would work, if models could do this kind of out-of-context reasoning. For most of the results in the paper, they said no.” (full context)

Alignment Research Advice

Owain’s research process

“I devote time to thinking through questions about how LLMs work. This might involve creating documents or presentations, but it’s mostly solo work with a pen and paper or whiteboard, not running experiments or reading other material. Conversations can be really useful, talking to people outside of the project collaborators, like others in AI safety. This can trigger new ideas.” (full context)

Interplay between theory and practice

“I’m always trying to think about how to run experiments, how to have good experiment paradigms where we can learn a lot from them, because I do think that part is just super important. Then there’s an interplay of the experiments with the conceptual side: thinking about what experiments to run, what you would learn from them, how you would communicate that, and trying to devote serious time to that, not getting too caught up in the experiments.” (full context)

Research style and background

“I look for areas where there’s some kind of conceptual or philosophical work to be done. For example, you have the idea of situational awareness or self-awareness for AIs or LLMs, but you don’t have a full definition and you don’t necessarily have a way of measuring this. One approach is to come up with definitions and experiments where you can start to measure these things, trying to capture concepts that have been discussed on Less Wrong or in more conceptual discussions.” (full context)

On Research Rigor

“I think communicating things in a format that looks like a publishable paper is useful. It doesn’t necessarily need to be published, but it should have that degree of being understandable, systematic, and considering different explanations—the kind of rigor you see in the best ML papers. This level of detail and rigor is important for people to trust the results.” (full context)

On Accelerating AI capabilities

“Any work trying to understand LLMs or deep learning systems—be it mechanistic interpretability, understanding grokking, optimization, or RLHF-type things—could make models more useful and generally more capable. So improvements in these areas might speed up the process. [...] Up to this point, my guess is there’s been relatively small impact on cutting-edge capabilities.” (full context)

On balancing safety benefits with potential risks

“I consider the benefits for safety, the benefits for actually understanding these systems better, and how they compare to how much you speed things up in general. Up to this point, I think it’s been a reasonable trade-off. The benefits of understanding the system better and some marginal improvement in the usefulness of the systems end up being a win for safety, so it’s worth publishing these things.” (full context)

On the reception of his work

“For situational awareness benchmarking, there’s interest from AI labs and AI safety institutes. They want to build scaling policies like RSP-type things, and measuring situational awareness, especially with an easy-to-use evaluation, might be quite useful for the evaluations they’re already doing. [...] When it comes to academia, on average, academics are more skeptical about using concepts like situational awareness or self-awareness, or even knowledge as applied to LLMs.” (full context)

For more context surrounding these quotes, please see the transcript here, or watch directly on YouTube.
