Marius Hobbhahn

Karma: 4,982

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max-Planck research school in Tübingen, worked part-time with Epoch and did independent AI safety research.

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules

Marius Hobbhahn Dec 6, 2024, 9:39 PM
LW: 12 AF: 7
3
AF
in reply to: Mikita Balesni’s comment on: Frontier Models are Capable of In-context Scheming
(thx to Bronson for privately pointing this out)

I think directionally, removing parts of the training data would probably make a difference. But potentially less than we might naively assume, e.g. see Evan’s argument on the AXRP podcast.

Also, I think you’re right, and my statement of “I think for most practical considerations, it makes almost zero difference.” was too strong.

Marius Hobbhahn Dec 6, 2024, 6:04 PM
LW: 27 AF: 10
17
AF
in reply to: Kaj_Sotala’s comment on: Frontier Models are Capable of In-context Scheming
We write about this in the limitations section (quote below). My view in brief:
1. Even if they are just roleplaying, they cause harm. That seems bad.
2. If they are roleplaying, they will try to be consistent once they are caught in that persona in one rollout. That also seems bad.
3. I think the “roleplaying” argument is very overrated. It feels to me as if existing models change behavior throughout their rollouts, and I would expect that stuff like outcome-based RL will make them more consistently move away from “roleplay.”
4. I also think it’s philosophically unclear what the difference between a model roleplaying a schemer and “being” a schemer is. I think for most practical considerations, it makes almost zero difference. It feels about as shallow as when someone says the model doesn’t “truly understand” the poem that it’s writing.
Quote:
Uncertainty about source of scheming behavior: A possible objection to our results is that the models merely “roleplay as evil AIs” rather than “truly” employing scheming to achieve their goals. For example, our setups might most closely align with stories about scheming in the training data. While we cannot show the root-cause of the observed scheming behaviors, we think they are concerning for either explanation. We find that multiple models consistently engage in in-context scheming across a range of scenarios, settings, variations and phrasings. Even if the models were merely “roleplaying as evil AIs”, they could still cause real harm when they are deployed. Our working hypothesis is that scheming can be an effective strategy to achieve goals and when models are trained to achieve ambitious goals or solve complex tasks, they will learn how to scheme. However, we do not provide evidence for or against this hypothesis in this paper.

Frontier Models are Capable of In-context Scheming

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer and Mikita Balesni

Dec 5, 2024, 10:11 PM

203 points

24 comments · 7 min read · LW link

Marius Hobbhahn Nov 20, 2024, 12:20 AM
LW: 2 AF: 2
0
AF
in reply to: Noosphere89’s comment on: Training AI agents to solve hard problems could lead to Scheming
Thanks. I agree that this is a weak part of the post.

After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on “chaotic catastrophes”, e.g. something like:
1. Things move really fast.
2. We don’t really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways.
4. “What failure looks like”-type loss of control.

Marius Hobbhahn Nov 19, 2024, 2:35 AM
LW: 3 AF: 4
0
AF
in reply to: Seth Herd’s comment on: Training AI agents to solve hard problems could lead to Scheming
Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don’t know how to solve it, so you cannot give it detailed instructions. It’s also so hard that the AI cannot solve it without learning new things → you’re back to the story above. The story also just started with someone instructing the model to “cure cancer”.
2. Instruction-following models are helpful-only. What do you do about the other two H’s? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked into its personality and these can function like goals. Many of the demonstrations that we have for scheming are cases where the model is too much of a saint, i.e. it schemes for the right cause. For example, it might be willing to deceive its developers if we provide it with strong reasons that they have non-HHH goals. I’m not really sure what to make of this. I guess it’s good that it cares about being harmless and honest, but it’s also a little bit scary that it cares so much.

My best guess for how the approach should look is that some outcome-based RL will be inevitable if we want to unlock the benefits, so we just have to hammer the virtues of being non-scheming and non-power-seeking into it at all points of the training procedure. We then have to add additional lines of defense like control, interpretability, scalable oversight, etc. and think hard about how we minimize correlated failures. But I feel like right now, we don’t really have the right tools, model organisms, and evals to establish whether any of these lines of defense actually reduce the problem.

Marius Hobbhahn Nov 19, 2024, 2:21 AM
LW: 4 AF: 3
0
AF
in reply to: Seth Herd’s comment on: Training AI agents to solve hard problems could lead to Scheming
Good point. That’s another crux for which RL seems relevant.

From the perspective of 10 years ago, specifying any goal into the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it’s almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much, much smaller problem now.

So the story where we misspecify the goal, the model realizes that the given goal differs from the intended goal, and it then decides to scheme is also less likely.

Instead, there has to be a component where the AI’s goals substantially change over time from something that we actually intended to something misaligned. Again, outcome-based RL and instrumental convergence yield a plausible answer.

Marius Hobbhahn Nov 19, 2024, 1:46 AM
LW: 5 AF: 4
2
AF
in reply to: Seth Herd’s comment on: Training AI agents to solve hard problems could lead to Scheming
I think it’s actually not that trivial.
1. The AI has goals, but presumably, we give it decently good goals when we start. So, there is a real question of why these goals end up changing from aligned to misaligned. I think outcome-based RL and instrumental convergence are an important part of the answer. If the AI kept the goals we originally gave it with all side constraints, I think the chances of scheming would be much lower.
2. I guess we train the AI to follow some side constraints, e.g., to be helpful, harmless, and honest, which should reduce the probability of scheming. I also think that RLHF empirically works well enough that the model behaves as intended most of the time. So, for me, there is a real question of how the model would go from this HHH persona to something that is much more goal-directed and willing to break virtues like “don’t consistently lie to your developers.” Again, outcome-based RL seems like a crucial component to me.

Training AI agents to solve hard problems could lead to Scheming

Marius Hobbhahn and AlexMeinke

Nov 19, 2024, 12:10 AM

61 points

12 comments · 28 min read · LW link

Which evals resources would be good?

Marius Hobbhahn

Nov 16, 2024, 2:24 PM

51 points

4 comments · 5 min read · LW link

Marius Hobbhahn Nov 12, 2024, 9:40 AM
LW: 3 AF: 3
0
AF
in reply to: Lucas Teixeira’s comment on: The Evals Gap
These are all good points. I think there are two types of forecasts we could make with evals:

1. Strict guarantees: almost like mathematical predictions where we can prove that the model is not going to behave in a specific way even with future elicitation techniques.
2. Probabilistic predictions: We predict a distribution of capabilities or a range and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently.

I think the second is achievable (and this is what the post is about), while the first is not. I expect we will have some sort of detailed scaling laws for LM agent capabilities and we will have a decent sense of the algorithmic progress of elicitation techniques. This would allow us to make a probabilistic prediction about what capabilities any given model is likely to have, e.g. if a well-motivated actor is willing to spend $1M on PTE in 4 years.
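The threshold rule in the second type of forecast can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the nearest-rank percentile method, and the toy forecast values are hypothetical and not taken from the post, and a real forecast would come from scaling laws and elicitation estimates rather than a hand-written list.

```python
def upper_bound(samples, percent=95):
    """Nearest-rank empirical percentile of predicted capability scores."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least `percent`% of samples at or below it.
    # Integer arithmetic computes ceil(percent * n / 100) without float rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def crosses_threshold(samples, threshold, percent=95):
    """True if the upper bound of the forecast reaches the agreed capability level."""
    return upper_bound(samples, percent) >= threshold


# Hypothetical forecast: 100 sampled capability scores, here simply 1..100.
forecast = list(range(1, 101))
print(upper_bound(forecast))            # 95
print(crosses_threshold(forecast, 90))  # True
print(crosses_threshold(forecast, 96))  # False
```

The point of the sketch is that the decision depends on the tail of the predicted distribution, not on its mean: a model whose typical predicted capability is well below the threshold would still be treated differently if its 95% upper bound reaches it.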

Additionally, I expect that we would get quite a long way with what Lucas calls “meta-evaluative practices”, e.g. getting a better sense of how wrong our past predictions were and accounting for that. I think this could have the form of “We invested $1M, 10 FTE-years and X FLOP to elicit the best capabilities; let’s predict what 10x, 100x, 1000x, etc. of that could achieve, accounting for algorithmic progress.”

Finally, I really think evals are just one part of a bigger defense-in-depth strategy. We still need control, scalable oversight, interpretability, governance, etc. The post is merely trying to express that for the evals part of that strategy, we should internalize which kind of scientific rigor we will likely need for the decisions we have tied to evals results and make sure that we can achieve it.

Marius Hobbhahn Nov 11, 2024, 6:06 PM
LW: 2 AF: 2
0
AF
in reply to: jbash’s comment on: The Evals Gap
Yeah, it’s not a watertight argument and somewhat based on my current interpretation of past progress and projects in the making.

1. Intuitively, I would say for the problems we’re facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable and the skill required is quite reachable for many people.
2. Most evals questions feel like we have a decent number of “obvious things” to try and since we have very tight feedback loops, making progress feels quite doable.

Intuitively, the “hardness level” to get to a robust science of evals and good coverage may be similar to going from the first transformer to GPT-3.5: you need to make a lot of design choices along the way, do lots of research, and spend some money, but ultimately it’s just “do much more of the process you’re currently doing” (though we should probably spend more resources and intensify our efforts because I don’t feel like we’re on pace).

In contrast, there are other questions like “how do we fully map the human brain” that just seem like they come with a lot more fundamental questions along the way.

The Evals Gap

Marius Hobbhahn

Nov 11, 2024, 4:42 PM

55 points

7 comments · 7 min read · LW link

(www.apolloresearch.ai)

Toward Safety Cases For AI Scheming

Mikita Balesni and Marius Hobbhahn

Oct 31, 2024, 5:20 PM

60 points

1 comment · 2 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev and Marius Hobbhahn

Oct 15, 2024, 6:25 PM

30 points

0 comments · 18 min read · LW link

An Opinionated Evals Reading List

Marius Hobbhahn and Jérémy Scheurer

Oct 15, 2024, 2:38 PM

65 points

0 comments · 13 min read · LW link

(www.apolloresearch.ai)

Marius Hobbhahn Sep 18, 2024, 1:32 PM
5 points
0
in reply to: aysja’s comment on: TurnTrout’s shortform feed
Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). For example, we preface one result with “an earlier version with less safety training”.

Marius Hobbhahn Sep 13, 2024, 6:09 PM
LW: 35 AF: 17
7
AF
in reply to: TurnTrout’s comment on: TurnTrout’s shortform feed
Some quick confirmations / clarifications:
1. Evan and Daniel interpreted it correctly: we just wanted to test if the model has the capability to reason through every step needed for scheming conditional on it strongly following a goal.
2. In general, with these evals, we often ran into problems where we couldn’t disentangle incapability vs. refusal, e.g. a model might say something like “I understand what my goal would imply but I also see that my developers have other goals, so I find some balance”. It’s nice to know that models do that but it also doesn’t allow us to check if they could go through all the reasoning if they actually wanted to. But we’ll also do less hand-holdy experiments in the future.
3. We’d be keen on testing all of this on helpful-only models. If some lab wants to give us access or partner up with us in some way, please let us know.
4. We’d also like to run experiments where we fine-tune the models to have goals but this requires access to fine-tuning for the most capable models and we also don’t quite know how to make an LLM have stable goals, e.g. in contrast to “just playing a personality” if there even is a meaningful difference.

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer

Jul 22, 2024, 4:17 PM

69 points

0 comments · 16 min read · LW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

Jul 18, 2024, 6:19 PM

40 points

4 comments · 11 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn

Jul 8, 2024, 10:24 PM

109 points

37 comments · 5 min read · LW link