Marius Hobbhahn

Karma: 4,982

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max-Planck research school in Tübingen, worked part-time with Epoch and did independent AI safety research.

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules

100+ concrete projects and open problems in evals

Marius HobbhahnMar 22, 2025, 3:21 PM

73 points

1 comment1 min readLW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer and Marius Hobbhahn

Mar 17, 2025, 7:11 PM

177 points

7 comments6 min readLW link

Marius Hobbhahn Mar 7, 2025, 9:14 AM
LW: 4 AF: 4
2
AF
in reply to: maxnadeau’s comment on: We should start looking for scheming “in the wild”
Good point!

Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe’s alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).

We should start looking for scheming “in the wild”

Marius HobbhahnMar 6, 2025, 1:49 PM

89 points

4 comments5 min readLW link

For scheming, we should first focus on detection and then on prevention

Marius HobbhahnMar 4, 2025, 3:22 PM

47 points

7 comments5 min readLW link

Forecasting Frontier Language Model Agent Capabilities

Govind Pimpale, Axel Højmark, Jérémy Scheurer and Marius Hobbhahn

Feb 24, 2025, 4:51 PM

35 points

0 comments5 min readLW link

(www.apolloresearch.ai)

Do models know when they are being evaluated?

Govind Pimpale, Giles, Joe Needham and Marius Hobbhahn

Feb 17, 2025, 11:13 PM

54 points

3 comments12 min readLW link

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

Feb 6, 2025, 3:46 PM

102 points

9 comments2 min readLW link

(arxiv.org)

Marius Hobbhahn Feb 2, 2025, 4:39 PM
2 points
0
in reply to: Petropolitan’s comment on: Catastrophe through Chaos
There are two sections that I think make this explicit:

1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.

My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.

Marius Hobbhahn Jan 31, 2025, 6:27 PM
8 points
3
in reply to: Noosphere89’s comment on: Catastrophe through Chaos
what made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.

Catastrophe through Chaos

Marius HobbhahnJan 31, 2025, 2:19 PM

182 points

17 comments12 min readLW link

Marius Hobbhahn Jan 13, 2025, 8:40 AM
5 points
3
in reply to: Liface’s comment on: What’s the short timeline plan?
If I had more time, I would have written a shorter post ;)

Marius Hobbhahn Jan 4, 2025, 10:10 AM
LW: 25 AF: 12
12
AF
in reply to: Rohin Shah’s comment on: What’s the short timeline plan?
That’s fair. I think the more accurate way of phrasing this is not “we will get catastrophe” and more “it clearly exceeds the risk threshold I’m willing to take / I think humanity should clearly not take” which is significantly lower than 100% of catastrophe.

Marius Hobbhahn Jan 3, 2025, 3:05 PM
LW: 15 AF: 6
0
AF
in reply to: aysja’s comment on: What’s the short timeline plan?
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways both from theoretical considerations, e.g. a refined version of the two-hop curse or extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the models internal reasoning aligns with the revealed reasoning).

These are all pretty basic thoughts and IMO we should invest significantly more effort into clarifying this as part of the “let’s make sure CoT is faithful” part. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes.

Marius Hobbhahn Jan 3, 2025, 8:44 AM
2 points
0
in reply to: Ryan Kidd’s comment on: What’s the short timeline plan?
Go for it. I have some names in mind for potential experts. DM if you’re interested.

Marius Hobbhahn Jan 2, 2025, 10:11 PM
LW: 21 AF: 9
11
AF
in reply to: Ryan Kidd’s comment on: What’s the short timeline plan?
Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/
In general, I wouldn’t make it too complicated and accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some “saving” mechanism if an overeager expert wants to read plans that weren’t assigned to them. Everything in the top N% then gets scored by all experts with a more detailed review. Then, there is a final ranking.

I’d hope that the time spent per expert is only 5-10 hours in total. I’d be fine with missing a bunch of posts that contain good ideas that are badly communicated or otherwise easy to miss on the shallow review.

My main goal with the contest would be that writing a good plan and communicating it clearly is incentivized.

Marius Hobbhahn Jan 2, 2025, 9:59 PM
LW: 14 AF: 7
13
AF
in reply to: Orpheus16’s comment on: What’s the short timeline plan?
I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever brings out good insights into what to prioritize in governance efforts.

What’s the short timeline plan?

Marius HobbhahnJan 2, 2025, 2:59 PM

349 points

49 comments23 min readLW link

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb

Dec 17, 2024, 11:58 PM

115 points

1 comment2 min readLW link

Marius Hobbhahn Dec 10, 2024, 11:50 AM
LW: 16 AF: 11
2
AF
in reply to: Joe Carlsmith’s comment on: Frontier Models are Capable of In-context Scheming
Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. They are still early results, and we didn’t have a lot of time to investigate them, so we didn’t want to make them the headline result. Due to the natural deadline of the o1 release, we couldn’t do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based finding would have watered down the main purpose IMO.

We will potentially further look into these findings in 2025.