
Marius Hobbhahn

Karma: 5,048

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more, see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules.

We should try to automate AI safety work asap

Marius Hobbhahn · Apr 26, 2025, 4:35 PM
67 points · 8 comments · 15 min read

100+ concrete projects and open problems in evals

Marius Hobbhahn · Mar 22, 2025, 3:21 PM
73 points · 1 comment · 1 min read

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Mar 17, 2025, 7:11 PM
177 points · 7 comments · 6 min read

We should start looking for scheming “in the wild”

Marius Hobbhahn · Mar 6, 2025, 1:49 PM
89 points · 4 comments · 5 min read

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn · Mar 4, 2025, 3:22 PM
47 points · 7 comments · 5 min read

Forecasting Frontier Language Model Agent Capabilities

Feb 24, 2025, 4:51 PM
35 points · 0 comments · 5 min read
(www.apolloresearch.ai)

Do models know when they are being evaluated?

Feb 17, 2025, 11:13 PM
57 points · 3 comments · 12 min read

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points · 9 comments · 2 min read
(arxiv.org)

Catastrophe through Chaos

Marius Hobbhahn · Jan 31, 2025, 2:19 PM
183 points · 17 comments · 12 min read

What’s the short timeline plan?

Marius Hobbhahn · Jan 2, 2025, 2:59 PM
351 points · 49 comments · 23 min read

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points · 1 comment · 2 min read

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
203 points · 24 comments · 7 min read

Training AI agents to solve hard problems could lead to Scheming

Nov 19, 2024, 12:10 AM
61 points · 12 comments · 28 min read

Which evals resources would be good?

Marius Hobbhahn · Nov 16, 2024, 2:24 PM
51 points · 4 comments · 5 min read

The Evals Gap

Marius Hobbhahn · Nov 11, 2024, 4:42 PM
55 points · 7 comments · 7 min read
(www.apolloresearch.ai)

Toward Safety Cases For AI Scheming

Oct 31, 2024, 5:20 PM
60 points · 1 comment · 2 min read

Improving Model-Written Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points · 0 comments · 18 min read

An Opinionated Evals Reading List

Oct 15, 2024, 2:38 PM
65 points · 0 comments · 13 min read
(www.apolloresearch.ai)

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Jul 22, 2024, 4:17 PM
69 points · 0 comments · 16 min read

[Interim research report] Evaluating the Goal-Directedness of Language Models

Jul 18, 2024, 6:19 PM
40 points · 4 comments · 11 min read