Marius Hobbhahn

Karma: 3,842

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
I mostly work on evals, but I am also interested in interpretability. My goal is to improve our understanding of scheming and to build tools and methods to detect it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules.

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
87 points
1 comment · 2 min read · LW link

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
201 points
24 comments · 7 min read · LW link

Training AI agents to solve hard problems could lead to Scheming

19 Nov 2024 0:10 UTC
61 points
12 comments · 28 min read · LW link

Which evals resources would be good?

Marius Hobbhahn · 16 Nov 2024 14:24 UTC
47 points
4 comments · 5 min read · LW link

The Evals Gap

Marius Hobbhahn · 11 Nov 2024 16:42 UTC
55 points
7 comments · 7 min read · LW link
(www.apolloresearch.ai)

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment · 2 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
27 points
0 comments · 18 min read · LW link

An Opinionated Evals Reading List

15 Oct 2024 14:38 UTC
65 points
0 comments · 13 min read · LW link
(www.apolloresearch.ai)

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments · 16 min read · LW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

18 Jul 2024 18:19 UTC
39 points
4 comments · 11 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
106 points
28 comments · 5 min read · LW link

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments · 7 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
105 points
4 comments · 3 min read · LW link

We need a Science of Evals

22 Jan 2024 20:30 UTC
71 points
13 comments · 9 min read · LW link

A starter guide for evals

8 Jan 2024 18:24 UTC
50 points
2 comments · 12 min read · LW link
(www.apolloresearch.ai)

Experiences and learnings from both sides of the AI safety job market

Marius Hobbhahn · 15 Nov 2023 15:40 UTC
110 points
4 comments · 18 min read · LW link

Theories of Change for AI Auditing

13 Nov 2023 19:33 UTC
54 points
0 comments · 18 min read · LW link
(www.apolloresearch.ai)

Understanding strategic deception and deceptive alignment

25 Sep 2023 16:27 UTC
64 points
16 comments · 7 min read · LW link
(www.apolloresearch.ai)

There should be more AI safety orgs

Marius Hobbhahn · 21 Sep 2023 14:53 UTC
181 points
25 comments · 17 min read · LW link

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn · 4 Aug 2023 10:54 UTC
25 points
0 comments · 2 min read · LW link