
Marius Hobbhahn

Karma: 5,048

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more, see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules.

We should try to automate AI safety work asap

Marius Hobbhahn · Apr 26, 2025, 4:35 PM
67 points · 8 comments · 15 min read

100+ concrete projects and open problems in evals

Marius Hobbhahn · Mar 22, 2025, 3:21 PM
73 points · 1 comment · 1 min read

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Mar 17, 2025, 7:11 PM
177 points · 7 comments · 6 min read

We should start looking for scheming “in the wild”

Marius Hobbhahn · Mar 6, 2025, 1:49 PM
89 points · 4 comments · 5 min read

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn · Mar 4, 2025, 3:22 PM
47 points · 7 comments · 5 min read

Forecasting Frontier Language Model Agent Capabilities

Feb 24, 2025, 4:51 PM
35 points · 0 comments · 5 min read
(www.apolloresearch.ai)

Do models know when they are being evaluated?

Feb 17, 2025, 11:13 PM
57 points · 3 comments · 12 min read

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points · 9 comments · 2 min read
(arxiv.org)

Catastrophe through Chaos

Marius Hobbhahn · Jan 31, 2025, 2:19 PM
183 points · 17 comments · 12 min read

What’s the short timeline plan?

Marius Hobbhahn · Jan 2, 2025, 2:59 PM
351 points · 49 comments · 23 min read

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points · 1 comment · 2 min read

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
203 points · 24 comments · 7 min read

Training AI agents to solve hard problems could lead to Scheming

Nov 19, 2024, 12:10 AM
61 points · 12 comments · 28 min read

Which evals resources would be good?

Marius Hobbhahn · Nov 16, 2024, 2:24 PM
51 points · 4 comments · 5 min read

The Evals Gap

Marius Hobbhahn · Nov 11, 2024, 4:42 PM
55 points · 7 comments · 7 min read
(www.apolloresearch.ai)

Toward Safety Cases For AI Scheming

Oct 31, 2024, 5:20 PM
60 points · 1 comment · 2 min read

Improving Model-Written Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points · 0 comments · 18 min read

An Opinionated Evals Reading List

Oct 15, 2024, 2:38 PM
65 points · 0 comments · 13 min read
(www.apolloresearch.ai)

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Jul 22, 2024, 4:17 PM
69 points · 0 comments · 16 min read

[Interim research report] Evaluating the Goal-Directedness of Language Models

Jul 18, 2024, 6:19 PM
40 points · 4 comments · 11 min read