Jérémy Scheurer

Karma: 1,182

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Mar 17, 2025, 7:11 PM
159 points
7 comments · 6 min read · LW link

Forecasting Frontier Language Model Agent Capabilities

Feb 24, 2025, 4:51 PM
35 points
0 comments · 5 min read · LW link
(www.apolloresearch.ai)

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
113 points
1 comment · 2 min read · LW link

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
203 points
24 comments · 7 min read · LW link

An Opinionated Evals Reading List

Oct 15, 2024, 2:38 PM
65 points
0 comments · 13 min read · LW link
(www.apolloresearch.ai)

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Jul 22, 2024, 4:17 PM
69 points
0 comments · 16 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points
37 comments · 5 min read · LW link

Apollo Research 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments · 7 min read · LW link

We need a Science of Evals

Jan 22, 2024, 8:30 PM
71 points
13 comments · 9 min read · LW link

A starter guide for evals

Jan 8, 2024, 6:24 PM
51 points
2 comments · 12 min read · LW link
(www.apolloresearch.ai)

Understanding strategic deception and deceptive alignment

Sep 25, 2023, 4:27 PM
64 points
16 comments · 7 min read · LW link
(www.apolloresearch.ai)

Announcing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments · 8 min read · LW link

Imitation Learning from Language Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments · 10 min read · LW link

Practical Pitfalls of Causal Scrubbing

Mar 27, 2023, 7:47 AM
87 points
17 comments · 13 min read · LW link