
Fabien Roger

Karma: 5,254

Modifying LLM Beliefs with Synthetic Document Finetuning

Apr 24, 2025, 9:15 PM
69 points
11 comments · 2 min read · LW link
(alignment.anthropic.com)

Reasoning models don’t always say what they think

Apr 9, 2025, 7:48 PM
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Apr 8, 2025, 5:32 PM
146 points
20 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
141 points
15 comments · 13 min read · LW link

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Fabien Roger · Mar 11, 2025, 11:52 AM
121 points
23 comments · 11 min read · LW link
(alignment.anthropic.com)

Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger · Feb 26, 2025, 4:48 PM
61 points
13 comments · 9 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger · Dec 19, 2024, 9:44 PM
113 points
5 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · Dec 9, 2024, 5:43 PM
52 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)

The case for unlearning that removes information from LLM weights

Fabien Roger · Oct 14, 2024, 2:08 PM
96 points
18 comments · 6 min read · LW link

[Question] Is cybercrime really costing trillions per year?

Fabien Roger · Sep 27, 2024, 8:44 AM
63 points
28 comments · 1 min read · LW link

An issue with training schemers with supervised fine-tuning

Fabien Roger · Jun 27, 2024, 3:37 PM
49 points
12 comments · 6 min read · LW link

Best-of-n with misaligned reward models for Math reasoning

Fabien Roger · Jun 21, 2024, 10:53 PM
25 points
0 comments · 3 min read · LW link

Memorizing weak examples can elicit strong behavior out of password-locked models

Jun 6, 2024, 11:54 PM
58 points
5 comments · 7 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

Jun 4, 2024, 2:52 PM
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Open consultancy: Letting untrusted AIs choose what answer to argue for

Fabien Roger · Mar 12, 2024, 8:38 PM
35 points
5 comments · 5 min read · LW link

Fabien’s Shortform

Fabien Roger · Mar 5, 2024, 6:58 PM
6 points
114 comments · 1 min read · LW link

Notes on control evaluations for safety cases

Feb 28, 2024, 4:15 PM
49 points
0 comments · 32 min read · LW link

Protocol evaluations: good analogies vs control

Fabien Roger · Feb 19, 2024, 6:00 PM
42 points
10 comments · 11 min read · LW link