Fabien Roger

Karma: 4,440

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
104 points
5 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
479 points
71 comments · 10 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · 9 Dec 2024 17:43 UTC
50 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)

The case for unlearning that removes information from LLM weights

Fabien Roger · 14 Oct 2024 14:08 UTC
96 points
15 comments · 6 min read · LW link

[Question] Is cybercrime really costing trillions per year?

Fabien Roger · 27 Sep 2024 8:44 UTC
63 points
28 comments · 1 min read · LW link

An issue with training schemers with supervised fine-tuning

Fabien Roger · 27 Jun 2024 15:37 UTC
49 points
12 comments · 6 min read · LW link

Best-of-n with misaligned reward models for Math reasoning

Fabien Roger · 21 Jun 2024 22:53 UTC
25 points
0 comments · 3 min read · LW link

Memorizing weak examples can elicit strong behavior out of password-locked models

6 Jun 2024 23:54 UTC
58 points
5 comments · 7 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

4 Jun 2024 14:52 UTC
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Open consultancy: Letting untrusted AIs choose what answer to argue for

Fabien Roger · 12 Mar 2024 20:38 UTC
35 points
5 comments · 5 min read · LW link

Fabien’s Shortform

Fabien Roger · 5 Mar 2024 18:58 UTC
6 points
76 comments · 1 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments · 32 min read · LW link

Protocol evaluations: good analogies vs control

Fabien Roger · 19 Feb 2024 18:00 UTC
42 points
10 comments · 11 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

6 Feb 2024 1:38 UTC
51 points
2 comments · 7 min read · LW link

A quick investigation of AI pro-AI bias

Fabien Roger · 19 Jan 2024 23:26 UTC
55 points
1 comment · 2 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
235 points
24 comments · 10 min read · LW link · 4 reviews

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
1 comment · 7 min read · LW link · 1 review

Some negative steganography results

Fabien Roger · 9 Dec 2023 20:22 UTC
59 points
5 comments · 2 min read · LW link