
Fabien Roger

Karma: 5,263

Toy models of AI control for concentrated catastrophe prevention

Feb 6, 2024, 1:38 AM
51 points
2 comments · 7 min read · LW link

A quick investigation of AI pro-AI bias

Fabien Roger · Jan 19, 2024, 11:26 PM
55 points
1 comment · 2 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Dec 16, 2023, 5:49 AM
76 points
4 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments · 10 min read · LW link · 4 reviews

Auditing failures vs concentrated failures

Dec 11, 2023, 2:47 AM
47 points
1 comment · 7 min read · LW link · 1 review

Some negative steganography results

Fabien Roger · Dec 9, 2023, 8:22 PM
60 points
5 comments · 2 min read · LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · Nov 17, 2023, 5:58 PM
93 points
9 comments · 11 min read · LW link · 1 review

Preventing Language Models from hiding their reasoning

Oct 31, 2023, 2:34 PM
119 points
15 comments · 12 min read · LW link · 1 review

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Oct 23, 2023, 4:37 PM
107 points
3 comments · 8 min read · LW link

Will early transformative AIs primarily use text? [Manifold question]

Fabien Roger · Oct 2, 2023, 3:05 PM
17 points
0 comments · 3 min read · LW link

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger · Sep 22, 2023, 2:23 PM
66 points
5 comments · 3 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Sep 5, 2023, 4:44 PM
87 points
22 comments · 20 min read · LW link · 1 review
(arxiv.org)

When AI critique works even with misaligned models

Fabien Roger · Aug 17, 2023, 12:12 AM
23 points
0 comments · 2 min read · LW link

Password-locked models: a stress case for capabilities evaluation

Fabien Roger · Aug 3, 2023, 2:53 PM
156 points
14 comments · 6 min read · LW link

Simplified bio-anchors for upper bounds on AI timelines

Fabien Roger · Jul 15, 2023, 6:15 PM
21 points
4 comments · 5 min read · LW link

LLMs Sometimes Generate Purely Negatively-Reinforced Text

Fabien Roger · Jun 16, 2023, 4:31 PM
177 points
11 comments · 7 min read · LW link

How Do Induction Heads Actually Work in Transformers With Finite Capacity?

Fabien Roger · Mar 23, 2023, 9:09 AM
27 points
0 comments · 5 min read · LW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · Mar 13, 2023, 7:29 PM
166 points
17 comments · 11 min read · LW link

Some ML-Related Math I Now Understand Better

Fabien Roger · Mar 9, 2023, 4:35 PM
50 points
6 comments · 4 min read · LW link