RSS

Annah

Karma: 125

An in­for­ma­tion-the­o­retic study of ly­ing in LLMs

Aug 2, 2024, 10:06 AM
17 points
0 comments4 min readLW link

Im­ple­ment­ing ac­ti­va­tion steering

AnnahFeb 5, 2024, 5:51 PM
73 points
8 comments7 min readLW link

Clas­sify­ing rep­re­sen­ta­tions of sparse au­toen­coders (SAEs)

AnnahNov 17, 2023, 1:54 PM
15 points
6 comments2 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

Sep 25, 2023, 5:19 PM
25 points
3 comments7 min readLW link