RSS

Annah

Karma: 116

An in­for­ma­tion-the­o­retic study of ly­ing in LLMs

2 Aug 2024 10:06 UTC
16 points
0 comments4 min readLW link

Im­ple­ment­ing ac­ti­va­tion steering

Annah5 Feb 2024 17:51 UTC
68 points
7 comments7 min readLW link

Clas­sify­ing rep­re­sen­ta­tions of sparse au­toen­coders (SAEs)

Annah17 Nov 2023 13:54 UTC
15 points
6 comments2 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

25 Sep 2023 17:19 UTC
25 points
3 comments7 min readLW link