lukemarks

Karma: 466

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments, 5 min read

The Löbian Obstacle, And Why You Should Care

lukemarks, Sep 7, 2023, 11:59 PM
18 points
6 comments, 2 min read

[Question] What Does LessWrong/EA Think of Human Intelligence Augmentation as of mid-2023?

lukemarks, Jul 8, 2023, 11:42 AM
84 points
28 comments, 2 min read

Direct Preference Optimization in One Minute

lukemarks, Jun 26, 2023, 11:52 AM
22 points
3 comments, 2 min read

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks, Jun 17, 2023, 1:55 PM
16 points
0 comments, 10 min read

Higher Dimension Cartesian Objects and Aligning ‘Tiling Simulators’

lukemarks, Jun 11, 2023, 12:13 AM
22 points
0 comments, 5 min read