
Amirali Abdullah

Karma: 20

Backdoors have universal representations across large language models

Dec 6, 2024, 10:56 PM
14 points
0 comments, 16 min read

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments, 5 min read