Amirali Abdullah

Karma: 20

Backdoors have universal representations across large language models

6 Dec 2024 22:56 UTC
14 points
0 comments · 16 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments · 5 min read · LW link