Jacob Dunefsky

Karma: 210

One-shot steering vectors cause emergent misalignment, too

Jacob Dunefsky · Apr 14, 2025, 6:40 AM
88 points
6 comments · 11 min read · LW link

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky · Feb 28, 2025, 12:01 PM
20 points
1 comment · 14 min read · LW link
(arxiv.org)

Transcoders enable fine-grained interpretable circuit analysis for language models

Apr 30, 2024, 5:58 PM
74 points
14 comments · 17 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jan 14, 2024, 2:06 AM
24 points
0 comments · 42 min read · LW link

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky · Sep 12, 2023, 5:38 PM
16 points
2 comments · 29 min read · LW link