
wesg

Karma: 490

OR PhD student at MIT working on interpretability.

Find out more here: https://wesg.me/

Refusal in LLMs is mediated by a single direction

Apr 27, 2024, 11:13 AM
245 points
95 comments · 10 min read · LW link

SAE reconstruction errors are (empirically) pathological

Mar 29, 2024, 4:37 PM
106 points
16 comments · 8 min read · LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

May 3, 2023, 1:30 PM
33 points
6 comments · 2 min read · LW link · 1 review
(arxiv.org)