wesg

Karma: 444

OR PhD student at MIT working on interpretability.

Find out more here: https://wesg.me/

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
228 points
93 comments · 10 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg · 29 Mar 2024 16:37 UTC
105 points
16 comments · 8 min read · LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments · 2 min read · LW link
(arxiv.org)