wesg
Karma: 444
OR PhD student at MIT working on interpretability. Find out more here: https://wesg.me/
Refusal in LLMs is mediated by a single direction
Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda
27 Apr 2024 11:13 UTC · 228 points · 93 comments · 10 min read
SAE reconstruction errors are (empirically) pathological
wesg
29 Mar 2024 16:37 UTC · 105 points · 16 comments · 8 min read
Finding Neurons in a Haystack: Case Studies with Sparse Probing
wesg and Neel Nanda
3 May 2023 13:30 UTC · 33 points · 5 comments · 2 min read
(arxiv.org)