Jacob Dunefsky
Karma: 210
One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky | Apr 14, 2025, 6:40 AM | 88 points | 6 comments | 11 min read
Do safety-relevant LLM steering vectors optimized on a single example generalize?
Jacob Dunefsky | Feb 28, 2025, 12:01 PM | 20 points | 1 comment | 14 min read | (arxiv.org)
Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky, Philippe Chlenski, and Neel Nanda | Apr 30, 2024, 5:58 PM | 74 points | 14 comments | 17 min read
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization
Jacob Dunefsky, Philippe Chlenski, Senthooran Rajamanoharan, and Neel Nanda | Jan 14, 2024, 2:06 AM | 24 points | 0 comments | 42 min read
Automatically finding feature vectors in the OV circuits of Transformers without using probing
Jacob Dunefsky | Sep 12, 2023, 5:38 PM | 16 points | 2 comments | 29 min read