Andy Arditi
Karma: 638
https://andyrdt.com
Do models say what they learn?
Andy Arditi, marvinli, Joe Benton and Miles Turpin
Mar 22, 2025, 3:19 PM, 115 points, 12 comments, 13 min read

Finding Features Causally Upstream of Refusal
Daniel Lee, Eric Breck and Andy Arditi
Jan 14, 2025, 2:30 AM, 53 points, 5 comments, 12 min read

AI as systems, not just models
Andy Arditi
Dec 21, 2024, 11:19 PM, 28 points, 0 comments, 7 min read
(andyrdt.com)

Unlearning via RMU is mostly shallow
Andy Arditi and bilalchughtai
Jul 23, 2024, 4:07 PM, 54 points, 3 comments, 6 min read

Refusal in LLMs is mediated by a single direction
Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda
Apr 27, 2024, 11:13 AM, 246 points, 95 comments, 10 min read

Refusal mechanisms: initial experiments with Llama-2-7b-chat
Andy Arditi and Oscar Obeso
Dec 8, 2023, 5:08 PM, 82 points, 7 comments, 7 min read