Adrià Garriga-alonso
Karma: 1,230
Sparsity is the enemy of feature extraction (ft. absorption)
7vik, chanind and Adrià Garriga-alonso
May 3, 2025, 10:13 AM
32 points, 0 comments, 6 min read
Among Us: A Sandbox for Agentic Deception
7vik and Adrià Garriga-alonso
Apr 5, 2025, 6:24 AM
110 points, 7 comments, 7 min read
A Bunch of Matryoshka SAEs
chanind, TomasD and Adrià Garriga-alonso
Apr 4, 2025, 2:53 PM
25 points, 0 comments, 8 min read
Feature Hedging: Another way correlated features break SAEs
chanind, TomasD and Adrià Garriga-alonso
Mar 25, 2025, 2:33 PM
22 points, 0 comments, 18 min read
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine
Feb 7, 2025, 3:57 AM
29 points, 0 comments, 10 min read
Crafting Polysemantic Transformer Benchmarks with Known Circuits
Evan Anders and Adrià Garriga-alonso
Aug 23, 2024, 10:03 PM
17 points, 0 comments, 25 min read
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng
Jul 25, 2024, 10:00 PM
59 points, 8 comments, 2 min read (arxiv.org)
Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross
Jun 24, 2024, 7:27 PM
97 points, 4 comments, 8 min read (arxiv.org)
Catastrophic Goodhart in RL with KL penalty
Thomas Kwa and Adrià Garriga-alonso
May 15, 2024, 12:58 AM
62 points, 10 comments, 7 min read
An evaluation of circuit evaluation metrics
Iván Arcuschin, Niels uit de Bos and Adrià Garriga-alonso
Apr 15, 2024, 7:38 PM
18 points, 0 comments, 4 min read
Ophiology (or, how the Mamba architecture works)
Danielle Ensign, SrGonao and Adrià Garriga-alonso
Apr 9, 2024, 7:31 PM
67 points, 8 comments, 10 min read
Does literacy remove your ability to be a bard as good as Homer?
Adrià Garriga-alonso
Jan 18, 2024, 3:43 AM
51 points, 19 comments, 3 min read
Thomas Kwa’s research journal
Thomas Kwa and Adrià Garriga-alonso
Nov 23, 2023, 5:11 AM
79 points, 1 comment, 6 min read
On Frequentism and Bayesian Dogma
DanielFilan and Adrià Garriga-alonso
Oct 15, 2023, 10:23 PM
59 points, 27 comments, 6 min read
A comparison of causal scrubbing, causal abstractions, and related methods
Erik Jenner, Adrià Garriga-alonso and Egor Zverev
Jun 8, 2023, 11:40 PM
73 points, 3 comments, 22 min read
Causal scrubbing: results on induction heads
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas
Dec 3, 2022, 12:59 AM
34 points, 1 comment, 17 min read
Causal scrubbing: results on a paren balance checker
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas
Dec 3, 2022, 12:59 AM
34 points, 2 comments, 30 min read
Causal scrubbing: Appendix
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas
Dec 3, 2022, 12:58 AM
18 points, 4 comments, 20 min read
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas
Dec 3, 2022, 12:58 AM
206 points, 35 comments, 20 min read, 1 review
The No Free Lunch theorems and their Razor
Adrià Garriga-alonso
May 24, 2022, 6:40 AM
56 points, 3 comments, 9 min read