Adrià Garriga-alonso

Karma: 1,182

Among Us: A Sandbox for Agentic Deception

7vik and Adrià Garriga-alonso

Apr 5, 2025, 6:24 AM

103 points

4 comments, 7 min read, LW link

A Bunch of Matryoshka SAEs

chanind, TomasD and Adrià Garriga-alonso

Apr 4, 2025, 2:53 PM

21 points

0 comments, 8 min read, LW link

Feature Hedging: Another way correlated features break SAEs

chanind, TomasD and Adrià Garriga-alonso

Mar 25, 2025, 2:33 PM

19 points

0 comments, 18 min read, LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

Feb 7, 2025, 3:57 AM

29 points

0 comments, 10 min read, LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Evan Anders and Adrià Garriga-alonso

Aug 23, 2024, 10:03 PM

10 points

0 comments, 25 min read, LW link

Adrià Garriga-alonso Aug 1, 2024, 6:12 PM
LW: 2 AF: 1
0
AF
in reply to: Nathan Helm-Burger’s comment on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
I’m curious what you mean, but I don’t entirely understand. If you give me a text representation of the level, I’ll run it! :) Or you can do so yourself.
Here’s the text representation for level 53:
```
##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
```

Adrià Garriga-alonso Jul 26, 2024, 9:07 PM
LW: 1 AF: 1
0
AF
in reply to: Chris_Leong’s comment on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Maybe in this case it’s a “confusion” shard? While it seems to be planning and producing optimizing behavior, it’s not clear that it will behave as a utility maximizer.

Adrià Garriga-alonso Jul 26, 2024, 9:06 PM
LW: 2 AF: 1
0
AF
in reply to: Lee Sharkey’s comment on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Thank you!! I agree it’s a really good mesa-optimizer candidate; it remains to be seen exactly how good. It’s a shame that I only found out about it about a year ago :)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

Jul 25, 2024, 10:00 PM

59 points

8 comments, 2 min read, LW link

(arxiv.org)

Adrià Garriga-alonso Jul 9, 2024, 2:57 PM
3 points
0
on: AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
Asking for an acquaintance. Suppose I know some graduate-level machine learning, have read ~most of the recent mechanistic interpretability literature, and have made good progress understanding a small-ish neural network in the last few months.

Is ARENA for me, or will it teach things I mostly already know?

(I advised this person that they already have ARENA-graduate level, but I want to check in case I’m wrong.)

Compact Proofs of Model Performance via Mechanistic Interpretability

LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross

Jun 24, 2024, 7:27 PM

96 points

4 comments, 8 min read, LW link

(arxiv.org)

Adrià Garriga-alonso May 17, 2024, 10:44 PM
4 points
0
on: Language Models Model Us
How did you feed the data into the model and get predictions? Was there a prompt and then you got the model’s answer? Then you got the logits from the API? What was the prompt?

Catastrophic Goodhart in RL with KL penalty

Thomas Kwa and Adrià Garriga-alonso

May 15, 2024, 12:58 AM

62 points

10 comments, 7 min read, LW link

Adrià Garriga-alonso May 9, 2024, 1:49 AM
4 points
0
on: Why I’m doing PauseAI
Thank you for working on this, Joseph!

Adrià Garriga-alonso Apr 19, 2024, 2:19 AM
1 point
0
in reply to: Chakshu Mira’s comment on: Ophiology (or, how the Mamba architecture works)
Thank you! Could you please provide more context? I don’t know what ‘E’ you’re referring to.

An evaluation of circuit evaluation metrics

Iván Arcuschin, Niels uit de Bos and Adrià Garriga-alonso

Apr 15, 2024, 7:38 PM

18 points

0 comments, 4 min read, LW link

Ophiology (or, how the Mamba architecture works)

Danielle Ensign, SrGonao and Adrià Garriga-alonso

Apr 9, 2024, 7:31 PM

67 points

8 comments, 10 min read, LW link

Adrià Garriga-alonso Feb 28, 2024, 6:18 PM
24 points
20
on: Timaeus’s First Four Months
That’s a lot of things done, congratulations!

Adrià Garriga-alonso Feb 6, 2024, 10:55 PM
1 point
0
in reply to: meedstrom’s comment on: Does literacy remove your ability to be a bard as good as Homer?
That’s very cool; maybe I should try to do that for important talks. Though I suppose you almost always have slides as an aid, so it may not be worth the time investment.

Adrià Garriga-alonso Jan 18, 2024, 10:20 PM
1 point
0
in reply to: Bezzi’s comment on: Does literacy remove your ability to be a bard as good as Homer?

> Maybe being a guslar is not so different from telling a joke 2294 lines long

That’s a very good point! I think the level of ability required is different, but it seems right.

The guslar’s songs are (and of course already were in the 1930s to 1950s) also printed, so the analogy may be closer than you thought.
