Rohin Shah
Karma: 14,333
Research Scientist at DeepMind. Creator of the Alignment Newsletter.
http://rohinshah.com/
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda
25 Apr 2024 18:43 UTC, 62 points, 35 comments, 1 min read, LW link (arxiv.org)
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Neel Nanda, János Kramár, Tom Lieberum and Rohin Shah
18 Mar 2024 17:28 UTC, 19 points, 0 comments, 1 min read, LW link (arxiv.org)
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)
Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah
23 Dec 2023 2:46 UTC, 18 points, 0 comments, 4 min read, LW link
Fact Finding: How to Think About Interpreting Memorisation (Post 4)
Senthooran Rajamanoharan, Neel Nanda, János Kramár and Rohin Shah
23 Dec 2023 2:46 UTC, 22 points, 0 comments, 9 min read, LW link
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3)
Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah
23 Dec 2023 2:46 UTC, 9 points, 0 comments, 16 min read, LW link
Fact Finding: Simplifying the Circuit (Post 2)
Senthooran Rajamanoharan, Neel Nanda, János Kramár and Rohin Shah
23 Dec 2023 2:45 UTC, 18 points, 3 comments, 14 min read, LW link
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah
23 Dec 2023 2:44 UTC, 106 points, 6 comments, 22 min read, LW link
Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah
18 Dec 2023 11:58 UTC, 147 points, 21 comments, 10 min read, LW link
Explaining grokking through circuit efficiency
Vikrant Varma and Rohin Shah
8 Sep 2023 14:39 UTC, 98 points, 10 comments, 3 min read, LW link (arxiv.org)
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik
20 Jul 2023 10:50 UTC, 44 points, 3 comments, 2 min read, LW link (arxiv.org)
Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes
OliviaJ, Rohin Shah, Connor Leahy and Andrea_Miotti
1 May 2023 16:47 UTC, 96 points, 10 comments, 30 min read, LW link
[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy
Vika and Rohin Shah
7 Mar 2023 11:55 UTC, 128 points, 13 comments, 5 min read, LW link (drive.google.com)
Categorizing failures as “outer” or “inner” misalignment is often confused
Rohin Shah
6 Jan 2023 15:48 UTC, 86 points, 21 comments, 8 min read, LW link
Definitions of “objective” should be Probable and Predictive
Rohin Shah
6 Jan 2023 15:40 UTC, 43 points, 27 comments, 12 min read, LW link
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
Vika, Vikrant Varma, Ramana Kumar and Rohin Shah
25 Nov 2022 14:36 UTC, 39 points, 9 comments, 6 min read, LW link (vkrakovna.wordpress.com)
Threat Model Literature Review
zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt
1 Nov 2022 11:03 UTC, 75 points, 4 comments, 25 min read, LW link
Clarifying AI X-risk
zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt
1 Nov 2022 11:03 UTC, 127 points, 24 comments, 4 min read, LW link, 1 review
More examples of goal misgeneralization
Rohin Shah and Vikrant Varma
7 Oct 2022 14:38 UTC, 53 points, 8 comments, 2 min read, LW link (deepmindsafetyresearch.medium.com)
[AN #173] Recent language model results from DeepMind
Rohin Shah
21 Jul 2022 2:30 UTC, 37 points, 9 comments, 8 min read, LW link (mailchi.mp)
[AN #172] Sorry for the long hiatus!
Rohin Shah
5 Jul 2022 6:20 UTC, 54 points, 0 comments, 3 min read, LW link (mailchi.mp)