Rohin Shah

Karma: 14,333

Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

62 points

35 comments · 1 min read · LW link

(arxiv.org)

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Neel Nanda, János Kramár, Tom Lieberum and Rohin Shah

18 Mar 2024 17:28 UTC

19 points

0 comments · 1 min read · LW link

(arxiv.org)

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:46 UTC

18 points

0 comments · 4 min read · LW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

Senthooran Rajamanoharan, Neel Nanda, János Kramár and Rohin Shah

23 Dec 2023 2:46 UTC

22 points

0 comments · 9 min read · LW link

Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:46 UTC

9 points

0 comments · 16 min read · LW link

Fact Finding: Simplifying the Circuit (Post 2)

Senthooran Rajamanoharan, Neel Nanda, János Kramár and Rohin Shah

23 Dec 2023 2:45 UTC

18 points

3 comments · 14 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:44 UTC

106 points

6 comments · 22 min read · LW link

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

18 Dec 2023 11:58 UTC

147 points

21 comments · 10 min read · LW link

Explaining grokking through circuit efficiency

Vikrant Varma and Rohin Shah

8 Sep 2023 14:39 UTC

98 points

10 comments · 3 min read · LW link

(arxiv.org)

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik

20 Jul 2023 10:50 UTC

44 points

3 comments · 2 min read · LW link

(arxiv.org)

Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

OliviaJ, Rohin Shah, Connor Leahy and Andrea_Miotti

1 May 2023 16:47 UTC

96 points

10 comments · 30 min read · LW link

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

Vika and Rohin Shah

7 Mar 2023 11:55 UTC

128 points

13 comments · 5 min read · LW link

(drive.google.com)

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah

6 Jan 2023 15:48 UTC

86 points

21 comments · 8 min read · LW link

Definitions of “objective” should be Probable and Predictive

Rohin Shah

6 Jan 2023 15:40 UTC

43 points

27 comments · 12 min read · LW link

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Vika, Vikrant Varma, Ramana Kumar and Rohin Shah

25 Nov 2022 14:36 UTC

39 points

9 comments · 6 min read · LW link

(vkrakovna.wordpress.com)

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

75 points

4 comments · 25 min read · LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments · 4 min read · LW link · 1 review

More examples of goal misgeneralization

Rohin Shah and Vikrant Varma

7 Oct 2022 14:38 UTC

53 points

8 comments · 2 min read · LW link

(deepmindsafetyresearch.medium.com)

[AN #173] Recent language model results from DeepMind

Rohin Shah

21 Jul 2022 2:30 UTC

37 points

9 comments · 8 min read · LW link

(mailchi.mp)

[AN #172] Sorry for the long hiatus!

Rohin Shah

5 Jul 2022 6:20 UTC

54 points

0 comments · 3 min read · LW link

(mailchi.mp)