Rohin Shah

Karma: 15,408

Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

AGI Safety & Alignment @ Google DeepMind is hiring

Rohin Shah · Feb 17, 2025, 9:11 PM
102 points
17 comments · 10 min read · LW link

A short course on AGI safety from the GDM Alignment team

Feb 14, 2025, 3:43 PM
97 points
1 comment · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

MONA: Managed Myopia with Approval Feedback

Jan 23, 2025, 12:24 PM
76 points
29 comments · 9 min read · LW link

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Aug 20, 2024, 4:22 PM
222 points
33 comments · 9 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

Jul 8, 2024, 8:59 AM
49 points
18 comments · 7 min read · LW link
(arxiv.org)

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 25, 2024, 6:43 PM
63 points
38 comments · 1 min read · LW link
(arxiv.org)

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Mar 18, 2024, 5:28 PM
19 points
0 comments · 1 min read · LW link
(arxiv.org)

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

Dec 23, 2023, 2:46 AM
18 points
0 comments · 4 min read · LW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

Dec 23, 2023, 2:46 AM
22 points
0 comments · 9 min read · LW link

Fact Finding: Trying to Mechanistically Understand Early MLPs (Post 3)

Dec 23, 2023, 2:46 AM
10 points
0 comments · 16 min read · LW link

Fact Finding: Simplifying the Circuit (Post 2)

Dec 23, 2023, 2:45 AM
25 points
3 comments · 14 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Dec 23, 2023, 2:44 AM
106 points
10 comments · 22 min read · LW link · 2 reviews

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Dec 18, 2023, 11:58 AM
147 points
21 comments · 10 min read · LW link

Explaining grokking through circuit efficiency

Sep 8, 2023, 2:39 PM
101 points
11 comments · 3 min read · LW link
(arxiv.org)

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Jul 20, 2023, 10:50 AM
44 points
3 comments · 2 min read · LW link
(arxiv.org)

Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

May 1, 2023, 4:47 PM
96 points
10 comments · 30 min read · LW link

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

Mar 7, 2023, 11:55 AM
128 points
13 comments · 5 min read · LW link
(drive.google.com)

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah · Jan 6, 2023, 3:48 PM
93 points
21 comments · 8 min read · LW link

Definitions of “objective” should be Probable and Predictive

Rohin Shah · Jan 6, 2023, 3:40 PM
43 points
27 comments · 12 min read · LW link

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Nov 25, 2022, 2:36 PM
39 points
9 comments · 6 min read · LW link
(vkrakovna.wordpress.com)