
TurnTrout

Karma: 20,250

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout · Mar 2, 2025, 7:51 PM
154 points
27 comments · 1 min read · LW link
(turntrout.com)

Steering Gemini with BiDPO

TurnTrout · Jan 31, 2025, 2:37 AM
104 points
5 comments · 1 min read · LW link
(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTrout · Jan 24, 2025, 5:18 AM
26 points
3 comments · 1 min read · LW link
(turntrout.com)

Deceptive Alignment and Homuncularity

Jan 16, 2025, 1:55 PM
25 points
12 comments · 22 min read · LW link

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout · Jan 16, 2025, 2:14 AM
64 points
3 comments · 1 min read · LW link
(turntrout.com)

Review: Breaking Free with Dr. Stone

TurnTrout · Dec 18, 2024, 1:26 AM
47 points
5 comments · 1 min read · LW link
(turntrout.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
165 points
12 comments · 11 min read · LW link
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Dec 3, 2024, 9:19 PM
100 points
7 comments · 41 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout · Nov 19, 2024, 6:36 PM
40 points
5 comments · 1 min read · LW link
(turntrout.com)

Announcing turntrout.com, my new digital home

TurnTrout · Nov 17, 2024, 5:42 PM
107 points
33 comments · 1 min read · LW link
(turntrout.com)

I found >800 orthogonal “write code” steering vectors

Jul 15, 2024, 7:06 PM
102 points
19 comments · 7 min read · LW link
(jacobgw.com)

Mechanistically Eliciting Latent Behaviors in Language Models

Apr 30, 2024, 6:51 PM
208 points
43 comments · 45 min read · LW link

Many arguments for AI x-risk are wrong

TurnTrout · Mar 5, 2024, 2:31 AM
159 points
87 comments · 12 min read · LW link

Dreams of AI alignment: The danger of suggestive names

TurnTrout · Feb 10, 2024, 1:22 AM
103 points
59 comments · 4 min read · LW link

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
125 points
29 comments · 8 min read · LW link
(arxiv.org)

How should TurnTrout handle his DeepMind equity situation?

Oct 16, 2023, 6:25 PM
63 points
36 comments · 6 min read · LW link · 1 review

Paper: Understanding and Controlling a Maze-Solving Policy Network

Oct 13, 2023, 1:38 AM
70 points
0 comments · 1 min read · LW link
(arxiv.org)

AI presidents discuss AI alignment agendas

Sep 9, 2023, 6:55 PM
217 points
23 comments · 1 min read · LW link
(www.youtube.com)

ActAdd: Steering Language Models without Optimization

Sep 6, 2023, 5:21 PM
105 points
3 comments · 2 min read · LW link
(arxiv.org)

Open problems in activation engineering

Jul 24, 2023, 7:46 PM
51 points
2 comments · 1 min read · LW link
(coda.io)