TurnTrout

Karma: 19,690

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Deceptive Alignment and Homuncularity

Jan 16, 2025, 1:55 PM
25 points
12 comments
22 min read
LW link

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout
Jan 16, 2025, 2:14 AM
64 points
3 comments
1 min read
LW link
(turntrout.com)

Review: Breaking Free with Dr. Stone

TurnTrout
Dec 18, 2024, 1:26 AM
47 points
5 comments
1 min read
LW link
(turntrout.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
157 points
12 comments
11 min read
LW link
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Dec 3, 2024, 9:19 PM
95 points
7 comments
41 min read
LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout
Nov 19, 2024, 6:36 PM
40 points
5 comments
1 min read
LW link
(turntrout.com)

Announcing turntrout.com, my new digital home

TurnTrout
Nov 17, 2024, 5:42 PM
107 points
24 comments
1 min read
LW link
(turntrout.com)

I found >800 orthogonal “write code” steering vectors

Jul 15, 2024, 7:06 PM
99 points
19 comments
7 min read
LW link
(jacobgw.com)

Mechanistically Eliciting Latent Behaviors in Language Models

Apr 30, 2024, 6:51 PM
206 points
43 comments
45 min read
LW link

Many arguments for AI x-risk are wrong

TurnTrout
Mar 5, 2024, 2:31 AM
167 points
86 comments
12 min read
LW link

Dreams of AI alignment: The danger of suggestive names

TurnTrout
Feb 10, 2024, 1:22 AM
103 points
59 comments
4 min read
LW link

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
124 points
29 comments
8 min read
LW link
(arxiv.org)

How should TurnTrout handle his DeepMind equity situation?

Oct 16, 2023, 6:25 PM
63 points
35 comments
6 min read
LW link
1 review

Paper: Understanding and Controlling a Maze-Solving Policy Network

Oct 13, 2023, 1:38 AM
70 points
0 comments
1 min read
LW link
(arxiv.org)

AI presidents discuss AI alignment agendas

Sep 9, 2023, 6:55 PM
217 points
23 comments
1 min read
LW link
(www.youtube.com)

ActAdd: Steering Language Models without Optimization

Sep 6, 2023, 5:21 PM
105 points
3 comments
2 min read
LW link
(arxiv.org)

Open problems in activation engineering

Jul 24, 2023, 7:46 PM
51 points
2 comments
1 min read
LW link
(coda.io)

Ban development of unpredictable powerful models?

TurnTrout
Jun 20, 2023, 1:43 AM
46 points
25 comments
4 min read
LW link

Mode collapse in RL may be fueled by the update equation

Jun 19, 2023, 9:51 PM
49 points
10 comments
8 min read
LW link

Think carefully before calling RL policies “agents”

TurnTrout
Jun 2, 2023, 3:46 AM
133 points
38 comments
4 min read
LW link
1 review