TurnTrout

Karma: 20,288

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Ban development of unpredictable powerful models?

TurnTrout · Jun 20, 2023, 1:43 AM
46 points
25 comments · 4 min read · LW link

Mode collapse in RL may be fueled by the update equation

Jun 19, 2023, 9:51 PM
53 points
10 comments · 8 min read · LW link

Think carefully before calling RL policies “agents”

TurnTrout · Jun 2, 2023, 3:46 AM
134 points
38 comments · 4 min read · LW link · 1 review

Steering GPT-2-XL by adding an activation vector

May 13, 2023, 6:42 PM
437 points
98 comments · 50 min read · LW link · 1 review

Residual stream norms grow exponentially over the forward pass

May 7, 2023, 12:46 AM
77 points
24 comments · 11 min read · LW link

Behavioural statistics for a maze-solving agent

Apr 20, 2023, 10:26 PM
46 points
11 comments · 10 min read · LW link

[April Fools’] Definitive confirmation of shard theory

TurnTrout · Apr 1, 2023, 7:27 AM
170 points
8 comments · 2 min read · LW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Mar 31, 2023, 7:20 PM
101 points
17 comments · 11 min read · LW link

Understanding and controlling a maze-solving policy network

Mar 11, 2023, 6:59 PM
333 points
28 comments · 23 min read · LW link

Predictions for shard theory mechanistic interpretability results

Mar 1, 2023, 5:16 AM
105 points
10 comments · 5 min read · LW link

Parametrically retargetable decision-makers tend to seek power

TurnTrout · Feb 18, 2023, 6:41 PM
172 points
10 comments · 2 min read · LW link
(arxiv.org)

Some of my disagreements with List of Lethalities

TurnTrout · Jan 24, 2023, 12:25 AM
70 points
7 comments · 10 min read · LW link

Positive values seem more robust and lasting than prohibitions

TurnTrout · Dec 17, 2022, 9:43 PM
52 points
13 comments · 2 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · Dec 2, 2022, 2:43 AM
149 points
22 comments · 47 min read · LW link · 3 reviews

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout · Nov 29, 2022, 6:23 AM
62 points
41 comments · 15 min read · LW link

Don’t align agents to evaluations of plans

TurnTrout · Nov 26, 2022, 9:16 PM
48 points
49 comments · 18 min read · LW link

Don’t design agents which exploit adversarial inputs

Nov 18, 2022, 1:48 AM
72 points
64 comments · 12 min read · LW link

People care about each other even though they have imperfect motivational pointers?

TurnTrout · Nov 8, 2022, 6:15 PM
33 points
25 comments · 7 min read · LW link

A shot at the diamond-alignment problem

TurnTrout · Oct 6, 2022, 6:29 PM
95 points
67 comments · 15 min read · LW link

Four usages of “loss” in AI

TurnTrout · Oct 2, 2022, 12:52 AM
46 points
18 comments · 4 min read · LW link