TurnTrout

Karma: 20,296

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Ban development of unpredictable powerful models?

TurnTrout · 20 Jun 2023 1:43 UTC
46 points
25 comments · 4 min read · LW link

Mode collapse in RL may be fueled by the update equation

19 Jun 2023 21:51 UTC
53 points
10 comments · 8 min read · LW link

Think carefully before calling RL policies “agents”

TurnTrout · 2 Jun 2023 3:46 UTC
135 points
38 comments · 4 min read · LW link · 1 review

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
437 points
98 comments · 50 min read · LW link · 1 review

Residual stream norms grow exponentially over the forward pass

7 May 2023 0:46 UTC
77 points
24 comments · 11 min read · LW link

Behavioural statistics for a maze-solving agent

20 Apr 2023 22:26 UTC
46 points
11 comments · 10 min read · LW link

[April Fools’] Definitive confirmation of shard theory

TurnTrout · 1 Apr 2023 7:27 UTC
170 points
8 comments · 2 min read · LW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments · 11 min read · LW link

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
334 points
28 comments · 23 min read · LW link

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
105 points
10 comments · 5 min read · LW link

Parametrically retargetable decision-makers tend to seek power

TurnTrout · 18 Feb 2023 18:41 UTC
172 points
10 comments · 2 min read · LW link
(arxiv.org)

Some of my disagreements with List of Lethalities

TurnTrout · 24 Jan 2023 0:25 UTC
70 points
7 comments · 10 min read · LW link

Positive values seem more robust and lasting than prohibitions

TurnTrout · 17 Dec 2022 21:43 UTC
52 points
13 comments · 2 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · 2 Dec 2022 2:43 UTC
149 points
22 comments · 47 min read · LW link · 3 reviews

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout · 29 Nov 2022 6:23 UTC
62 points
41 comments · 15 min read · LW link

Don’t align agents to evaluations of plans

TurnTrout · 26 Nov 2022 21:16 UTC
48 points
49 comments · 18 min read · LW link

Don’t design agents which exploit adversarial inputs

18 Nov 2022 1:48 UTC
72 points
64 comments · 12 min read · LW link

People care about each other even though they have imperfect motivational pointers?

TurnTrout · 8 Nov 2022 18:15 UTC
33 points
25 comments · 7 min read · LW link

A shot at the diamond-alignment problem

TurnTrout · 6 Oct 2022 18:29 UTC
95 points
67 comments · 15 min read · LW link

Four usages of “loss” in AI

TurnTrout · 2 Oct 2022 0:52 UTC
46 points
18 comments · 4 min read · LW link