TurnTrout

Karma: 20,288

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Ban development of unpredictable powerful models?

TurnTrout · Jun 20, 2023, 1:43 AM
46 points
25 comments · 4 min read · LW link

Mode collapse in RL may be fueled by the update equation

Jun 19, 2023, 9:51 PM
53 points
10 comments · 8 min read · LW link

Think carefully before calling RL policies “agents”

TurnTrout · Jun 2, 2023, 3:46 AM
134 points
38 comments · 4 min read · LW link · 1 review

Steering GPT-2-XL by adding an activation vector

May 13, 2023, 6:42 PM
437 points
98 comments · 50 min read · LW link · 1 review

Residual stream norms grow exponentially over the forward pass

May 7, 2023, 12:46 AM
77 points
24 comments · 11 min read · LW link

Behavioural statistics for a maze-solving agent

Apr 20, 2023, 10:26 PM
46 points
11 comments · 10 min read · LW link

[April Fools’] Definitive confirmation of shard theory

TurnTrout · Apr 1, 2023, 7:27 AM
170 points
8 comments · 2 min read · LW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Mar 31, 2023, 7:20 PM
101 points
17 comments · 11 min read · LW link

Understanding and controlling a maze-solving policy network

Mar 11, 2023, 6:59 PM
333 points
28 comments · 23 min read · LW link

Predictions for shard theory mechanistic interpretability results

Mar 1, 2023, 5:16 AM
105 points
10 comments · 5 min read · LW link

Parametrically retargetable decision-makers tend to seek power

TurnTrout · Feb 18, 2023, 6:41 PM
172 points
10 comments · 2 min read · LW link
(arxiv.org)

Some of my disagreements with List of Lethalities

TurnTrout · Jan 24, 2023, 12:25 AM
70 points
7 comments · 10 min read · LW link

Positive values seem more robust and lasting than prohibitions

TurnTrout · Dec 17, 2022, 9:43 PM
52 points
13 comments · 2 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · Dec 2, 2022, 2:43 AM
149 points
22 comments · 47 min read · LW link · 3 reviews

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout · Nov 29, 2022, 6:23 AM
62 points
41 comments · 15 min read · LW link

Don’t align agents to evaluations of plans

TurnTrout · Nov 26, 2022, 9:16 PM
48 points
49 comments · 18 min read · LW link

Don’t design agents which exploit adversarial inputs

Nov 18, 2022, 1:48 AM
72 points
64 comments · 12 min read · LW link

People care about each other even though they have imperfect motivational pointers?

TurnTrout · Nov 8, 2022, 6:15 PM
33 points
25 comments · 7 min read · LW link

A shot at the diamond-alignment problem

TurnTrout · Oct 6, 2022, 6:29 PM
95 points
67 comments · 15 min read · LW link

Four usages of “loss” in AI

TurnTrout · Oct 2, 2022, 12:52 AM
46 points
18 comments · 4 min read · LW link