
TurnTrout

Karma: 20,250

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout · Mar 2, 2025, 7:51 PM
154 points
27 comments · 1 min read · LW link
(turntrout.com)

Steering Gemini with BiDPO

TurnTrout · Jan 31, 2025, 2:37 AM
104 points
5 comments · 1 min read · LW link
(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTrout · Jan 24, 2025, 5:18 AM
26 points
3 comments · 1 min read · LW link
(turntrout.com)

Deceptive Alignment and Homuncularity

Jan 16, 2025, 1:55 PM
25 points
12 comments · 22 min read · LW link

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout · Jan 16, 2025, 2:14 AM
64 points
3 comments · 1 min read · LW link
(turntrout.com)

Review: Breaking Free with Dr. Stone

TurnTrout · Dec 18, 2024, 1:26 AM
47 points
5 comments · 1 min read · LW link
(turntrout.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
165 points
12 comments · 11 min read · LW link
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Dec 3, 2024, 9:19 PM
100 points
7 comments · 41 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout · Nov 19, 2024, 6:36 PM
40 points
5 comments · 1 min read · LW link
(turntrout.com)

Announcing turntrout.com, my new digital home

TurnTrout · Nov 17, 2024, 5:42 PM
107 points
33 comments · 1 min read · LW link
(turntrout.com)

I found >800 orthogonal “write code” steering vectors

Jul 15, 2024, 7:06 PM
102 points
19 comments · 7 min read · LW link
(jacobgw.com)

Mechanistically Eliciting Latent Behaviors in Language Models

Apr 30, 2024, 6:51 PM
208 points
43 comments · 45 min read · LW link

Many arguments for AI x-risk are wrong

TurnTrout · Mar 5, 2024, 2:31 AM
159 points
87 comments · 12 min read · LW link

Dreams of AI alignment: The danger of suggestive names

TurnTrout · Feb 10, 2024, 1:22 AM
103 points
59 comments · 4 min read · LW link

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
125 points
29 comments · 8 min read · LW link
(arxiv.org)

How should TurnTrout handle his DeepMind equity situation?

Oct 16, 2023, 6:25 PM
63 points
36 comments · 6 min read · LW link · 1 review

Paper: Understanding and Controlling a Maze-Solving Policy Network

Oct 13, 2023, 1:38 AM
70 points
0 comments · 1 min read · LW link
(arxiv.org)

AI presidents discuss AI alignment agendas

Sep 9, 2023, 6:55 PM
217 points
23 comments · 1 min read · LW link
(www.youtube.com)

ActAdd: Steering Language Models without Optimization

Sep 6, 2023, 5:21 PM
105 points
3 comments · 2 min read · LW link
(arxiv.org)

Open problems in activation engineering

Jul 24, 2023, 7:46 PM
51 points
2 comments · 1 min read · LW link
(coda.io)