TurnTrout

Karma: 19,839

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Steering Gemini with BiDPO

TurnTrout, 31 Jan 2025 2:37 UTC
83 points
3 comments, 1 min read, LW link
(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTrout, 24 Jan 2025 5:18 UTC
26 points
3 comments, 1 min read, LW link
(turntrout.com)

Deceptive Alignment and Homuncularity

16 Jan 2025 13:55 UTC
25 points
12 comments, 22 min read, LW link

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout, 16 Jan 2025 2:14 UTC
64 points
3 comments, 1 min read, LW link
(turntrout.com)

Review: Breaking Free with Dr. Stone

TurnTrout, 18 Dec 2024 1:26 UTC
47 points
5 comments, 1 min read, LW link
(turntrout.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
161 points
12 comments, 11 min read, LW link
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
95 points
7 comments, 41 min read, LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout, 19 Nov 2024 18:36 UTC
40 points
5 comments, 1 min read, LW link
(turntrout.com)

Announcing turntrout.com, my new digital home

TurnTrout, 17 Nov 2024 17:42 UTC
107 points
24 comments, 1 min read, LW link
(turntrout.com)

I found >800 orthogonal “write code” steering vectors

15 Jul 2024 19:06 UTC
99 points
19 comments, 7 min read, LW link
(jacobgw.com)

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
206 points
43 comments, 45 min read, LW link

Many arguments for AI x-risk are wrong

TurnTrout, 5 Mar 2024 2:31 UTC
158 points
87 comments, 12 min read, LW link

Dreams of AI alignment: The danger of suggestive names

TurnTrout, 10 Feb 2024 1:22 UTC
103 points
59 comments, 4 min read, LW link

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
124 points
29 comments, 8 min read, LW link
(arxiv.org)

How should TurnTrout handle his DeepMind equity situation?

16 Oct 2023 18:25 UTC
63 points
35 comments, 6 min read, LW link, 1 review

Paper: Understanding and Controlling a Maze-Solving Policy Network

13 Oct 2023 1:38 UTC
70 points
0 comments, 1 min read, LW link
(arxiv.org)

AI presidents discuss AI alignment agendas

9 Sep 2023 18:55 UTC
217 points
23 comments, 1 min read, LW link
(www.youtube.com)

ActAdd: Steering Language Models without Optimization

6 Sep 2023 17:21 UTC
105 points
3 comments, 2 min read, LW link
(arxiv.org)

Open problems in activation engineering

24 Jul 2023 19:46 UTC
51 points
2 comments, 1 min read, LW link
(coda.io)

Ban development of unpredictable powerful models?

TurnTrout, 20 Jun 2023 1:43 UTC
46 points
25 comments, 4 min read, LW link