TurnTrout

Karma: 20,250

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Deceptive Alignment and Homuncularity

Jan 16, 2025, 1:55 PM
25 points
12 comments · 22 min read · LW link

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout · Jan 16, 2025, 2:14 AM
64 points
3 comments · 1 min read · LW link
(turntrout.com)

Review: Breaking Free with Dr. Stone

TurnTrout · Dec 18, 2024, 1:26 AM
47 points
5 comments · 1 min read · LW link
(turntrout.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
165 points
12 comments · 11 min read · LW link
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Dec 3, 2024, 9:19 PM
100 points
7 comments · 41 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout · Nov 19, 2024, 6:36 PM
40 points
5 comments · 1 min read · LW link
(turntrout.com)

Announcing turntrout.com, my new digital home

TurnTrout · Nov 17, 2024, 5:42 PM
107 points
33 comments · 1 min read · LW link
(turntrout.com)