Short Notes on Research Process

Shoshannah Tekofsky · 22 Feb 2023 23:41 UTC
21 points
0 comments · 2 min read · LW link

Video/animation: Neel Nanda explains what mechanistic interpretability is

DanielFilan · 22 Feb 2023 22:42 UTC
24 points
7 comments · 1 min read · LW link
(youtu.be)

A Telepathic Exam about AI and Consequentialism

alkexr · 22 Feb 2023 21:00 UTC
4 points
4 comments · 4 min read · LW link

[Question] Injecting noise to GPT to get multiple answers

bipolo · 22 Feb 2023 20:02 UTC
1 point
1 comment · 1 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

Building and Entertaining Couples

Jacob Falkovich · 22 Feb 2023 19:02 UTC
85 points
11 comments · 4 min read · LW link

Can submarines swim?

jasoncrawford · 22 Feb 2023 18:48 UTC
18 points
14 comments · 13 min read · LW link
(rootsofprogress.org)

Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities?

Christopher King · 22 Feb 2023 16:49 UTC
1 point
7 comments · 1 min read · LW link

The male AI alignment solution

TekhneMakre · 22 Feb 2023 16:34 UTC
−25 points
24 comments · 1 min read · LW link

Progress links and tweets, 2023-02-22

jasoncrawford · 22 Feb 2023 16:23 UTC
13 points
0 comments · 1 min read · LW link
(rootsofprogress.org)

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
108 points
9 comments · 6 min read · LW link

The Open Agency Model

Eric Drexler · 22 Feb 2023 10:35 UTC
114 points
18 comments · 4 min read · LW link

Intervening in the Residual Stream

MadHatter · 22 Feb 2023 6:29 UTC
30 points
1 comment · 9 min read · LW link

What do language models know about fictional characters?

skybrian · 22 Feb 2023 5:58 UTC
6 points
0 comments · 4 min read · LW link

Power-Seeking = Minimising free energy

Jonas Hallgren · 22 Feb 2023 4:28 UTC
21 points
10 comments · 7 min read · LW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland · 22 Feb 2023 4:16 UTC
34 points
11 comments · 3 min read · LW link
(www.jessehoogland.com)

Candyland is Terrible

jefftk · 22 Feb 2023 1:50 UTC
16 points
2 comments · 1 min read · LW link
(www.jefftk.com)

A proof of inner Löb’s theorem

James Payor · 21 Feb 2023 21:11 UTC
13 points
0 comments · 2 min read · LW link

Fighting For Our Lives—What Ordinary People Can Do

TinkerBird · 21 Feb 2023 20:36 UTC
12 points
18 comments · 4 min read · LW link

The Emotional Type of a Decision

moridinamael · 21 Feb 2023 20:35 UTC
13 points
0 comments · 4 min read · LW link

What is it like doing AI safety work?

KatWoods · 21 Feb 2023 20:12 UTC
57 points
2 comments · 1 min read · LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
134 points
19 comments · 11 min read · LW link

A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation)

Joe Carlsmith · 21 Feb 2023 17:26 UTC
38 points
16 comments · 1 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
4 comments · 3 min read · LW link

No Room for Political Philosophy

Arturo Macias · 21 Feb 2023 16:11 UTC
0 points
7 comments · 3 min read · LW link

Deceptive Alignment is <1% Likely by Default

DavidW · 21 Feb 2023 15:09 UTC
90 points
29 comments · 14 min read · LW link

AI #1: Sydney and Bing

Zvi · 21 Feb 2023 14:00 UTC
171 points
45 comments · 61 min read · LW link · 1 review
(thezvi.wordpress.com)

You’re not a simulation, ’cause you’re hallucinating

Stuart_Armstrong · 21 Feb 2023 12:12 UTC
25 points
6 comments · 1 min read · LW link

Basic facts about language models during training

beren · 21 Feb 2023 11:46 UTC
97 points
15 comments · 18 min read · LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio · 21 Feb 2023 11:44 UTC
12 points
0 comments · 1 min read · LW link
(arxiv.org)

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley · 21 Feb 2023 9:05 UTC
10 points
1 comment · 23 min read · LW link

Medlife Crisis: “Why Do People Keep Falling For Things That Don’t Work?”

RomanHauksson · 21 Feb 2023 6:22 UTC
12 points
5 comments · 1 min read · LW link
(www.youtube.com)

A foundation model approach to value inference

sen · 21 Feb 2023 5:09 UTC
6 points
0 comments · 3 min read · LW link

Instrumentality makes agents agenty

porby · 21 Feb 2023 4:28 UTC
20 points
4 comments · 6 min read · LW link

Gamified narrow reverse imitation learning

TekhneMakre · 21 Feb 2023 4:26 UTC
8 points
0 comments · 2 min read · LW link

Feelings are Good, Actually

Gordon Seidoh Worley · 21 Feb 2023 2:38 UTC
18 points
1 comment · 4 min read · LW link

AI alignment researchers don’t (seem to) stack

So8res · 21 Feb 2023 0:48 UTC
191 points
40 comments · 3 min read · LW link

EA & LW Forum Weekly Summary (6th − 19th Feb 2023)

Zoe Williams · 21 Feb 2023 0:26 UTC
8 points
0 comments · 1 min read · LW link

What to think when a language model tells you it’s sentient

Robbo · 21 Feb 2023 0:01 UTC
9 points
6 comments · 6 min read · LW link

On second thought, prompt injections are probably examples of misalignment

lc · 20 Feb 2023 23:56 UTC
22 points
5 comments · 1 min read · LW link

Nothing Is Ever Taught Correctly

LVSN · 20 Feb 2023 22:31 UTC
5 points
3 comments · 1 min read · LW link

Behavioral and mechanistic definitions (often confuse AI alignment discussions)

LawrenceC · 20 Feb 2023 21:33 UTC
33 points
5 comments · 6 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · 20 Feb 2023 21:32 UTC
14 points
1 comment · 4 min read · LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
145 points
127 comments · 19 min read · LW link · 1 review

[Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience?

Malleable_shape · 20 Feb 2023 19:57 UTC
7 points
2 comments · 1 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
96 points
8 comments · 21 min read · LW link

Sydney the Bingenator Can’t Think, But It Still Threatens People

Valentin Baltadzhiev · 20 Feb 2023 18:37 UTC
−3 points
2 comments · 8 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
8 comments · 8 min read · LW link

What AI companies can do today to help with the most important century

HoldenKarnofsky · 20 Feb 2023 17:00 UTC
38 points
3 comments · 9 min read · LW link
(www.cold-takes.com)

Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky

bayesed · 20 Feb 2023 16:42 UTC
83 points
54 comments · 1 min read · LW link
(www.youtube.com)