Let’s think about slowing down AI

KatjaGrace · Dec 22, 2022, 5:40 PM
551 points
182 comments · 38 min read · LW link · 3 reviews
(aiimpacts.org)

Staring into the abyss as a core life skill

benkuhn · Dec 22, 2022, 3:30 PM
354 points
22 comments · 12 min read · LW link · 1 review
(www.benkuhn.net)

Models Don’t “Get Reward”

Sam Ringer · Dec 30, 2022, 10:37 AM
313 points
61 comments · 5 min read · LW link · 1 review

A challenge for AGI organizations, and a challenge for readers

Dec 1, 2022, 11:11 PM
302 points
33 comments · 2 min read · LW link

Sazen

Duncan Sabien (Deactivated) · Dec 21, 2022, 7:54 AM
281 points
83 comments · 12 min read · LW link · 2 reviews

AI alignment is distinct from its near-term applications

paulfchristiano · Dec 13, 2022, 7:10 AM
255 points
21 comments · 2 min read · LW link
(ai-alignment.com)

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin · Dec 15, 2022, 6:22 PM
244 points
39 comments · 16 min read · LW link · 1 review

Jailbreaking ChatGPT on Release Day

Zvi · Dec 2, 2022, 1:10 PM
242 points
77 comments · 6 min read · LW link · 1 review
(thezvi.wordpress.com)

The Plan - 2022 Update

johnswentworth · Dec 1, 2022, 8:43 PM
239 points
37 comments · 8 min read · LW link · 1 review

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Dec 3, 2022, 12:58 AM
206 points
35 comments · 20 min read · LW link · 1 review

What AI Safety Materials Do ML Researchers Find Compelling?

Dec 28, 2022, 2:03 AM
175 points
34 comments · 2 min read · LW link

The next decades might be wild

Marius Hobbhahn · Dec 15, 2022, 4:10 PM
175 points
42 comments · 41 min read · LW link · 1 review

Finite Factored Sets in Pictures

Magdalena Wache · Dec 11, 2022, 6:49 PM
174 points
35 comments · 12 min read · LW link

Using GPT-Eliezer against ChatGPT Jailbreaking

Dec 6, 2022, 7:54 PM
170 points
85 comments · 9 min read · LW link

Things that can kill you quickly: What everyone should know about first aid

jasoncrawford · Dec 27, 2022, 4:23 PM
166 points
21 comments · 2 min read · LW link · 1 review
(jasoncrawford.org)

Logical induction for software engineers

Alex Flint · Dec 3, 2022, 7:55 PM
163 points
8 comments · 27 min read · LW link · 1 review

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceC · Dec 19, 2022, 10:52 PM
150 points
30 comments · 18 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

Dec 13, 2022, 3:41 PM
150 points
23 comments · 22 min read · LW link · 2 reviews

A Year of AI Increasing AI Progress

TW123 · Dec 30, 2022, 2:09 AM
148 points
3 comments · 2 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · Dec 2, 2022, 2:43 AM
148 points
22 comments · 47 min read · LW link · 3 reviews

K-complexity is silly; use cross-entropy instead

So8res · Dec 20, 2022, 11:06 PM
147 points
54 comments · 14 min read · LW link · 2 reviews

Updating my AI timelines

Matthew Barnett · Dec 5, 2022, 8:46 PM
145 points
50 comments · 2 min read · LW link

[Question] How to Convince my Son that Drugs are Bad

concerned_dad · Dec 17, 2022, 6:47 PM
140 points
84 comments · 2 min read · LW link

Deconfusing Direct vs Amortised Optimization

beren · Dec 2, 2022, 11:30 AM
134 points
19 comments · 10 min read · LW link

Re-Examining LayerNorm

Eric Winsor · Dec 1, 2022, 10:20 PM
127 points
12 comments · 5 min read · LW link

Shared reality: a key driver of human behavior

kdbscott · Dec 24, 2022, 7:35 PM
126 points
25 comments · 4 min read · LW link

The case against AI alignment

andrew sauer · Dec 24, 2022, 6:57 AM
126 points
110 comments · 5 min read · LW link

Did ChatGPT just gaslight me?

TW123 · Dec 1, 2022, 5:41 AM
123 points
45 comments · 9 min read · LW link
(aiwatchtower.substack.com)

[Question] Why The Focus on Expected Utility Maximisers?

DragonGod · Dec 27, 2022, 3:49 PM
118 points
84 comments · 3 min read · LW link

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda · Dec 28, 2022, 9:06 PM
106 points
0 comments · 10 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck · Dec 14, 2022, 4:03 AM
106 points
47 comments · 7 min read · LW link · 1 review

Language models are nearly AGIs but we don’t notice it because we keep shifting the bar

philosophybear · Dec 30, 2022, 5:15 AM
105 points
13 comments · 7 min read · LW link

Finding gliders in the game of life

paulfchristiano · Dec 1, 2022, 8:40 PM
104 points
8 comments · 16 min read · LW link
(ai-alignment.com)

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs · Dec 30, 2022, 2:40 AM
104 points
2 comments · 18 min read · LW link

Slightly against aligning with neo-luddites

Matthew Barnett · Dec 26, 2022, 10:46 PM
104 points
31 comments · 4 min read · LW link

Applied Linear Algebra Lecture Series

johnswentworth · Dec 22, 2022, 6:57 AM
103 points
8 comments · 1 min read · LW link

[Linkpost] The Story Of VaccinateCA

hath · Dec 9, 2022, 11:54 PM
103 points
4 comments · 10 min read · LW link
(www.worksinprogress.co)

Thoughts on AGI organizations and capabilities work

Dec 7, 2022, 7:46 PM
102 points
17 comments · 5 min read · LW link

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 20, 2022, 8:08 PM
100 points
34 comments · 1 min read · LW link
(www.anthropic.com)

Bad at Arithmetic, Promising at Math

cohenmacaulay · Dec 18, 2022, 5:40 AM
100 points
19 comments · 20 min read · LW link · 1 review

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · Dec 5, 2022, 10:51 PM
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

You can still fetch the coffee today if you’re dead tomorrow

davidad · Dec 9, 2022, 2:06 PM
96 points
19 comments · 5 min read · LW link

The LessWrong 2021 Review: Intellectual Circle Expansion

Dec 1, 2022, 9:17 PM
95 points
55 comments · 8 min read · LW link

Towards Hodge-podge Alignment

Cleo Nardo · Dec 19, 2022, 8:12 PM
95 points
30 comments · 9 min read · LW link

Revisiting algorithmic progress

Dec 13, 2022, 1:39 AM
95 points
15 comments · 2 min read · LW link · 1 review
(arxiv.org)

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda · Dec 21, 2022, 12:35 PM
91 points
6 comments · 2 min read · LW link
(neelnanda.io)

Setting the Zero Point

Duncan Sabien (Deactivated) · Dec 9, 2022, 6:06 AM
91 points
43 comments · 20 min read · LW link · 1 review

Can we efficiently distinguish different mechanisms?

paulfchristiano · Dec 27, 2022, 12:20 AM
91 points
30 comments · 16 min read · LW link
(ai-alignment.com)

Local Memes Against Geometric Rationality

Scott Garrabrant · Dec 21, 2022, 3:53 AM
90 points
3 comments · 6 min read · LW link

Consider using reversible automata for alignment research

Alex_Altair · Dec 11, 2022, 1:00 AM
88 points
30 comments · 2 min read · LW link