
Distillation & Pedagogy

Last edit: 21 Aug 2020 22:31 UTC by Raemon

Distillation is the process of taking a complex subject and making it easier to understand. Pedagogy is the method and practice of teaching. A good intellectual pipeline requires not just discovering new ideas, but also making it easier for newcomers to learn them, stand on the shoulders of giants, and discover even more ideas.

Chris Olah, founder of distill.pub, writes in his essay Research Debt:

Programmers talk about technical debt: there are ways to write software that are faster in the short run but problematic in the long run. Managers talk about institutional debt: institutions can grow quickly at the cost of bad practices creeping in. Both are easy to accumulate but hard to get rid of.

Research can also have debt. It comes in several forms:

  • Poor Exposition – Often, there is no good explanation of important ideas and one has to struggle to understand them. This problem is so pervasive that we take it for granted and don’t appreciate how much better things could be.

  • Undigested Ideas – Most ideas start off rough and hard to understand. They become radically easier as we polish them, developing the right analogies, language, and ways of thinking.

  • Bad abstractions and notation – Abstractions and notation are the user interface of research, shaping how we think and communicate. Unfortunately, we often get stuck with the first formalisms to develop even when they’re bad. For example, an object with extra electrons is negative, and pi is wrong.

  • Noise – Being a researcher is like standing in the middle of a construction site. Countless papers scream for your attention and there's no easy way to filter or summarize them. Because most work is explained poorly, it takes a lot of energy to understand each piece of work. For many papers, one wants a simple one sentence explanation of it, but needs to fight with it to get that sentence. Because the simplest way to get the attention of interested parties is to get everyone's attention, we get flooded with work. Because we incentivize people being "prolific," we get flooded with a lot of work… We think noise is the main way experts experience research debt.

The insidious thing about research debt is that it’s normal. Everyone takes it for granted, and doesn’t realize that things could be different. For example, it’s normal to give very mediocre explanations of research, and people perceive that to be the ceiling of explanation quality. On the rare occasions that truly excellent explanations come along, people see them as one-off miracles rather than a sign that we could systematically be doing better.

See also Scholarship & Learning and Good Explanations.

How to teach things well – Neel Nanda, 28 Aug 2020 (108 points, 17 comments, 15 min read, 1 review) (www.neelnanda.io)
Research Debt – Elizabeth, 15 Jul 2018 (24 points, 2 comments, 1 min read) (distill.pub)
Ironing Out the Squiggles – Zack_M_Davis, 29 Apr 2024 (153 points, 36 comments, 11 min read)
[Question] What are Examples of Great Distillers? – adamShimi, 12 Nov 2020 (35 points, 12 comments, 1 min read)
The Cave Allegory Revisited: Understanding GPT's Worldview – Jan_Kulveit, 14 Feb 2023 (85 points, 5 comments, 3 min read)
Explainers Shoot High. Aim Low! – Eliezer Yudkowsky, 24 Oct 2007 (101 points, 35 comments, 1 min read)
Davidad's Bold Plan for Alignment: An In-Depth Explanation – 19 Apr 2023 (161 points, 34 comments, 21 min read)
Infra-Bayesianism Unwrapped – adamShimi, 20 Jan 2021 (58 points, 0 comments, 24 min read)
Call For Distillers – johnswentworth, 4 Apr 2022 (206 points, 43 comments, 3 min read, 1 review)
Abstracts should be either Actually Short™, or broken into paragraphs – Raemon, 24 Mar 2023 (93 points, 27 comments, 5 min read)
Learning how to learn – Neel Nanda, 30 Sep 2020 (38 points, 0 comments, 15 min read) (www.neelnanda.io)
Features that make a report especially helpful to me – lukeprog, 14 Apr 2022 (40 points, 0 comments, 2 min read)
TAPs for Tutoring – Mark Xu, 24 Dec 2020 (27 points, 3 comments, 5 min read)
Stampy's AI Safety Info—New Distillations #3 [May 2023] – markov, 6 Jun 2023 (16 points, 0 comments, 2 min read) (aisafety.info)
(Summary) Sequence Highlights—Thinking Better on Purpose – qazzquimby, 2 Aug 2022 (33 points, 3 comments, 11 min read)
The 101 Space You Will Always Have With You – Screwtape, 29 Nov 2023 (253 points, 21 comments, 6 min read, 1 review)
Natural Abstractions: Key claims, Theorems, and Critiques – 16 Mar 2023 (237 points, 22 comments, 45 min read, 2 reviews)
Stampy's AI Safety Info—New Distillations #1 [March 2023] – markov, 7 Apr 2023 (42 points, 0 comments, 2 min read) (aisafety.info)
Stampy's AI Safety Info—New Distillations #2 [April 2023] – markov, 9 May 2023 (25 points, 1 comment, 1 min read) (aisafety.info)
Expansive translations: considerations and possibilities – ozziegooen, 18 Sep 2020 (43 points, 15 comments, 6 min read)

DARPA Digital Tutor: Four Months to Total Technical Expertise? – SebastianG, 6 Jul 2020 (213 points, 22 comments, 7 min read)
Rationality Dojo – lsusr, 24 Apr 2022 (14 points, 5 comments, 1 min read)
Calling for Student Submissions: AI Safety Distillation Contest – Aris, 24 Apr 2022 (48 points, 15 comments, 4 min read)
Infra-Bayesianism Distillation: Realizability and Decision Theory – Thomas Larsen, 26 May 2022 (40 points, 9 comments, 18 min read)
[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning – johnswentworth, 25 Apr 2022 (22 points, 14 comments, 2 min read)
The Solomonoff Prior is Malign – Mark Xu, 14 Oct 2020 (173 points, 52 comments, 16 min read, 3 reviews)
How to get people to produce more great exposition? Some strategies and their assumptions – riceissa, 25 May 2022 (26 points, 10 comments, 3 min read)
Beren's "Deconfusing Direct vs Amortised Optimisation" – DragonGod, 7 Apr 2023 (52 points, 10 comments, 3 min read)
Cheat sheet of AI X-risk – momom2, 29 Jun 2023 (19 points, 1 comment, 7 min read)
But why would the AI kill us? – So8res, 17 Apr 2023 (132 points, 95 comments, 2 min read)
Exposition as science: some ideas for how to make progress – riceissa, 8 Jul 2022 (21 points, 1 comment, 8 min read)
A distillation of Evan Hubinger's training stories (for SERI MATS) – Daphne_W, 18 Jul 2022 (15 points, 1 comment, 10 min read)
Rationality, Pedagogy, and "Vibes": Quick Thoughts – Nicholas / Heather Kross, 15 Jul 2023 (14 points, 1 comment, 4 min read)
Pitfalls with Proofs – scasper, 19 Jul 2022 (19 points, 21 comments, 8 min read)
AGI ruin mostly rests on strong claims about alignment and deployment, not about society – Rob Bensinger, 24 Apr 2023 (70 points, 8 comments, 6 min read)
Distillation Contest—Results and Recap – Aris, 29 Jul 2022 (34 points, 0 comments, 7 min read)
[Question] Which intro-to-AI-risk text would you recommend to... – Sherrinford, 1 Aug 2022 (12 points, 1 comment, 1 min read)
Seeking PCK (Pedagogical Content Knowledge) – CFAR!Duncan, 12 Aug 2022 (59 points, 11 comments, 5 min read)
A concise sum-up of the basic argument for AI doom – Mergimio H. Doefevmil, 24 Apr 2023 (11 points, 6 comments, 2 min read)
AI alignment as "navigating the space of intelligent behaviour" – Nora_Ammann, 23 Aug 2022 (18 points, 0 comments, 6 min read)

Alignment is hard. Communicating that, might be harder – Eleni Angelou, 1 Sep 2022 (7 points, 8 comments, 3 min read)
How To Know What the AI Knows—An ELK Distillation – Fabien Roger, 4 Sep 2022 (7 points, 0 comments, 5 min read)
Learning-theoretic agenda reading list – Vanessa Kosoy, 9 Nov 2023 (99 points, 1 comment, 2 min read, 1 review)
Summaries: Alignment Fundamentals Curriculum – Leon Lang, 18 Sep 2022 (44 points, 3 comments, 1 min read) (docs.google.com)
Superposition is not "just" neuron polysemanticity – LawrenceC, 26 Apr 2024 (64 points, 4 comments, 13 min read)
Power-Seeking AI and Existential Risk – Antonio Franca, 11 Oct 2022 (6 points, 0 comments, 9 min read)
Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? – Neel Nanda, 1 Nov 2022 (69 points, 16 comments, 1 min read) (youtu.be)
Distillation Experiment: Chunk-Knitting – DirectedEvolution, 7 Nov 2022 (9 points, 1 comment, 6 min read)
The No Free Lunch theorem for dummies – Steven Byrnes, 5 Dec 2022 (37 points, 16 comments, 3 min read)
Shard Theory in Nine Theses: a Distillation and Critical Appraisal – LawrenceC, 19 Dec 2022 (143 points, 30 comments, 18 min read)
Summary of 80k's AI problem profile – JakubK, 1 Jan 2023 (7 points, 0 comments, 5 min read) (forum.effectivealtruism.org)
Results from the Turing Seminar hackathon – 7 Dec 2023 (29 points, 1 comment, 6 min read)
Induction heads—illustrated – CallumMcDougall, 2 Jan 2023 (113 points, 9 comments, 3 min read)
AI Safety Info Distillation Fellowship – 17 Feb 2023 (47 points, 3 comments, 3 min read)
How ARENA course material gets made – CallumMcDougall, 2 Jul 2024 (41 points, 2 comments, 7 min read)
[Question] Is "Strong Coherence" Anti-Natural? – DragonGod, 11 Apr 2023 (23 points, 25 comments, 2 min read)
Uncertainty in all its flavours – Cleo Nardo, 9 Jan 2024 (27 points, 6 comments, 35 min read)
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 – Neel Nanda, 7 Jul 2024 (134 points, 16 comments, 25 min read)
AI Safety - 7 months of discussion in 17 minutes – Zoe Williams, 15 Mar 2023 (25 points, 0 comments, 1 min read)
Getting rational now or later: navigating procrastination and time-inconsistent preferences for new rationalists – milo_thoughts, 26 Feb 2024 (1 point, 0 comments, 8 min read)

Dialogue introduction to Singular Learning Theory – Olli Järviniemi, 8 Jul 2024 (97 points, 14 comments, 8 min read)
Announcing AISafety.info's Write-a-thon (June 16-18) and Second Distillation Fellowship (July 3-October 2) – steven0461, 3 Jun 2023 (33 points, 1 comment, 2 min read)
Join AISafety.info's Distillation Hackathon (Oct 6-9th) – smallsilo, 1 Oct 2023 (21 points, 0 comments, 2 min read) (forum.effectivealtruism.org)
Poker is a bad game for teaching epistemics. Figgie is a better one. – rossry, 8 Jul 2024 (104 points, 47 comments, 11 min read) (blog.rossry.net)
Hiatus: EA and LW post summaries – Zoe Williams, 17 May 2023 (14 points, 0 comments, 1 min read)
Podcast: "How the Smart Money teaches trading with Ricki Heicklen" (Patrick McKenzie interviewing) – rossry, 11 Jul 2024 (20 points, 2 comments, 1 min read) (www.complexsystemspodcast.com)
AI Safety Strategies Landscape – Charbel-Raphaël, 9 May 2024 (34 points, 1 comment, 42 min read)
Explaining Impact Markets – Saul Munn, 31 Jan 2024 (95 points, 2 comments, 3 min read) (www.brasstacks.blog)
Expertise and advice – John_Maxwell, 27 May 2012 (25 points, 4 comments, 1 min read)
Graceful Degradation – Screwtape, 5 Nov 2024 (79 points, 8 comments, 4 min read)
Avoid Unnecessarily Political Examples – Raemon, 11 Jan 2021 (106 points, 42 comments, 3 min read)
Discovery fiction for the Pythagorean theorem – riceissa, 19 Jan 2021 (16 points, 1 comment, 4 min read)
Inversion of theorems into definitions when generalizing – riceissa, 4 Aug 2019 (25 points, 3 comments, 5 min read)
Think like an educator about code quality – Adam Zerner, 27 Mar 2021 (44 points, 8 comments, 9 min read)
99% shorter – philh, 27 May 2021 (16 points, 0 comments, 6 min read) (reasonableapproximation.net)
An Apprentice Experiment in Python Programming – 4 Jul 2021 (67 points, 4 comments, 9 min read)
Learning Math in Time for Alignment – Nicholas / Heather Kross, 9 Jan 2024 (32 points, 3 comments, 3 min read)
An Apprentice Experiment in Python Programming, Part 2 – 29 Jul 2021 (30 points, 18 comments, 10 min read)
Calibration proverbs – Malmesbury, 11 Jan 2022 (76 points, 19 comments, 1 min read)
"Deep Learning" Is Function Approximation – Zack_M_Davis, 21 Mar 2024 (98 points, 28 comments, 10 min read) (zackmdavis.net)

[Closed] Job Offering: Help Communicate Infrabayesianism – 23 Mar 2022 (129 points, 22 comments, 1 min read)
Summary: "How to Write Quickly..." by John Wentworth – Pablo Repetto, 11 Apr 2022 (4 points, 0 comments, 2 min read) (pabloernesto.github.io)
[Question] What to include in a guest lecture on existential risks from AI? – Aryeh Englander, 13 Apr 2022 (20 points, 9 comments, 1 min read)
AI Safety 101: Capabilities—Human Level AI, What? How? and When? – 7 Mar 2024 (46 points, 8 comments, 54 min read)
A Pedagogical Guide to Corrigibility – A.H., 17 Jan 2024 (6 points, 3 comments, 16 min read)
Dreams of "Mathopedia" – Nicholas / Heather Kross, 2 Jun 2023 (40 points, 16 comments, 2 min read) (www.thinkingmuchbetter.com)
Observations on Teaching for Four Weeks – ClareChiaraVincent, 6 May 2024 (50 points, 14 comments, 3 min read)
An Illustrated Summary of "Robust Agents Learn Causal World Model" – Dalcy, 14 Dec 2024 (57 points, 2 comments, 10 min read)
Failure Modes of Teaching AI Safety – Eleni Angelou, 25 Jun 2024 (20 points, 0 comments, 1 min read)
Distillation of 'Do language models plan for future tokens' – TheManxLoiner, 27 Jun 2024 (26 points, 2 comments, 6 min read)
DIY RLHF: A simple implementation for hands on experience – 10 Jul 2024 (28 points, 0 comments, 6 min read)
Video Intro to Guaranteed Safe AI – 11 Jul 2024 (27 points, 0 comments, 1 min read) (youtu.be)
Proof Explained for "Robust Agents Learn Causal World Model" – Dalcy, 22 Dec 2024 (18 points, 0 comments, 15 min read)
Distillation Of DeepSeek-Prover V1.5 – IvanLin, 15 Oct 2024 (4 points, 1 comment, 3 min read)
Concrete Methods for Heuristic Estimation on Neural Networks – Oliver Daniels, 14 Nov 2024 (28 points, 0 comments, 27 min read)
How I got so excited about HowTruthful – Bruce Lewis, 9 Nov 2023 (17 points, 3 comments, 5 min read)
Does anyone use advanced media projects? – ryan_b, 20 Jun 2018 (33 points, 5 comments, 1 min read)
Teaching the Unteachable – Eliezer Yudkowsky, 3 Mar 2009 (55 points, 18 comments, 6 min read)
The Fundamental Question—Rationality computer game design – Kaj_Sotala, 13 Feb 2013 (61 points, 68 comments, 9 min read)
Zetetic explanation – Benquo, 27 Aug 2018 (95 points, 138 comments, 6 min read) (benjaminrosshoffman.com)

Paternal Formats – abramdemski, 9 Jun 2019 (51 points, 35 comments, 2 min read)
The Stanley Parable: Making philosophy fun – Nathan1123, 22 May 2023 (6 points, 3 comments, 3 min read)
Teachable Rationality Skills – Eliezer Yudkowsky, 27 May 2011 (74 points, 263 comments, 1 min read)
Five-minute rationality techniques – sketerpot, 10 Aug 2010 (72 points, 237 comments, 2 min read)
Just One Sentence – Eliezer Yudkowsky, 5 Jan 2013 (96 points, 143 comments, 1 min read)
Media bias – PhilGoetz, 5 Jul 2009 (39 points, 47 comments, 1 min read)
The RAIN Framework for Informational Effectiveness – ozziegooen, 13 Feb 2019 (37 points, 16 comments, 6 min read)
The Up-Goer Five Game: Explaining hard ideas with simple words – Rob Bensinger, 5 Sep 2013 (44 points, 82 comments, 2 min read)
Tutor-GPT & Pedagogical Reasoning – courtlandleer, 5 Jun 2023 (26 points, 3 comments, 4 min read)
A comparison of causal scrubbing, causal abstractions, and related methods – 8 Jun 2023 (73 points, 3 comments, 22 min read)
Rationality Games & Apps Brainstorming – lukeprog, 9 Jul 2012 (42 points, 59 comments, 2 min read)
How not to be a Naïve Computationalist – diegocaleiro, 13 Apr 2011 (39 points, 36 comments, 2 min read)
Dense Math Notation – JK_Ravenclaw, 1 Apr 2011 (33 points, 23 comments, 1 min read)
AIS 101: Task decomposition for scalable oversight – Charbel-Raphaël, 25 Jul 2023 (27 points, 0 comments, 19 min read) (docs.google.com)
Numeracy neglect—A personal postmortem – vlad.proex, 27 Sep 2020 (81 points, 29 comments, 9 min read)
Paper digestion: "May We Have Your Attention Please? Human-Rights NGOs and the Problem of Global Communication" – Klara Helene Nielsen, 20 Jul 2023 (4 points, 1 comment, 2 min read) (journals.sagepub.com)
AI Safety 101: Introduction to Vision Interpretability – 28 Jul 2023 (41 points, 0 comments, 1 min read) (github.com)
Subdivisions for Useful Distillations? – Sharat Jacob Jacob, 24 Jul 2023 (8 points, 2 comments, 2 min read)
Stampy's AI Safety Info—New Distillations #4 [July 2023] – markov, 16 Aug 2023 (22 points, 10 comments, 1 min read) (aisafety.info)
[Question] What AI Posts Do You Want Distilled? – brook, 25 Aug 2023 (11 points, 2 comments, 1 min read) (forum.effectivealtruism.org)

Jan Kulveit's Corrigibility Thoughts Distilled – brook, 20 Aug 2023 (20 points, 1 comment, 5 min read)
Mesa-Optimization: Explain it like I'm 10 Edition – brook, 26 Aug 2023 (20 points, 1 comment, 6 min read)
Moved from Moloch's Toolbox: Discussion re style of latest Eliezer sequence – habryka, 5 Nov 2017 (7 points, 2 comments, 3 min read)
Short Primers on Crucial Topics – lukeprog, 31 May 2012 (35 points, 24 comments, 1 min read)
Graphical tensor notation for interpretability – Jordan Taylor, 4 Oct 2023 (140 points, 11 comments, 19 min read)
An Elementary Introduction to Infra-Bayesianism – CharlesRW, 20 Sep 2023 (16 points, 0 comments, 1 min read)
Great Explanations – lukeprog, 31 Oct 2011 (34 points, 115 comments, 2 min read)
A LessWrong "rationality workbook" idea – jwhendy, 9 Jan 2011 (26 points, 26 comments, 3 min read)
Debugging the student – Adam Zerner, 16 Dec 2020 (46 points, 7 comments, 4 min read)
AI Safety 101: Reward Misspecification – markov, 18 Oct 2023 (30 points, 4 comments, 31 min read)
Retrospective on Teaching Rationality Workshops – Neel Nanda, 3 Jan 2021 (66 points, 2 comments, 31 min read)
[Question] What currents of thought on LessWrong do you want to see distilled? – ryan_b, 8 Jan 2021 (48 points, 19 comments, 1 min read)
An Apprentice Experiment in Python Programming, Part 3 – 16 Aug 2021 (14 points, 10 comments, 22 min read)
Distilling and approaches to the determinant – AprilSR, 6 Apr 2022 (6 points, 0 comments, 6 min read)
Deriving Conditional Expected Utility from Pareto-Efficient Decisions – Thomas Kwa, 5 May 2022 (24 points, 1 comment, 6 min read)
How RL Agents Behave When Their Actions Are Modified? [Distillation post] – PabloAMC, 20 May 2022 (22 points, 0 comments, 8 min read)
Universality Unwrapped – adamShimi, 21 Aug 2020 (29 points, 2 comments, 18 min read)
Imitative Generalisation (AKA 'Learning the Prior') – Beth Barnes, 10 Jan 2021 (107 points, 15 comments, 11 min read, 1 review)
Does SGD Produce Deceptive Alignment? – Mark Xu, 6 Nov 2020 (96 points, 9 comments, 16 min read)
Explaining inner alignment to myself – Jeremy Gillen, 24 May 2022 (9 points, 2 comments, 10 min read)

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge – Alexandre Variengien, 27 May 2022 (17 points, 0 comments, 16 min read)
Deconfusing Landauer's Principle – EuanMcLean, 27 May 2022 (58 points, 13 comments, 15 min read)
Understanding Selection Theorems – adamk, 28 May 2022 (41 points, 3 comments, 7 min read)
Distilled—AGI Safety from First Principles – Harrison G, 29 May 2022 (11 points, 1 comment, 14 min read)
Abram Demski's ELK thoughts and proposal—distillation – Rubi J. Hudson, 19 Jul 2022 (19 points, 8 comments, 16 min read)
AI Safety Cheatsheet / Quick Reference – Zohar Jackson, 20 Jul 2022 (3 points, 0 comments, 1 min read) (github.com)
Announcing the Distillation for Alignment Practicum (DAP) – 18 Aug 2022 (23 points, 3 comments, 3 min read)
Epistemic Artefacts of (conceptual) AI alignment research – 19 Aug 2022 (31 points, 1 comment, 5 min read)
Deep Q-Networks Explained – Jay Bailey, 13 Sep 2022 (58 points, 8 comments, 20 min read)
Understanding Infra-Bayesianism: A Beginner-Friendly Video Series – 22 Sep 2022 (140 points, 6 comments, 2 min read)
Distillation of "How Likely Is Deceptive Alignment?" – NickGabs, 18 Nov 2022 (24 points, 4 comments, 10 min read)
MIRI's "Death with Dignity" in 60 seconds. – Cleo Nardo, 6 Dec 2022 (58 points, 4 comments, 1 min read)
Models Don't "Get Reward" – Sam Ringer, 30 Dec 2022 (313 points, 61 comments, 5 min read, 1 review)
The AI Control Problem in a wider intellectual context – philosophybear, 13 Jan 2023 (11 points, 3 comments, 12 min read)
Nothing Is Ever Taught Correctly – LVSN, 20 Feb 2023 (5 points, 3 comments, 1 min read)
The Benefits of Distillation in Research – Jonas Hallgren, 4 Mar 2023 (15 points, 2 comments, 5 min read)
[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques – 16 Mar 2023 (48 points, 0 comments, 13 min read)