AI

Artificial Intelligence is the study of creating intelligence in algorithms. AI Alignment is the task of ensuring that powerful AI systems are aligned with human values and interests. The central concern is that a powerful enough AI, if not designed and implemented with sufficient understanding, would optimize for something unintended by its creators and pose an existential threat to the future of humanity. This is known as the AI alignment problem.

Common terms in this space are superintelligence, AI Alignment, AI Safety, Friendly AI, Transformative AI, human-level intelligence, AI Governance, and Beneficial AI. This entry and the associated tag roughly encompass all of these topics: anything that is part of the broad cluster of understanding AI and its future impacts on our civilization deserves this tag.

AI Alignment

There are narrow conceptions of alignment, where you’re trying to get the AI to do something like cure Alzheimer’s disease without destroying the rest of the world. And there are much more ambitious notions of alignment, where you’re trying to get it to do the right thing and achieve a happy intergalactic civilization.

But both the narrow and the ambitious notions of alignment have in common that you’re trying to have the AI do that thing rather than make a lot of paperclips.

See also General Intelligence.

Basic Alignment Theory

AIXI
Coherent Extrapolated Volition
Complexity of Value
Corrigibility
Deceptive Alignment
Decision Theory
Embedded Agency
Fixed Point Theorems
Goodhart’s Law
Goal-Directedness
Gradient Hacking
Infra-Bayesianism
Inner Alignment
Instrumental Convergence
Intelligence Explosion
Logical Induction
Logical Uncertainty
Mesa-Optimization
Multipolar Scenarios
Myopia
Newcomb’s Problem
Optimization
Orthogonality Thesis
Outer Alignment
Paperclip Maximizer
Power Seeking (AI)
Recursive Self-Improvement
Simulator Theory
Sharp Left Turn
Solomonoff Induction
Superintelligence
Symbol Grounding
Transformative AI
Treacherous Turn
Utility Functions
Whole Brain Emulation

Engineering Alignment

Agent Foundations
AI-assisted Alignment
AI Boxing (Containment)
Conservatism (AI)
Debate (AI safety technique)
Eliciting Latent Knowledge (ELK)
Factored Cognition
Humans Consulting HCH
Impact Measures
Inverse Reinforcement Learning
Iterated Amplification
Mild Optimization
Oracle AI
Reward Functions
RLHF
Shard Theory
Tool AI
Transparency / Interpretability
Tripwire
Value Learning

Organizations

Full map here

AI Safety Camp
Alignment Research Center
Anthropic
Apart Research
AXRP
CHAI (UC Berkeley)
Conjecture (org)
DeepMind
FHI (Oxford)
Future of Life Institute
MIRI
OpenAI
Ought
SERI MATS

Strategy

AI Alignment Fieldbuilding
AI Governance
AI Persuasion
AI Risk
AI Risk Concrete Stories
AI Safety Public Materials
AI Services (CAIS)
AI Success Models
AI Takeoff
AI Timelines
Computing Overhang
Regulation and AI Risk
Restrain AI Development

Other

AI Alignment Intro Materials
AI Capabilities
AI Questions Open Thread
Compute
DALL-E
GPT
Language Models
Machine Learning
Narrow AI
Neuromorphic AI
Prompt Engineering
Reinforcement Learning
Research Agendas

An overview of 11 proposals for building safe advanced AI

evhub, 29 May 2020 20:38 UTC
194 points
36 comments, 38 min read, LW link, 2 reviews

There’s No Fire Alarm for Artificial General Intelligence

Eliezer Yudkowsky, 13 Oct 2017 21:38 UTC
124 points
71 comments, 25 min read, LW link

Superintelligence FAQ

Scott Alexander, 20 Sep 2016 19:00 UTC
92 points
16 comments, 27 min read, LW link

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
166 points
42 comments, 12 min read, LW link, 3 reviews

Embedded Agents

29 Oct 2018 19:53 UTC
198 points
41 comments, 1 min read, LW link, 2 reviews

What failure looks like

paulfchristiano, 17 Mar 2019 20:18 UTC
319 points
49 comments, 8 min read, LW link, 2 reviews

The Rocket Alignment Problem

Eliezer Yudkowsky, 4 Oct 2018 0:38 UTC
198 points
42 comments, 15 min read, LW link, 2 reviews

Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky, 19 May 2018 18:18 UTC
115 points
54 comments, 23 min read, LW link, 1 review

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
143 points
15 comments, 54 min read, LW link

A space of proposals for building safe advanced AI

Richard_Ngo, 10 Jul 2020 16:58 UTC
55 points
4 comments, 4 min read, LW link

Biology-Inspired AGI Timelines: The Trick That Never Works

Eliezer Yudkowsky, 1 Dec 2021 22:35 UTC
181 points
143 comments, 65 min read, LW link

PreDCA: vanessa kosoy’s alignment protocol

Tamsin Leake, 20 Aug 2022 10:03 UTC
46 points
8 comments, 7 min read, LW link
(carado.moe)

larger language models may disappoint you [or, an eternally unfinished draft]

nostalgebraist, 26 Nov 2021 23:08 UTC
237 points
29 comments, 31 min read, LW link, 1 review

Deepmind’s Gopher—more powerful than GPT-3

hath, 8 Dec 2021 17:06 UTC
86 points
27 comments, 1 min read, LW link
(deepmind.com)

Project proposal: Testing the IBP definition of agent

9 Aug 2022 1:09 UTC
21 points
4 comments, 2 min read, LW link

Goodhart Taxonomy

Scott Garrabrant, 30 Dec 2017 16:38 UTC
180 points
33 comments, 10 min read, LW link

AI Alignment 2018-19 Review

Rohin Shah, 28 Jan 2020 2:19 UTC
125 points
6 comments, 35 min read, LW link

Some AI research areas and their relevance to existential safety

Andrew_Critch, 19 Nov 2020 3:18 UTC
199 points
40 comments, 50 min read, LW link, 2 reviews

Moravec’s Paradox Comes From The Availability Heuristic

james.lucassen, 20 Oct 2021 6:23 UTC
32 points
2 comments, 2 min read, LW link
(jlucassen.com)

Inference cost limits the impact of ever larger models

SoerenMind, 23 Oct 2021 10:51 UTC
36 points
28 comments, 2 min read, LW link

[Linkpost] Chi­nese gov­ern­ment’s guidelines on AI

RomanS10 Dec 2021 21:10 UTC
61 points
14 comments1 min readLW link

That Alien Message

Eliezer Yudkowsky22 May 2008 5:55 UTC
304 points
173 comments10 min readLW link

Episte­molog­i­cal Fram­ing for AI Align­ment Research

adamShimi8 Mar 2021 22:05 UTC
53 points
7 comments9 min readLW link

Effi­cien­tZero: hu­man ALE sam­ple-effi­ciency w/​MuZero+self-supervised

gwern2 Nov 2021 2:32 UTC
134 points
52 comments1 min readLW link
(arxiv.org)

Dis­cus­sion with Eliezer Yud­kowsky on AGI interventions

11 Nov 2021 3:01 UTC
325 points
257 comments34 min readLW link

Shul­man and Yud­kowsky on AI progress

3 Dec 2021 20:05 UTC
90 points
16 comments20 min readLW link

Fu­ture ML Sys­tems Will Be Qual­i­ta­tively Different

jsteinhardt11 Jan 2022 19:50 UTC
113 points
10 comments5 min readLW link
(bounded-regret.ghost.io)

[Linkpost] Tro­janNet: Embed­ding Hid­den Tro­jan Horse Models in Neu­ral Networks

Gunnar_Zarncke11 Feb 2022 1:17 UTC
13 points
1 comment1 min readLW link

Briefly think­ing through some analogs of debate

Eli Tyre11 Sep 2022 12:02 UTC
20 points
3 comments4 min readLW link

Ro­bust­ness to Scale

Scott Garrabrant21 Feb 2018 22:55 UTC
109 points
22 comments2 min readLW link1 review

Chris Olah’s views on AGI safety

evhub1 Nov 2019 20:13 UTC
197 points
38 comments12 min readLW link2 reviews

[AN #96]: Buck and I dis­cuss/​ar­gue about AI Alignment

Rohin Shah22 Apr 2020 17:20 UTC
17 points
4 comments10 min readLW link
(mailchi.mp)

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
147 points
87 comments5 min readLW link

A de­scrip­tive, not pre­scrip­tive, overview of cur­rent AI Align­ment Research

6 Jun 2022 21:59 UTC
126 points
21 comments7 min readLW link

Co­her­ence ar­gu­ments do not en­tail goal-di­rected behavior

Rohin Shah3 Dec 2018 3:26 UTC
101 points
69 comments7 min readLW link3 reviews

Align­ment By Default

johnswentworth12 Aug 2020 18:54 UTC
153 points
92 comments11 min readLW link2 reviews

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
110 points
18 comments19 min readLW link

Model­ling Trans­for­ma­tive AI Risks (MTAIR) Pro­ject: Introduction

16 Aug 2021 7:12 UTC
89 points
0 comments9 min readLW link

In­fra-Bayesian phys­i­cal­ism: a for­mal the­ory of nat­u­ral­ized induction

Vanessa Kosoy30 Nov 2021 22:25 UTC
98 points
20 comments42 min readLW link1 review

What an ac­tu­ally pes­simistic con­tain­ment strat­egy looks like

lc5 Apr 2022 0:19 UTC
554 points
136 comments6 min readLW link

Why I think strong gen­eral AI is com­ing soon

porby28 Sep 2022 5:40 UTC
269 points
126 comments34 min readLW link

AlphaGo Zero and the Foom Debate

Eliezer Yudkowsky21 Oct 2017 2:18 UTC
89 points
17 comments3 min readLW link

Trade­off be­tween de­sir­able prop­er­ties for baseline choices in im­pact measures

Vika4 Jul 2020 11:56 UTC
37 points
24 comments5 min readLW link

Com­pe­ti­tion: Am­plify Ro­hin’s Pre­dic­tion on AGI re­searchers & Safety Concerns

stuhlmueller21 Jul 2020 20:06 UTC
80 points
40 comments3 min readLW link

the scal­ing “in­con­sis­tency”: openAI’s new insight

nostalgebraist7 Nov 2020 7:40 UTC
146 points
14 comments9 min readLW link
(nostalgebraist.tumblr.com)

2019 Re­view Rewrite: Seek­ing Power is Often Ro­bustly In­stru­men­tal in MDPs

TurnTrout23 Dec 2020 17:16 UTC
35 points
0 comments4 min readLW link
(www.lesswrong.com)

Boot­strapped Alignment

Gordon Seidoh Worley27 Feb 2021 15:46 UTC
19 points
12 comments2 min readLW link

Mul­ti­modal Neu­rons in Ar­tifi­cial Neu­ral Networks

Kaj_Sotala5 Mar 2021 9:01 UTC
57 points
2 comments2 min readLW link
(distill.pub)

Re­view of “Fun with +12 OOMs of Com­pute”

28 Mar 2021 14:55 UTC
60 points
20 comments8 min readLW link

Draft re­port on ex­is­ten­tial risk from power-seek­ing AI

Joe Carlsmith28 Apr 2021 21:41 UTC
80 points
23 comments1 min readLW link

Rogue AGI Em­bod­ies Valuable In­tel­lec­tual Property

3 Jun 2021 20:37 UTC
70 points
9 comments3 min readLW link

Deep­Mind: Gen­er­ally ca­pa­ble agents emerge from open-ended play

Daniel Kokotajlo27 Jul 2021 14:19 UTC
247 points
53 comments2 min readLW link
(deepmind.com)

Analo­gies and Gen­eral Pri­ors on Intelligence

20 Aug 2021 21:03 UTC
57 points
12 comments14 min readLW link

We’re already in AI takeoff

Valentine8 Mar 2022 23:09 UTC
120 points
115 comments7 min readLW link

It Looks Like You’re Try­ing To Take Over The World

gwern9 Mar 2022 16:35 UTC
386 points
125 comments1 min readLW link
(www.gwern.net)

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
45 points
0 comments59 min readLW link

Why all the fuss about re­cur­sive self-im­prove­ment?

So8res12 Jun 2022 20:53 UTC
150 points
62 comments7 min readLW link

AI Safety bounty for prac­ti­cal ho­mo­mor­phic encryption

acylhalide19 Aug 2022 12:27 UTC
29 points
9 comments4 min readLW link

Paper: Dis­cov­er­ing novel al­gorithms with AlphaTen­sor [Deep­mind]

LawrenceC5 Oct 2022 16:20 UTC
80 points
18 comments1 min readLW link
(www.deepmind.com)

The Teacup Test

lsusr8 Oct 2022 4:25 UTC
71 points
28 comments2 min readLW link

Dis­con­tin­u­ous progress in his­tory: an update

KatjaGrace14 Apr 2020 0:00 UTC
179 points
25 comments31 min readLW link1 review
(aiimpacts.org)

Repli­ca­tion Dy­nam­ics Bridge to RL in Ther­mo­dy­namic Limit

Past Account18 May 2020 1:02 UTC
6 points
1 comment2 min readLW link

The ground of optimization

Alex Flint20 Jun 2020 0:38 UTC
218 points
74 comments27 min readLW link1 review

Model­ling Con­tin­u­ous Progress

Sammy Martin23 Jun 2020 18:06 UTC
29 points
3 comments7 min readLW link

Refram­ing Su­per­in­tel­li­gence: Com­pre­hen­sive AI Ser­vices as Gen­eral Intelligence

Rohin Shah8 Jan 2019 7:12 UTC
118 points
75 comments5 min readLW link2 reviews
(www.fhi.ox.ac.uk)

Clas­sifi­ca­tion of AI al­ign­ment re­search: de­con­fu­sion, “good enough” non-su­per­in­tel­li­gent AI al­ign­ment, su­per­in­tel­li­gent AI alignment

philip_b14 Jul 2020 22:48 UTC
35 points
25 comments3 min readLW link

Col­lec­tion of GPT-3 results

Kaj_Sotala18 Jul 2020 20:04 UTC
89 points
24 comments1 min readLW link
(twitter.com)

Hiring en­g­ineers and re­searchers to help al­ign GPT-3

paulfchristiano1 Oct 2020 18:54 UTC
206 points
14 comments3 min readLW link

The date of AI Takeover is not the day the AI takes over

Daniel Kokotajlo22 Oct 2020 10:41 UTC
116 points
32 comments2 min readLW link1 review

[Question] What could one do with truly un­limited com­pu­ta­tional power?

Yitz11 Nov 2020 10:03 UTC
30 points
22 comments2 min readLW link

AGI Predictions

21 Nov 2020 3:46 UTC
110 points
36 comments4 min readLW link

[Question] What are the best prece­dents for in­dus­tries failing to in­vest in valuable AI re­search?

Daniel Kokotajlo14 Dec 2020 23:57 UTC
18 points
17 comments1 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lukas Finnveden18 Dec 2020 21:41 UTC
103 points
31 comments25 min readLW link1 review

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth Barnes23 Dec 2020 3:24 UTC
125 points
21 comments16 min readLW link

Liter­a­ture Re­view on Goal-Directedness

18 Jan 2021 11:15 UTC
69 points
21 comments31 min readLW link

[Question] How will OpenAI + GitHub’s Copi­lot af­fect pro­gram­ming?

smountjoy29 Jun 2021 16:42 UTC
55 points
23 comments1 min readLW link

Model­ing Risks From Learned Optimization

Ben Cottier12 Oct 2021 20:54 UTC
44 points
0 comments12 min readLW link

Truth­ful AI: Devel­op­ing and gov­ern­ing AI that does not lie

18 Oct 2021 18:37 UTC
81 points
9 comments10 min readLW link

Effi­cien­tZero: How It Works

1a3orn26 Nov 2021 15:17 UTC
273 points
42 comments29 min readLW link

The­o­ret­i­cal Neu­ro­science For Align­ment Theory

Cameron Berg7 Dec 2021 21:50 UTC
62 points
19 comments23 min readLW link

Magna Alta Doctrina

jacob_cannell11 Dec 2021 21:54 UTC
37 points
7 comments28 min readLW link

DL to­wards the un­al­igned Re­cur­sive Self-Op­ti­miza­tion attractor

jacob_cannell18 Dec 2021 2:15 UTC
32 points
22 comments4 min readLW link

Reg­u­lariza­tion Causes Mo­du­lar­ity Causes Generalization

dkirmani1 Jan 2022 23:34 UTC
49 points
7 comments3 min readLW link

Is Gen­eral In­tel­li­gence “Com­pact”?

DragonGod4 Jul 2022 13:27 UTC
21 points
6 comments22 min readLW link

The Tree of Life: Stan­ford AI Align­ment The­ory of Change

Gabe M2 Jul 2022 18:36 UTC
22 points
0 comments14 min readLW link

Shard The­ory: An Overview

David Udell11 Aug 2022 5:44 UTC
135 points
34 comments10 min readLW link

How evolu­tion suc­ceeds and fails at value alignment

Ocracoke21 Aug 2022 7:14 UTC
21 points
2 comments4 min readLW link

An Un­trol­lable Math­e­mat­i­cian Illustrated

abramdemski20 Mar 2018 0:00 UTC
155 points
38 comments1 min readLW link1 review

Con­di­tions for Mesa-Optimization

1 Jun 2019 20:52 UTC
75 points
48 comments12 min readLW link

Thoughts on Hu­man Models

21 Feb 2019 9:10 UTC
124 points
32 comments10 min readLW link1 review

In­ner al­ign­ment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
76 points
16 comments16 min readLW link

Prob­lem re­lax­ation as a tactic

TurnTrout22 Apr 2020 23:44 UTC
113 points
8 comments7 min readLW link

[Question] How should po­ten­tial AI al­ign­ment re­searchers gauge whether the field is right for them?

TurnTrout6 May 2020 12:24 UTC
20 points
5 comments1 min readLW link

Speci­fi­ca­tion gam­ing: the flip side of AI ingenuity

6 May 2020 23:51 UTC
46 points
8 comments6 min readLW link

Les­sons from Isaac: Pit­falls of Reason

adamShimi8 May 2020 20:44 UTC
9 points
0 comments8 min readLW link

Cor­rigi­bil­ity as out­side view

TurnTrout8 May 2020 21:56 UTC
36 points
11 comments4 min readLW link

[Question] How to choose a PhD with AI Safety in mind

Ariel Kwiatkowski15 May 2020 22:19 UTC
9 points
1 comment1 min readLW link

Re­ward func­tions and up­dat­ing as­sump­tions can hide a mul­ti­tude of sins

Stuart_Armstrong18 May 2020 15:18 UTC
16 points
2 comments9 min readLW link

Pos­si­ble take­aways from the coro­n­avirus pan­demic for slow AI takeoff

Vika31 May 2020 17:51 UTC
135 points
36 comments3 min readLW link1 review

Fo­cus: you are al­lowed to be bad at ac­com­plish­ing your goals

adamShimi3 Jun 2020 21:04 UTC
19 points
17 comments3 min readLW link

Re­ply to Paul Chris­ti­ano on Inac­cessible Information

Alex Flint5 Jun 2020 9:10 UTC
77 points
15 comments6 min readLW link

Our take on CHAI’s re­search agenda in un­der 1500 words

Alex Flint17 Jun 2020 12:24 UTC
112 points
19 comments5 min readLW link

[Question] Ques­tion on GPT-3 Ex­cel Demo

Zhitao Hou22 Jun 2020 20:31 UTC
0 points
2 comments1 min readLW link

Dy­namic in­con­sis­tency of the in­ac­tion and ini­tial state baseline

Stuart_Armstrong7 Jul 2020 12:02 UTC
30 points
8 comments2 min readLW link

Cortés, Pizarro, and Afonso as Prece­dents for Takeover

Daniel Kokotajlo1 Mar 2020 3:49 UTC
145 points
75 comments11 min readLW link1 review

[Question] What prob­lem would you like to see Re­in­force­ment Learn­ing ap­plied to?

Julian Schrittwieser8 Jul 2020 2:40 UTC
43 points
4 comments1 min readLW link

My cur­rent frame­work for think­ing about AGI timelines

zhukeepa30 Mar 2020 1:23 UTC
107 points
5 comments3 min readLW link

[Question] To what ex­tent is GPT-3 ca­pa­ble of rea­son­ing?

TurnTrout20 Jul 2020 17:10 UTC
70 points
74 comments16 min readLW link

Repli­cat­ing the repli­ca­tion crisis with GPT-3?

skybrian22 Jul 2020 21:20 UTC
29 points
10 comments1 min readLW link

Can you get AGI from a Trans­former?

Steven Byrnes23 Jul 2020 15:27 UTC
114 points
39 comments12 min readLW link

Writ­ing with GPT-3

Jacob Falkovich24 Jul 2020 15:22 UTC
42 points
0 comments4 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
175 points
46 comments13 min readLW link2 reviews

Devel­op­men­tal Stages of GPTs

orthonormal26 Jul 2020 22:03 UTC
140 points
74 comments7 min readLW link1 review

Gen­er­al­iz­ing the Power-Seek­ing Theorems

TurnTrout27 Jul 2020 0:28 UTC
40 points
6 comments4 min readLW link

Are we in an AI over­hang?

Andy Jones27 Jul 2020 12:48 UTC
255 points
109 comments4 min readLW link

[Question] What spe­cific dan­gers arise when ask­ing GPT-N to write an Align­ment Fo­rum post?

Matthew Barnett28 Jul 2020 2:56 UTC
44 points
14 comments1 min readLW link

[Question] Prob­a­bil­ity that other ar­chi­tec­tures will scale as well as Trans­form­ers?

Daniel Kokotajlo28 Jul 2020 19:36 UTC
22 points
4 comments1 min readLW link

What a 20-year-lead in mil­i­tary tech might look like

Daniel Kokotajlo29 Jul 2020 20:10 UTC
68 points
44 comments16 min readLW link

[Question] What if memes are com­mon in highly ca­pa­ble minds?

Daniel Kokotajlo30 Jul 2020 20:45 UTC
36 points
15 comments2 min readLW link

Three men­tal images from think­ing about AGI de­bate & corrigibility

Steven Byrnes3 Aug 2020 14:29 UTC
55 points
35 comments4 min readLW link

Solv­ing Key Align­ment Prob­lems Group

Logan Riggs3 Aug 2020 19:30 UTC
19 points
7 comments2 min readLW link

How eas­ily can we sep­a­rate a friendly AI in de­sign space from one which would bring about a hy­per­ex­is­ten­tial catas­tro­phe?

Anirandis10 Sep 2020 0:40 UTC
19 points
20 comments2 min readLW link

My com­pu­ta­tional frame­work for the brain

Steven Byrnes14 Sep 2020 14:19 UTC
144 points
26 comments13 min readLW link1 review

[Question] Where is hu­man level on text pre­dic­tion? (GPTs task)

Daniel Kokotajlo20 Sep 2020 9:00 UTC
27 points
19 comments1 min readLW link

Needed: AI in­fo­haz­ard policy

Vanessa Kosoy21 Sep 2020 15:26 UTC
61 points
17 comments2 min readLW link

The Col­lid­ing Ex­po­nen­tials of AI

Vermillion14 Oct 2020 23:31 UTC
27 points
16 comments5 min readLW link

“Lit­tle glimpses of em­pa­thy” as the foun­da­tion for so­cial emotions

Steven Byrnes22 Oct 2020 11:02 UTC
31 points
1 comment5 min readLW link

In­tro­duc­tion to Carte­sian Frames

Scott Garrabrant22 Oct 2020 13:00 UTC
145 points
29 comments22 min readLW link1 review

“Carte­sian Frames” Talk #2 this Sun­day at 2pm (PT)

Rob Bensinger28 Oct 2020 13:59 UTC
30 points
0 comments1 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
85 points
6 comments16 min readLW link

[Question] How can I bet on short timelines?

Daniel Kokotajlo7 Nov 2020 12:44 UTC
43 points
16 comments2 min readLW link

Non-Ob­struc­tion: A Sim­ple Con­cept Mo­ti­vat­ing Corrigibility

TurnTrout21 Nov 2020 19:35 UTC
67 points
19 comments19 min readLW link

Carte­sian Frames Definitions

Rob Bensinger8 Nov 2020 12:44 UTC
25 points
0 comments4 min readLW link

Com­mu­ni­ca­tion Prior as Align­ment Strategy

johnswentworth12 Nov 2020 22:06 UTC
40 points
8 comments6 min readLW link

How Rood­man’s GWP model trans­lates to TAI timelines

Daniel Kokotajlo16 Nov 2020 14:05 UTC
22 points
5 comments3 min readLW link

Normativity

abramdemski18 Nov 2020 16:52 UTC
46 points
11 comments9 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
136 points
39 comments11 min readLW link2 reviews

Con­tin­u­ing the take­offs debate

Richard_Ngo23 Nov 2020 15:58 UTC
67 points
13 comments9 min readLW link

The next AI win­ter will be due to en­ergy costs

hippke24 Nov 2020 16:53 UTC
57 points
7 comments2 min readLW link

Re­cur­sive Quan­tiliz­ers II

abramdemski2 Dec 2020 15:26 UTC
30 points
15 comments13 min readLW link

Su­per­vised learn­ing in the brain, part 4: com­pres­sion /​ filtering

Steven Byrnes5 Dec 2020 17:06 UTC
12 points
0 comments5 min readLW link

Con­ser­vatism in neo­cor­tex-like AGIs

Steven Byrnes8 Dec 2020 16:37 UTC
22 points
5 comments8 min readLW link

Avoid­ing Side Effects in Com­plex Environments

12 Dec 2020 0:34 UTC
62 points
9 comments2 min readLW link
(avoiding-side-effects.github.io)

The Power of Annealing

meanderingmoose14 Dec 2020 11:02 UTC
25 points
6 comments5 min readLW link

[link] The AI Gir­lfriend Se­duc­ing China’s Lonely Men

Kaj_Sotala14 Dec 2020 20:18 UTC
34 points
11 comments1 min readLW link
(www.sixthtone.com)

Oper­a­tional­iz­ing com­pat­i­bil­ity with strat­egy-stealing

evhub24 Dec 2020 22:36 UTC
41 points
6 comments4 min readLW link

De­fus­ing AGI Danger

Mark Xu24 Dec 2020 22:58 UTC
48 points
9 comments9 min readLW link

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
19 points
8 comments10 min readLW link

DALL-E by OpenAI

Daniel Kokotajlo5 Jan 2021 20:05 UTC
97 points
22 comments1 min readLW link

Re­view of ‘But ex­actly how com­plex and frag­ile?’

TurnTrout6 Jan 2021 18:39 UTC
55 points
0 comments8 min readLW link

The Case for a Jour­nal of AI Alignment

adamShimi9 Jan 2021 18:13 UTC
45 points
32 comments4 min readLW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
52 points
12 comments30 min readLW link

Birds, Brains, Planes, and AI: Against Ap­peals to the Com­plex­ity/​Mys­te­ri­ous­ness/​Effi­ciency of the Brain

Daniel Kokotajlo18 Jan 2021 12:08 UTC
184 points
85 comments14 min readLW link1 review

In­fra-Bayesi­anism Unwrapped

adamShimi20 Jan 2021 13:35 UTC
41 points
0 comments24 min readLW link

Op­ti­mal play in hu­man-judged De­bate usu­ally won’t an­swer your question

Joe Collman27 Jan 2021 7:34 UTC
33 points
12 comments12 min readLW link

Creat­ing AGI Safety Interlocks

Koen.Holtman5 Feb 2021 12:01 UTC
7 points
4 comments8 min readLW link

Timeline of AI safety

riceissa7 Feb 2021 22:29 UTC
63 points
6 comments2 min readLW link
(timelines.issarice.com)

Tour­ne­sol, YouTube and AI Risk

adamShimi12 Feb 2021 18:56 UTC
36 points
13 comments4 min readLW link

In­ter­net En­cy­clo­pe­dia of Philos­o­phy on Ethics of Ar­tifi­cial Intelligence

Kaj_Sotala20 Feb 2021 13:54 UTC
15 points
1 comment4 min readLW link
(iep.utm.edu)

Be­hav­ioral Suffi­cient Statis­tics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC
21 points
12 comments9 min readLW link

A sim­ple way to make GPT-3 fol­low instructions

Quintin Pope8 Mar 2021 2:57 UTC
11 points
5 comments4 min readLW link

Towards a Mechanis­tic Un­der­stand­ing of Goal-Directedness

Mark Xu9 Mar 2021 20:17 UTC
45 points
1 comment5 min readLW link

AXRP Epi­sode 5 - In­fra-Bayesi­anism with Vanessa Kosoy

DanielFilan10 Mar 2021 4:30 UTC
33 points
12 comments35 min readLW link

Com­ments on “The Sin­gu­lar­ity is Nowhere Near”

Steven Byrnes16 Mar 2021 23:59 UTC
50 points
6 comments8 min readLW link

Is RL in­volved in sen­sory pro­cess­ing?

Steven Byrnes18 Mar 2021 13:57 UTC
21 points
21 comments5 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven Byrnes23 Mar 2021 12:29 UTC
44 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
66 points
40 comments16 min readLW link

Co­her­ence ar­gu­ments im­ply a force for goal-di­rected behavior

KatjaGrace26 Mar 2021 16:10 UTC
88 points
27 comments14 min readLW link
(aiimpacts.org)

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
25 points
2 comments7 min readLW link

Hard­ware is already ready for the sin­gu­lar­ity. Al­gorithm knowl­edge is the only bar­rier.

Andrew Vlahos30 Mar 2021 22:48 UTC
16 points
3 comments3 min readLW link

Ben Go­ertzel’s “Kinds of Minds”

JoshuaFox11 Apr 2021 12:41 UTC
12 points
4 comments1 min readLW link

Up­dat­ing the Lot­tery Ticket Hypothesis

johnswentworth18 Apr 2021 21:45 UTC
73 points
41 comments2 min readLW link

Three rea­sons to ex­pect long AI timelines

Matthew Barnett22 Apr 2021 18:44 UTC
68 points
29 comments11 min readLW link
(matthewbarnett.substack.com)

Be­ware over-use of the agent model

Alex Flint25 Apr 2021 22:19 UTC
28 points
10 comments5 min readLW link1 review

Agents Over Carte­sian World Models

27 Apr 2021 2:06 UTC
62 points
3 comments27 min readLW link

Less Real­is­tic Tales of Doom

Mark Xu6 May 2021 23:01 UTC
110 points
13 comments4 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
93 comments2 min readLW link
(danielfilan.com)

For­mal In­ner Align­ment, Prospectus

abramdemski12 May 2021 19:57 UTC
91 points
57 comments16 min readLW link

Agency in Con­way’s Game of Life

Alex Flint13 May 2021 1:07 UTC
97 points
81 comments9 min readLW link1 review

Knowl­edge Neu­rons in Pre­trained Transformers

evhub17 May 2021 22:54 UTC
98 points
7 comments2 min readLW link
(arxiv.org)

De­cou­pling de­liber­a­tion from competition

paulfchristiano25 May 2021 18:50 UTC
72 points
16 comments9 min readLW link
(ai-alignment.com)

Power dy­nam­ics as a blind spot or blurry spot in our col­lec­tive world-mod­el­ing, es­pe­cially around AI

Andrew_Critch1 Jun 2021 18:45 UTC
176 points
26 comments6 min readLW link

Game-the­o­retic Align­ment in terms of At­tain­able Utility

8 Jun 2021 12:36 UTC
20 points
2 comments9 min readLW link

Beijing Academy of Ar­tifi­cial In­tel­li­gence an­nounces 1,75 trillion pa­ram­e­ters model, Wu Dao 2.0

Ozyrus3 Jun 2021 12:07 UTC
23 points
9 comments1 min readLW link
(www.engadget.com)

An In­tu­itive Guide to Garrabrant Induction

Mark Xu3 Jun 2021 22:21 UTC
115 points
18 comments24 min readLW link

Con­ser­va­tive Agency with Mul­ti­ple Stakeholders

TurnTrout8 Jun 2021 0:30 UTC
31 points
0 comments3 min readLW link

Sup­ple­ment to “Big pic­ture of pha­sic dopamine”

Steven Byrnes8 Jun 2021 13:08 UTC
13 points
2 comments9 min readLW link

Look­ing Deeper at Deconfusion

adamShimi13 Jun 2021 21:29 UTC
57 points
13 comments15 min readLW link

[Question] Open prob­lem: how can we quan­tify player al­ign­ment in 2x2 nor­mal-form games?

TurnTrout16 Jun 2021 2:09 UTC
23 points
59 comments1 min readLW link

Re­ward Is Not Enough

Steven Byrnes16 Jun 2021 13:52 UTC
105 points
18 comments10 min readLW link

En­vi­ron­men­tal Struc­ture Can Cause In­stru­men­tal Convergence

TurnTrout22 Jun 2021 22:26 UTC
71 points
44 comments16 min readLW link
(arxiv.org)

AXRP Epi­sode 9 - Finite Fac­tored Sets with Scott Garrabrant

DanielFilan24 Jun 2021 22:10 UTC
56 points
2 comments58 min readLW link

Mus­ings on gen­eral sys­tems alignment

Alex Flint30 Jun 2021 18:16 UTC
31 points
11 comments3 min readLW link

Thoughts on safety in pre­dic­tive learning

Steven Byrnes30 Jun 2021 19:17 UTC
18 points
17 comments19 min readLW link

The More Power At Stake, The Stronger In­stru­men­tal Con­ver­gence Gets For Op­ti­mal Policies

TurnTrout11 Jul 2021 17:36 UTC
45 points
7 comments6 min readLW link

A world in which the al­ign­ment prob­lem seems lower-stakes

TurnTrout8 Jul 2021 2:31 UTC
19 points
17 comments2 min readLW link

Frac­tional progress es­ti­mates for AI timelines and im­plied re­source requirements

15 Jul 2021 18:43 UTC
55 points
6 comments7 min readLW link

Ex­per­i­men­ta­tion with AI-gen­er­ated images (VQGAN+CLIP) | So­larpunk air­ships flee­ing a dragon

Kaj_Sotala15 Jul 2021 11:00 UTC
44 points
4 comments2 min readLW link
(kajsotala.fi)

Seek­ing Power is Con­ver­gently In­stru­men­tal in a Broad Class of Environments

TurnTrout8 Aug 2021 2:02 UTC
41 points
15 comments8 min readLW link

LCDT, A My­opic De­ci­sion Theory

3 Aug 2021 22:41 UTC
50 points
51 comments15 min readLW link

When Most VNM-Co­her­ent Prefer­ence Order­ings Have Con­ver­gent In­stru­men­tal Incentives

TurnTrout9 Aug 2021 17:22 UTC
52 points
4 comments5 min readLW link

Two AI-risk-re­lated game de­sign ideas

Daniel Kokotajlo5 Aug 2021 13:36 UTC
47 points
9 comments5 min readLW link

Re­search agenda update

Steven Byrnes6 Aug 2021 19:24 UTC
54 points
40 comments7 min readLW link

What 2026 looks like

Daniel Kokotajlo6 Aug 2021 16:14 UTC
371 points
109 comments16 min readLW link1 review

Satis­ficers Tend To Seek Power: In­stru­men­tal Con­ver­gence Via Retargetability

TurnTrout18 Nov 2021 1:54 UTC
69 points
8 comments17 min readLW link
(www.overleaf.com)

Dopamine-su­per­vised learn­ing in mam­mals & fruit flies

Steven Byrnes10 Aug 2021 16:13 UTC
16 points
6 comments8 min readLW link

Free course re­view — Reli­able and In­ter­pretable Ar­tifi­cial In­tel­li­gence (ETH Zurich)

Jan Czechowski10 Aug 2021 16:36 UTC
7 points
0 comments3 min readLW link

Tech­ni­cal Pre­dic­tions Re­lated to AI Safety

lsusr13 Aug 2021 0:29 UTC
28 points
12 comments8 min readLW link

Provide feed­back on Open Philan­thropy’s AI al­ign­ment RFP

20 Aug 2021 19:52 UTC
56 points
6 comments1 min readLW link

AI Safety Papers: An App for the TAI Safety Database

ozziegooen21 Aug 2021 2:02 UTC
74 points
13 comments2 min readLW link

Ran­dal Koene on brain un­der­stand­ing be­fore whole brain emulation

Steven Byrnes23 Aug 2021 20:59 UTC
36 points
12 comments3 min readLW link

MIRI/​OP ex­change about de­ci­sion theory

Rob Bensinger25 Aug 2021 22:44 UTC
47 points
7 comments10 min readLW link

Good­hart Ethology

Charlie Steiner17 Sep 2021 17:31 UTC
18 points
4 comments14 min readLW link

[Question] What are good al­ign­ment con­fer­ence pa­pers?

adamShimi28 Aug 2021 13:35 UTC
12 points
2 comments1 min readLW link

Brain-Com­puter In­ter­faces and AI Alignment

niplav28 Aug 2021 19:48 UTC
31 points
6 comments11 min readLW link

Su­per­in­tel­li­gent In­tro­spec­tion: A Counter-ar­gu­ment to the Orthog­o­nal­ity Thesis

DirectedEvolution29 Aug 2021 4:53 UTC
3 points
18 comments4 min readLW link

Align­ment Re­search = Con­cep­tual Align­ment Re­search + Ap­plied Align­ment Research

adamShimi30 Aug 2021 21:13 UTC
37 points
14 comments5 min readLW link

AXRP Epi­sode 11 - At­tain­able Utility and Power with Alex Turner

DanielFilan25 Sep 2021 21:10 UTC
19 points
5 comments52 min readLW link

Is progress in ML-as­sisted the­o­rem-prov­ing benefi­cial?

mako yass28 Sep 2021 1:54 UTC
10 points
3 comments1 min readLW link

Take­off Speeds and Discontinuities

30 Sep 2021 13:50 UTC
62 points
1 comment15 min readLW link

My take on Vanessa Kosoy’s take on AGI safety

Steven Byrnes30 Sep 2021 12:23 UTC
84 points
10 comments31 min readLW link

[Pre­dic­tion] We are in an Al­gorith­mic Overhang

lsusr29 Sep 2021 23:40 UTC
31 points
14 comments1 min readLW link

In­ter­view with Skynet

lsusr30 Sep 2021 2:20 UTC
49 points
1 comment2 min readLW link

AI learns be­trayal and how to avoid it

Stuart_Armstrong30 Sep 2021 9:39 UTC
30 points
4 comments2 min readLW link

The Dark Side of Cog­ni­tion Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC
19 points
1 comment16 min readLW link

[Question] How to think about and deal with OpenAI

Rafael Harth9 Oct 2021 13:10 UTC
107 points
71 comments1 min readLW link

NVIDIA and Microsoft re­leases 530B pa­ram­e­ter trans­former model, Me­ga­tron-Tur­ing NLG

Ozyrus11 Oct 2021 15:28 UTC
51 points
36 comments1 min readLW link
(developer.nvidia.com)

Post­mod­ern Warfare

lsusr25 Oct 2021 9:02 UTC
61 points
25 comments2 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
105 points
8 comments2 min readLW link

Study Guide

johnswentworth6 Nov 2021 1:23 UTC
220 points
41 comments16 min readLW link

Re: At­tempted Gears Anal­y­sis of AGI In­ter­ven­tion Dis­cus­sion With Eliezer

lsusr15 Nov 2021 10:02 UTC
20 points
8 comments15 min readLW link

Ngo and Yud­kowsky on al­ign­ment difficulty

15 Nov 2021 20:31 UTC
235 points
143 comments99 min readLW link

Cor­rigi­bil­ity Can Be VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
64 points
24 comments7 min readLW link

Visi­ble Thoughts Pro­ject and Bounty Announcement

So8res30 Nov 2021 0:19 UTC
245 points
104 comments13 min readLW link

In­ter­pret­ing Yud­kowsky on Deep vs Shal­low Knowledge

adamShimi5 Dec 2021 17:32 UTC
100 points
32 comments24 min readLW link

Are there al­ter­na­tive to solv­ing value trans­fer and ex­trap­o­la­tion?

Stuart_Armstrong6 Dec 2021 18:53 UTC
19 points
7 comments5 min readLW link

Con­sid­er­a­tions on in­ter­ac­tion be­tween AI and ex­pected value of the fu­ture

Beth Barnes7 Dec 2021 2:46 UTC
64 points
28 comments4 min readLW link

Some thoughts on why ad­ver­sar­ial train­ing might be useful

Beth Barnes8 Dec 2021 1:28 UTC
9 points
5 comments3 min readLW link

The Plan

johnswentworth10 Dec 2021 23:41 UTC
235 points
77 comments14 min readLW link

Moore’s Law, AI, and the pace of progress

Veedrac11 Dec 2021 3:02 UTC
120 points
39 comments24 min readLW link

Sum­mary of the Acausal At­tack Is­sue for AIXI

Diffractor13 Dec 2021 8:16 UTC
14 points
6 comments4 min readLW link

Con­se­quen­tial­ism & corrigibility

Steven Byrnes14 Dec 2021 13:23 UTC
60 points
27 comments7 min readLW link

Should we rely on the speed prior for safety?

Marc Carauleanu14 Dec 2021 20:45 UTC
14 points
6 comments5 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
57 points
16 comments8 min readLW link1 review

Re­searcher in­cen­tives cause smoother progress on bench­marks

ryan_greenblatt21 Dec 2021 4:13 UTC
20 points
4 comments1 min readLW link

Self-Or­ganised Neu­ral Net­works: A sim­ple, nat­u­ral and effi­cient way to intelligence

D𝜋1 Jan 2022 23:24 UTC
41 points
51 comments44 min readLW link

Prizes for ELK proposals

paulfchristiano3 Jan 2022 20:23 UTC
141 points
156 comments7 min readLW link

D𝜋′s Spik­ing Network

lsusr4 Jan 2022 4:08 UTC
50 points
37 comments4 min readLW link

More Is Differ­ent for AI

jsteinhardt4 Jan 2022 19:30 UTC
137 points
22 comments3 min readLW link
(bounded-regret.ghost.io)

In­stru­men­tal Con­ver­gence For Real­is­tic Agent Objectives

TurnTrout22 Jan 2022 0:41 UTC
35 points
9 comments9 min readLW link

What’s Up With Con­fus­ingly Per­va­sive Con­se­quen­tial­ism?

Raemon20 Jan 2022 19:22 UTC
169 points
88 comments4 min readLW link

[In­tro to brain-like-AGI safety] 1. What’s the prob­lem & Why work on it now?

Steven Byrnes26 Jan 2022 15:23 UTC
119 points
19 comments23 min readLW link

Ar­gu­ments about Highly Reli­able Agent De­signs as a Use­ful Path to Ar­tifi­cial In­tel­li­gence Safety

27 Jan 2022 13:13 UTC
27 points
0 comments1 min readLW link
(arxiv.org)

Com­pet­i­tive pro­gram­ming with AlphaCode

Algon2 Feb 2022 16:49 UTC
58 points
37 comments15 min readLW link
(deepmind.com)

Thoughts on AGI safety from the top

jylin042 Feb 2022 20:06 UTC
35 points
3 comments32 min readLW link

Paradigm-build­ing from first prin­ci­ples: Effec­tive al­tru­ism, AGI, and alignment

Cameron Berg8 Feb 2022 16:12 UTC
24 points
5 comments14 min readLW link

[In­tro to brain-like-AGI safety] 3. Two sub­sys­tems: Learn­ing & Steering

Steven Byrnes9 Feb 2022 13:09 UTC
59 points
3 comments24 min readLW link

[In­tro to brain-like-AGI safety] 4. The “short-term pre­dic­tor”

Steven Byrnes16 Feb 2022 13:12 UTC
51 points
11 comments13 min readLW link

ELK Pro­posal: Think­ing Via A Hu­man Imitator

TurnTrout22 Feb 2022 1:52 UTC
28 points
6 comments11 min readLW link

Why I’m co-found­ing Aligned AI

Stuart_Armstrong17 Feb 2022 19:55 UTC
93 points
54 comments3 min readLW link

Im­pli­ca­tions of au­to­mated on­tol­ogy identification

18 Feb 2022 3:30 UTC
67 points
29 comments23 min readLW link

Align­ment re­search exercises

Richard_Ngo21 Feb 2022 20:24 UTC
146 points
17 comments8 min readLW link

[In­tro to brain-like-AGI safety] 5. The “long-term pre­dic­tor”, and TD learning

Steven Byrnes23 Feb 2022 14:44 UTC
41 points
25 comments21 min readLW link

How do new mod­els from OpenAI, Deep­Mind and An­thropic perform on Truth­fulQA?

Owain_Evans26 Feb 2022 12:46 UTC
42 points
3 comments11 min readLW link

Es­ti­mat­ing Brain-Equiv­a­lent Com­pute from Image Recog­ni­tion Al­gorithms

Gunnar_Zarncke27 Feb 2022 2:45 UTC
14 points
4 comments2 min readLW link

[Link] Aligned AI AMA

Stuart_Armstrong1 Mar 2022 12:01 UTC
18 points
0 comments1 min readLW link

[In­tro to brain-like-AGI safety] 6. Big pic­ture of mo­ti­va­tion, de­ci­sion-mak­ing, and RL

Steven Byrnes2 Mar 2022 15:26 UTC
41 points
13 comments16 min readLW link

[Question] Would (my­opic) gen­eral pub­lic good pro­duc­ers sig­nifi­cantly ac­cel­er­ate the de­vel­op­ment of AGI?

mako yass2 Mar 2022 23:47 UTC
25 points
10 comments1 min readLW link

[In­tro to brain-like-AGI safety] 7. From hard­coded drives to fore­sighted plans: A worked example

Steven Byrnes9 Mar 2022 14:28 UTC
56 points
0 comments9 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
31 points
6 comments23 min readLW link

Hu­mans pre­tend­ing to be robots pre­tend­ing to be human

Richard_Kennaway28 Mar 2022 15:13 UTC
27 points
15 comments1 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven Byrnes30 Mar 2022 13:24 UTC
34 points
4 comments21 min readLW link

AXRP Epi­sode 13 - First Prin­ci­ples of AGI Safety with Richard Ngo

DanielFilan31 Mar 2022 5:20 UTC
24 points
1 comment48 min readLW link

Un­con­trol­lable Su­per-Pow­er­ful Explosives

Sammy Martin2 Apr 2022 20:13 UTC
53 points
12 comments5 min readLW link

The case for Do­ing Some­thing Else (if Align­ment is doomed)

Rafael Harth5 Apr 2022 17:52 UTC
81 points
14 comments2 min readLW link

[In­tro to brain-like-AGI safety] 11. Safety ≠ al­ign­ment (but they’re close!)

Steven Byrnes6 Apr 2022 13:39 UTC
25 points
1 comment10 min readLW link

Strate­gic Con­sid­er­a­tions Re­gard­ing Autis­tic/​Literal AI

Chris_Leong6 Apr 2022 14:57 UTC
−1 points
2 comments2 min readLW link

DALL·E 2 by OpenAI

P.6 Apr 2022 14:17 UTC
44 points
51 comments1 min readLW link
(openai.com)

How to train your trans­former

p.b.7 Apr 2022 9:34 UTC
6 points
0 comments8 min readLW link

AMA Con­jec­ture, A New Align­ment Startup

adamShimi9 Apr 2022 9:43 UTC
46 points
42 comments1 min readLW link

Worse than an un­al­igned AGI

Shmi10 Apr 2022 3:35 UTC
−1 points
12 comments1 min readLW link

[Question] Did OpenAI let GPT out of the box?

ChristianKl16 Apr 2022 14:56 UTC
4 points
12 comments1 min readLW link

In­stru­men­tal Con­ver­gence To Offer Hope?

michael_mjd22 Apr 2022 1:56 UTC
12 points
7 comments3 min readLW link

[In­tro to brain-like-AGI safety] 13. Sym­bol ground­ing & hu­man so­cial instincts

Steven Byrnes27 Apr 2022 13:30 UTC
54 points
13 comments14 min readLW link

[In­tro to brain-like-AGI safety] 14. Con­trol­led AGI

Steven Byrnes11 May 2022 13:17 UTC
26 points
25 comments18 min readLW link

[Question] What’s keep­ing con­cerned ca­pa­bil­ities gain re­searchers from leav­ing the field?

sovran12 May 2022 12:16 UTC
19 points
4 comments1 min readLW link

Read­ing the ethi­cists: A re­view of ar­ti­cles on AI in the jour­nal Science and Eng­ineer­ing Ethics

Charlie Steiner18 May 2022 20:52 UTC
50 points
8 comments14 min readLW link

Con­fused why a “ca­pa­bil­ities re­search is good for al­ign­ment progress” po­si­tion isn’t dis­cussed more

Kaj_Sotala2 Jun 2022 21:41 UTC
132 points
26 comments4 min readLW link

I’m try­ing out “as­ter­oid mind­set”

Alex_Altair3 Jun 2022 13:35 UTC
85 points
5 comments4 min readLW link

An­nounc­ing the Align­ment of Com­plex Sys­tems Re­search Group

4 Jun 2022 4:10 UTC
79 points
18 comments5 min readLW link

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky5 Jun 2022 22:05 UTC
725 points
653 comments30 min readLW link

Yes, AI re­search will be sub­stan­tially cur­tailed if a lab causes a ma­jor disaster

lc14 Jun 2022 22:17 UTC
96 points
35 comments2 min readLW link

Lamda is not an LLM

Kevin19 Jun 2022 11:13 UTC
7 points
10 comments1 min readLW link
(www.wired.com)

Google’s new text-to-image model—Parti, a demon­stra­tion of scal­ing benefits

Kayden22 Jun 2022 20:00 UTC
32 points
4 comments1 min readLW link

[Link] OpenAI: Learn­ing to Play Minecraft with Video PreTrain­ing (VPT)

Aryeh Englander23 Jun 2022 16:29 UTC
53 points
3 comments1 min readLW link

An­nounc­ing Epoch: A re­search or­ga­ni­za­tion in­ves­ti­gat­ing the road to Trans­for­ma­tive AI

27 Jun 2022 13:55 UTC
95 points
2 comments2 min readLW link
(epochai.org)

Paper: Fore­cast­ing world events with neu­ral nets

1 Jul 2022 19:40 UTC
39 points
3 comments4 min readLW link

Naive Hy­pothe­ses on AI Alignment

Shoshannah Tekofsky2 Jul 2022 19:03 UTC
89 points
29 comments5 min readLW link

Hu­mans provide an un­tapped wealth of ev­i­dence about alignment

14 Jul 2022 2:31 UTC
175 points
92 comments10 min readLW link

Ex­am­ples of AI In­creas­ing AI Progress

TW12317 Jul 2022 20:06 UTC
104 points
14 comments1 min readLW link

Fore­cast­ing ML Bench­marks in 2023

jsteinhardt18 Jul 2022 2:50 UTC
36 points
19 comments12 min readLW link
(bounded-regret.ghost.io)

Ro­bust­ness to Scal­ing Down: More Im­por­tant Than I Thought

adamShimi23 Jul 2022 11:40 UTC
37 points
5 comments3 min readLW link

Com­par­ing Four Ap­proaches to In­ner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC
33 points
1 comment9 min readLW link

Where are the red lines for AI?

Karl von Wendt5 Aug 2022 9:34 UTC
23 points
8 comments6 min readLW link

Jack Clark on the re­al­ities of AI policy

Kaj_Sotala7 Aug 2022 8:44 UTC
66 points
3 comments3 min readLW link
(threadreaderapp.com)

GD’s Im­plicit Bias on Separable Data

Xander Davies17 Oct 2022 4:13 UTC
23 points
0 comments7 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar Jackson14 Aug 2022 10:31 UTC
6 points
1 comment5 min readLW link

Brain-like AGI pro­ject “ain­telope”

Gunnar_Zarncke14 Aug 2022 16:33 UTC
48 points
2 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of Grokking

15 Aug 2022 2:41 UTC
338 points
39 comments42 min readLW link
(colab.research.google.com)

What if we ap­proach AI safety like a tech­ni­cal en­g­ineer­ing safety problem

zeshen20 Aug 2022 10:29 UTC
30 points
5 comments7 min readLW link

AI art isn’t “about to shake things up”. It’s already here.

Davis_Kingsley22 Aug 2022 11:17 UTC
65 points
19 comments3 min readLW link

Some con­cep­tual al­ign­ment re­search projects

Richard_Ngo25 Aug 2022 22:51 UTC
168 points
14 comments3 min readLW link

Lev­el­ling Up in AI Safety Re­search Engineering

Gabe M2 Sep 2022 4:59 UTC
40 points
7 comments17 min readLW link

The shard the­ory of hu­man values

4 Sep 2022 4:28 UTC
202 points
57 comments24 min readLW link

Quintin’s al­ign­ment pa­pers roundup—week 1

Quintin Pope10 Sep 2022 6:39 UTC
119 points
5 comments9 min readLW link

LOVE in a sim­box is all you need

jacob_cannell28 Sep 2022 18:25 UTC
59 points
69 comments44 min readLW link

A shot at the di­a­mond-al­ign­ment problem

TurnTrout6 Oct 2022 18:29 UTC
77 points
53 comments15 min readLW link

More ex­am­ples of goal misgeneralization

7 Oct 2022 14:38 UTC
51 points
8 comments2 min readLW link
(deepmindsafetyresearch.medium.com)

[Cross­post] AlphaTen­sor, Taste, and the Scal­a­bil­ity of AI

jamierumbelow9 Oct 2022 19:42 UTC
16 points
4 comments1 min readLW link
(jamieonsoftware.com)

QAPR 4: In­duc­tive biases

Quintin Pope10 Oct 2022 22:08 UTC
63 points
2 comments18 min readLW link

In­finite Pos­si­bil­ity Space and the Shut­down Problem

magfrump18 Oct 2022 5:37 UTC
6 points
0 comments2 min readLW link
(www.magfrump.net)

Cruxes in Katja Grace’s Counterarguments

azsantosk16 Oct 2022 8:44 UTC
16 points
0 comments7 min readLW link

Deep­Mind on Strat­ego, an im­perfect in­for­ma­tion game

sanxiyn24 Oct 2022 5:57 UTC
15 points
9 comments1 min readLW link
(arxiv.org)

An­nounc­ing: What Fu­ture World? - Grow­ing the AI Gover­nance Community

DavidCorfield2 Nov 2022 1:24 UTC
1 point
0 comments1 min readLW link

Poster Ses­sion on AI Safety

Neil Crawford12 Nov 2022 3:50 UTC
7 points
6 comments1 min readLW link

AI will change the world, but won’t take it over by play­ing “3-di­men­sional chess”.

22 Nov 2022 18:57 UTC
103 points
86 comments24 min readLW link

A challenge for AGI or­ga­ni­za­tions, and a challenge for readers

1 Dec 2022 23:11 UTC
265 points
30 comments2 min readLW link

Towards Hodge-podge Alignment

Cleo Nardo19 Dec 2022 20:12 UTC
65 points
26 comments9 min readLW link

[AN #94]: AI al­ign­ment as trans­la­tion be­tween hu­mans and machines

Rohin Shah8 Apr 2020 17:10 UTC
11 points
0 comments7 min readLW link
(mailchi.mp)

[Question] What are the rel­a­tive speeds of AI ca­pa­bil­ities and AI safety?

NunoSempere24 Apr 2020 18:21 UTC
8 points
2 comments1 min readLW link

Seek­ing Power is Often Con­ver­gently In­stru­men­tal in MDPs

5 Dec 2019 2:33 UTC
153 points
38 comments16 min readLW link2 reviews
(arxiv.org)

“Don’t even think about hell”

emmab2 May 2020 8:06 UTC
6 points
2 comments1 min readLW link

[Question] AI Box­ing for Hard­ware-bound agents (aka the China al­ign­ment prob­lem)

Logan Zoellner8 May 2020 15:50 UTC
11 points
27 comments10 min readLW link

Point­ing to a Flower

johnswentworth18 May 2020 18:54 UTC
59 points
18 comments9 min readLW link

Learn­ing and ma­nipu­lat­ing learning

Stuart_Armstrong19 May 2020 13:02 UTC
39 points
5 comments10 min readLW link

[Question] Why aren’t we test­ing gen­eral in­tel­li­gence dis­tri­bu­tion?

B Jacobs26 May 2020 16:07 UTC
25 points
7 comments1 min readLW link

OpenAI an­nounces GPT-3

gwern29 May 2020 1:49 UTC
67 points
23 comments1 min readLW link
(arxiv.org)

GPT-3: a dis­ap­point­ing paper

nostalgebraist29 May 2020 19:06 UTC
65 points
44 comments8 min readLW link1 review

In­tro­duc­tion to Ex­is­ten­tial Risks from Ar­tifi­cial In­tel­li­gence, for an EA audience

JoshuaFox2 Jun 2020 8:30 UTC
10 points
1 comment1 min readLW link

Prepar­ing for “The Talk” with AI projects

Daniel Kokotajlo13 Jun 2020 23:01 UTC
64 points
16 comments3 min readLW link

[Question] What are the high-level ap­proaches to AI al­ign­ment?

Gordon Seidoh Worley16 Jun 2020 17:10 UTC
12 points
13 comments1 min readLW link

Re­sults of $1,000 Or­a­cle con­test!

Stuart_Armstrong17 Jun 2020 17:44 UTC
58 points
2 comments1 min readLW link

[Question] Like­li­hood of hy­per­ex­is­ten­tial catas­tro­phe from a bug?

Anirandis18 Jun 2020 16:23 UTC
13 points
27 comments1 min readLW link

AI Benefits Post 1: In­tro­duc­ing “AI Benefits”

Cullen22 Jun 2020 16:59 UTC
11 points
3 comments3 min readLW link

Goals and short descriptions

Michele Campolo2 Jul 2020 17:41 UTC
14 points
8 comments5 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo Volpato3 Jul 2020 16:01 UTC
23 points
2 comments5 min readLW link

AI Benefits Post 3: Direct and Indi­rect Ap­proaches to AI Benefits

Cullen6 Jul 2020 18:48 UTC
8 points
0 comments2 min readLW link

An­titrust-Com­pli­ant AI In­dus­try Self-Regulation

Cullen7 Jul 2020 20:53 UTC
9 points
3 comments1 min readLW link
(cullenokeefe.com)

Should AI Be Open?

Scott Alexander17 Dec 2015 8:25 UTC
20 points
3 comments13 min readLW link

Meta Pro­gram­ming GPT: A route to Su­per­in­tel­li­gence?

dmtea11 Jul 2020 14:51 UTC
10 points
7 comments4 min readLW link

The Dilemma of Worse Than Death Scenarios

arkaeik10 Jul 2018 9:18 UTC
5 points
18 comments4 min readLW link

[Question] What are the mostly likely ways AGI will emerge?

Craig Quiter14 Jul 2020 0:58 UTC
3 points
7 comments1 min readLW link

AI Benefits Post 4: Out­stand­ing Ques­tions on Select­ing Benefits

Cullen14 Jul 2020 17:26 UTC
4 points
4 comments5 min readLW link

Solv­ing Math Prob­lems by Relay

17 Jul 2020 15:32 UTC
98 points
26 comments7 min readLW link

AI Benefits Post 5: Out­stand­ing Ques­tions on Govern­ing Benefits

Cullen21 Jul 2020 16:46 UTC
4 points
0 comments4 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
10 comments2 min readLW link

[Question] “Do Noth­ing” util­ity func­tion, 3½ years later?

niplav20 Jul 2020 11:09 UTC
5 points
3 comments1 min readLW link

[AN #80]: Why AI risk might be solved with­out ad­di­tional in­ter­ven­tion from longtermists

Rohin Shah2 Jan 2020 18:20 UTC
34 points
94 comments10 min readLW link
(mailchi.mp)

Ac­cess to AI: a hu­man right?

dmtea25 Jul 2020 9:38 UTC
5 points
3 comments2 min readLW link

The Rise of Com­mon­sense Reasoning

DragonGod27 Jul 2020 19:01 UTC
8 points
0 comments1 min readLW link
(www.reddit.com)

AI and Efficiency

DragonGod27 Jul 2020 20:58 UTC
9 points
1 comment1 min readLW link
(openai.com)

FHI Re­port: How Will Na­tional Se­cu­rity Con­sid­er­a­tions Affect An­titrust De­ci­sions in AI? An Ex­am­i­na­tion of His­tor­i­cal Precedents

Cullen28 Jul 2020 18:34 UTC
2 points
0 comments1 min readLW link
(www.fhi.ox.ac.uk)

The “best pre­dic­tor is mal­i­cious op­ti­miser” problem

Donald Hobson29 Jul 2020 11:49 UTC
14 points
10 comments2 min readLW link

Suffi­ciently Ad­vanced Lan­guage Models Can Do Re­in­force­ment Learning

Past Account2 Aug 2020 15:32 UTC
21 points
7 comments7 min readLW link

[Question] What are the most im­por­tant pa­pers/​post/​re­sources to read to un­der­stand more of GPT-3?

adamShimi2 Aug 2020 20:53 UTC
22 points
4 comments1 min readLW link

[Question] What should an Ein­stein-like figure in Ma­chine Learn­ing do?

Razied5 Aug 2020 23:52 UTC
3 points
3 comments1 min readLW link

Book re­view: Ar­chi­tects of In­tel­li­gence by Martin Ford (2018)

Ofer11 Aug 2020 17:30 UTC
15 points
0 comments2 min readLW link

[Question] Will OpenAI’s work un­in­ten­tion­ally in­crease ex­is­ten­tial risks re­lated to AI?

adamShimi11 Aug 2020 18:16 UTC
50 points
56 comments1 min readLW link

Blog post: A tale of two re­search communities

Aryeh Englander12 Aug 2020 20:41 UTC
14 points
0 comments4 min readLW link

Map­ping Out Alignment

15 Aug 2020 1:02 UTC
42 points
0 comments5 min readLW link

My Un­der­stand­ing of Paul Chris­ti­ano’s Iter­ated Am­plifi­ca­tion AI Safety Re­search Agenda

Chi Nguyen15 Aug 2020 20:02 UTC
119 points
21 comments39 min readLW link

GPT-3, be­lief, and consistency

skybrian16 Aug 2020 23:12 UTC
18 points
7 comments2 min readLW link

[Question] What pre­cisely do we mean by AI al­ign­ment?

Gordon Seidoh Worley9 Dec 2018 2:23 UTC
27 points
8 comments1 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

[Question] Fore­cast­ing Thread: AI Timelines

22 Aug 2020 2:33 UTC
133 points
95 comments2 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_Armstrong24 Aug 2020 11:42 UTC
25 points
9 comments6 min readLW link

Proofs Sec­tion 2.3 (Up­dates, De­ci­sion The­ory)

Diffractor27 Aug 2020 7:49 UTC
7 points
0 comments31 min readLW link

Proofs Sec­tion 2.2 (Iso­mor­phism to Ex­pec­ta­tions)

Diffractor27 Aug 2020 7:52 UTC
7 points
0 comments46 min readLW link

Proofs Sec­tion 2.1 (The­o­rem 1, Lem­mas)

Diffractor27 Aug 2020 7:54 UTC
7 points
0 comments36 min readLW link

Proofs Sec­tion 1.1 (Ini­tial re­sults to LF-du­al­ity)

Diffractor27 Aug 2020 7:59 UTC
7 points
0 comments20 min readLW link

Proofs Sec­tion 1.2 (Mix­tures, Up­dates, Push­for­wards)

Diffractor27 Aug 2020 7:57 UTC
7 points
0 comments14 min readLW link

Ba­sic In­framea­sure Theory

Diffractor27 Aug 2020 8:02 UTC
35 points
16 comments25 min readLW link

Belief Func­tions And De­ci­sion Theory

Diffractor27 Aug 2020 8:00 UTC
15 points
8 comments39 min readLW link

Tech­ni­cal model re­fine­ment formalism

Stuart_Armstrong27 Aug 2020 11:54 UTC
19 points
0 comments6 min readLW link

Pong from pix­els with­out read­ing “Pong from Pix­els”

Ian McKenzie29 Aug 2020 17:26 UTC
15 points
1 comment7 min readLW link

Reflec­tions on AI Timelines Fore­cast­ing Thread

Amandango1 Sep 2020 1:42 UTC
53 points
7 comments5 min readLW link

on “learn­ing to sum­ma­rize”

nostalgebraist12 Sep 2020 3:20 UTC
25 points
13 comments8 min readLW link
(nostalgebraist.tumblr.com)

[Question] The uni­ver­sal­ity of com­pu­ta­tion and mind de­sign space

alanf12 Sep 2020 14:58 UTC
1 point
7 comments1 min readLW link

Clar­ify­ing “What failure looks like”

Sam Clarke20 Sep 2020 20:40 UTC
95 points
14 comments17 min readLW link

Hu­man Bi­ases that Ob­scure AI Progress

Danielle Ensign25 Sep 2020 0:24 UTC
42 points
2 comments4 min readLW link

[Question] Com­pe­tence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
6 points
4 comments1 min readLW link

AGI safety from first prin­ci­ples: Alignment

Richard_Ngo1 Oct 2020 3:13 UTC
56 points
2 comments13 min readLW link

[Question] GPT-3 + GAN

stick10917 Oct 2020 7:58 UTC
4 points
4 comments1 min readLW link

Book Re­view: Re­in­force­ment Learn­ing by Sut­ton and Barto

billmei20 Oct 2020 19:40 UTC
52 points
3 comments10 min readLW link

GPT-X, Paper­clip Max­i­mizer? An­a­lyz­ing AGI and Fi­nal Goals

meanderingmoose22 Oct 2020 14:33 UTC
8 points
1 comment6 min readLW link

Con­tain­ing the AI… In­side a Si­mu­lated Reality

HumaneAutomation31 Oct 2020 16:16 UTC
1 point
9 comments2 min readLW link

Why those who care about catas­trophic and ex­is­ten­tial risk should care about au­tonomous weapons

aaguirre11 Nov 2020 15:22 UTC
60 points
20 comments19 min readLW link

Euro­pean Master’s Pro­grams in Ma­chine Learn­ing, Ar­tifi­cial In­tel­li­gence, and re­lated fields

Master Programs ML/AI14 Nov 2020 15:51 UTC
32 points
8 comments1 min readLW link

Should we post­pone AGI un­til we reach safety?

otto.barten18 Nov 2020 15:43 UTC
27 points
36 comments3 min readLW link

Com­mit­ment and cred­i­bil­ity in mul­ti­po­lar AI scenarios

anni_leskela4 Dec 2020 18:48 UTC
25 points
3 comments18 min readLW link

[Question] AI Win­ter Is Com­ing—How to profit from it?

maximkazhenkov5 Dec 2020 20:23 UTC
10 points
7 comments1 min readLW link

An­nounc­ing the Tech­ni­cal AI Safety Podcast

Quinn7 Dec 2020 18:51 UTC
42 points
6 comments2 min readLW link
(technical-ai-safety.libsyn.com)

All GPT skills are translation

p.b.13 Dec 2020 20:06 UTC
4 points
0 comments2 min readLW link

[Question] Judg­ing AGI Output

cy6erlion14 Dec 2020 12:43 UTC
3 points
0 comments2 min readLW link

Risk Map of AI Systems

15 Dec 2020 9:16 UTC
25 points
3 comments8 min readLW link

AI Align­ment, Philo­soph­i­cal Plu­ral­ism, and the Rele­vance of Non-Western Philosophy

xuan1 Jan 2021 0:08 UTC
30 points
21 comments20 min readLW link

Are we all mis­al­igned?

Mateusz Mazurkiewicz3 Jan 2021 2:42 UTC
11 points
0 comments5 min readLW link

[Question] What do we *re­ally* ex­pect from a well-al­igned AI?

jan betley4 Jan 2021 20:57 UTC
8 points
10 comments1 min readLW link

Eight claims about multi-agent AGI safety

Richard_Ngo7 Jan 2021 13:34 UTC
73 points
18 comments5 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
92 points
14 comments12 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lukas Finnveden10 Jan 2021 18:48 UTC
15 points
12 comments11 min readLW link

[Question] Poll: Which vari­ables are most strate­gi­cally rele­vant?

22 Jan 2021 17:17 UTC
32 points
34 comments1 min readLW link

AISU 2021

Linda Linsefors30 Jan 2021 17:40 UTC
28 points
2 comments1 min readLW link

Deepmind has made a general inductor (“Making sense of sensory input”)

mako yass2 Feb 2021 2:54 UTC
48 points
10 comments1 min readLW link
(www.sciencedirect.com)

Counterfactual Planning in AGI Systems

Koen.Holtman3 Feb 2021 13:54 UTC
7 points
0 comments5 min readLW link

[AN #136]: How well will GPT-N perform on downstream tasks?

Rohin Shah3 Feb 2021 18:10 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

Formal Solution to the Inner Alignment Problem

michaelcohen18 Feb 2021 14:51 UTC
47 points
123 comments2 min readLW link

TASP Ep 3 - Optimal Policies Tend to Seek Power

Quinn11 Mar 2021 1:44 UTC
24 points
0 comments1 min readLW link
(technical-ai-safety.libsyn.com)

Phylactery Decision Theory

Bunthut2 Apr 2021 20:55 UTC
14 points
6 comments2 min readLW link

Predictive Coding has been Unified with Backpropagation

lsusr2 Apr 2021 21:42 UTC
166 points
44 comments2 min readLW link

[Question] What if we could use the theory of Mechanism Design from Game Theory as a medium achieve AI Alignment?

farari74 Apr 2021 12:56 UTC
4 points
0 comments1 min readLW link

Another (outer) alignment failure story

paulfchristiano7 Apr 2021 20:12 UTC
210 points
38 comments12 min readLW link

A System For Evolving Increasingly General Artificial Intelligence From Current Technologies

Tsang Chung Shu8 Apr 2021 21:37 UTC
1 point
3 comments11 min readLW link

April 2021 Deep Dive: Transformers and GPT-3

adamShimi1 May 2021 11:18 UTC
30 points
6 comments7 min readLW link

[Question] [timeboxed exercise] write me your model of AI human-existential safety and the alignment problems in 15 minutes

Quinn4 May 2021 19:10 UTC
6 points
2 comments1 min readLW link

Mostly questions about Dumb AI Kernels

HorizonHeld12 May 2021 22:00 UTC
1 point
1 comment9 min readLW link

Thoughts on Iterated Distillation and Amplification

Waddington11 May 2021 21:32 UTC
9 points
2 comments20 min readLW link

How do we build organisations that want to build safe AI?

sxae12 May 2021 15:08 UTC
4 points
4 comments9 min readLW link

[Question] Who has argued in detail that a current AI system is phenomenally conscious?

Robbo14 May 2021 22:03 UTC
3 points
2 comments1 min readLW link

How I Learned to Stop Worrying and Love MUM

Waddington20 May 2021 7:57 UTC
2 points
0 comments3 min readLW link

AI Safety Research Project Ideas

Owain_Evans21 May 2021 13:39 UTC
58 points
2 comments3 min readLW link

[Question] How one uses set theory for alignment problem?

Valentin202629 May 2021 0:28 UTC
8 points
6 comments1 min readLW link

Reflection of Hierarchical Relationship via Nuanced Conditioning of Game Theory Approach for AI Development and Utilization

Kyoung-cheol Kim4 Jun 2021 7:20 UTC
2 points
2 comments9 min readLW link

Review of “Learning Normativity: A Research Agenda”

6 Jun 2021 13:33 UTC
34 points
0 comments6 min readLW link

Hardware for Transformative AI

MrThink22 Jun 2021 18:13 UTC
17 points
7 comments2 min readLW link

Alex Turner’s Research, Comprehensive Information Gathering

adamShimi23 Jun 2021 9:44 UTC
15 points
3 comments3 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

23 Jun 2021 23:25 UTC
70 points
7 comments9 min readLW link

The Language of Bird

johnswentworth27 Jun 2021 4:44 UTC
44 points
9 comments2 min readLW link

[Question] What are some claims or opinions about multi-multi delegation you’ve seen in the memeplex that you think deserve scrutiny?

Quinn27 Jun 2021 17:44 UTC
17 points
6 comments2 min readLW link

An examination of Metaculus’ resolved AI predictions and their implications for AI timelines

CharlesD20 Jul 2021 9:08 UTC
28 points
0 comments7 min readLW link

[Question] How should my timelines influence my career choice?

Tom Lieberum3 Aug 2021 10:14 UTC
13 points
10 comments1 min readLW link

What is the problem?

Carlos Ramirez11 Aug 2021 22:33 UTC
7 points
0 comments6 min readLW link

OpenAI Codex: First Impressions

specbug13 Aug 2021 16:52 UTC
49 points
8 comments4 min readLW link
(sixeleven.in)

[Question] 1h-volunteers needed for a small AI Safety-related research project

PabloAMC16 Aug 2021 17:53 UTC
2 points
0 comments1 min readLW link

Extraction of human preferences 👨→🤖

arunraja-hub24 Aug 2021 16:34 UTC
18 points
2 comments5 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Obstacles to gradient hacking

leogao5 Sep 2021 22:42 UTC
21 points
11 comments4 min readLW link

[Question] Conditional on the first AGI being aligned correctly, is a good outcome even still likely?

iamthouthouarti6 Sep 2021 17:30 UTC
2 points
1 comment1 min readLW link

Distinguishing AI takeover scenarios

8 Sep 2021 16:19 UTC
67 points
11 comments14 min readLW link

Paths To High-Level Machine Intelligence

Daniel_Eth10 Sep 2021 13:21 UTC
67 points
8 comments33 min readLW link

How truthful is GPT-3? A benchmark for language models

Owain_Evans16 Sep 2021 10:09 UTC
56 points
24 comments6 min readLW link

Investigating AI Takeover Scenarios

Sammy Martin17 Sep 2021 18:47 UTC
27 points
1 comment27 min readLW link

A sufficiently paranoid non-Friendly AGI might self-modify itself to become Friendly

RomanS22 Sep 2021 6:29 UTC
5 points
2 comments1 min readLW link

Towards Deconfusing Gradient Hacking

leogao24 Oct 2021 0:43 UTC
25 points
1 comment12 min readLW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Ben Smith29 Sep 2021 17:09 UTC
27 points
8 comments10 min readLW link

Meta learning to gradient hack

Quintin Pope1 Oct 2021 19:25 UTC
54 points
11 comments3 min readLW link

Proposal: Scaling laws for RL generalization

axioman1 Oct 2021 21:32 UTC
14 points
10 comments11 min readLW link

A Framework of Prediction Technologies

isaduan3 Oct 2021 10:26 UTC
8 points
2 comments9 min readLW link

AI Prediction Services and Risks of War

isaduan3 Oct 2021 10:26 UTC
3 points
2 comments10 min readLW link

Possible Worlds after Prediction Take-off

isaduan3 Oct 2021 10:26 UTC
5 points
0 comments4 min readLW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Commentary on “AGI Safety From First Principles by Richard Ngo, September 2020”

Robert Kralisch14 Oct 2021 15:11 UTC
3 points
0 comments20 min readLW link

The AGI needs to be honest

rokosbasilisk16 Oct 2021 19:24 UTC
2 points
12 comments2 min readLW link

“Redundant” AI Alignment

Mckay Jensen16 Oct 2021 21:32 UTC
12 points
3 comments1 min readLW link
(quevivasbien.github.io)

[MLSN #1]: ICLR Safety Paper Roundup

Dan_H18 Oct 2021 15:19 UTC
59 points
1 comment2 min readLW link

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

Owain_Evans22 Oct 2021 16:23 UTC
31 points
15 comments1 min readLW link

Hegel vs. GPT-3

Bezzi27 Oct 2021 5:55 UTC
9 points
21 comments2 min readLW link

Google announces Pathways: new generation multitask AI Architecture

Ozyrus29 Oct 2021 11:55 UTC
6 points
1 comment1 min readLW link
(blog.google)

What is the most evil AI that we could build, today?

ThomasJ1 Nov 2021 19:58 UTC
−2 points
14 comments1 min readLW link

Why we need prosocial agents

Akbir Khan2 Nov 2021 15:19 UTC
6 points
0 comments2 min readLW link

Possible research directions to improve the mechanistic explanation of neural networks

delton1379 Nov 2021 2:36 UTC
29 points
8 comments9 min readLW link

What are red flags for Neural Network suffering?

Marius Hobbhahn8 Nov 2021 12:51 UTC
26 points
15 comments12 min readLW link

Using Brain-Computer Interfaces to get more data for AI alignment

Robbo7 Nov 2021 0:00 UTC
35 points
10 comments7 min readLW link

Hardcode the AGI to need our approval indefinitely?

MichaelStJules11 Nov 2021 7:04 UTC
2 points
2 comments1 min readLW link

Stop button: towards a causal solution

tailcalled12 Nov 2021 19:09 UTC
23 points
37 comments9 min readLW link

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents

PabloAMC13 Nov 2021 1:44 UTC
4 points
0 comments7 min readLW link

What would we do if alignment were futile?

Grant Demaree14 Nov 2021 8:09 UTC
73 points
43 comments3 min readLW link

Attempted Gears Analysis of AGI Intervention Discussion With Eliezer

Zvi15 Nov 2021 3:50 UTC
204 points
48 comments16 min readLW link
(thezvi.wordpress.com)

A positive case for how we might succeed at prosaic AI alignment

evhub16 Nov 2021 1:49 UTC
78 points
47 comments6 min readLW link

Super intelligent AIs that don’t require alignment

Yair Halberstadt16 Nov 2021 19:55 UTC
10 points
2 comments6 min readLW link

Some real examples of gradient hacking

Oliver Sourbut22 Nov 2021 0:11 UTC
15 points
8 comments2 min readLW link

[linkpost] Acquisition of Chess Knowledge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

AI Tracker: monitoring current and near-future risks from superscale models

23 Nov 2021 19:16 UTC
64 points
13 comments3 min readLW link
(aitracker.org)

AI Safety Needs Great Engineers

Andy Jones23 Nov 2021 15:40 UTC
78 points
45 comments4 min readLW link

HIRING: Inform and shape a new project on AI safety at Partnership on AI

Madhulika Srikumar24 Nov 2021 8:27 UTC
6 points
0 comments1 min readLW link

How to measure FLOP/s for Neural Networks empirically?

Marius Hobbhahn29 Nov 2021 15:18 UTC
16 points
5 comments7 min readLW link

AI Governance Fundamentals—Curriculum and Application

Mau30 Nov 2021 2:19 UTC
17 points
0 comments16 min readLW link

Behavior Cloning is Miscalibrated

leogao5 Dec 2021 1:36 UTC
53 points
3 comments3 min readLW link

ML Alignment Theory Program under Evan Hubinger

6 Dec 2021 0:03 UTC
82 points
3 comments2 min readLW link

Information bottleneck for counterfactual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

Modeling Failure Modes of High-Level Machine Intelligence

6 Dec 2021 13:54 UTC
54 points
1 comment12 min readLW link

Finding the multiple ground truths of CoinRun and image classification

Stuart_Armstrong8 Dec 2021 18:13 UTC
15 points
3 comments2 min readLW link

[Question] What alignment-related concepts should be better known in the broader ML community?

Lauro Langosco9 Dec 2021 20:44 UTC
6 points
4 comments1 min readLW link

Understanding Gradient Hacking

peterbarnett10 Dec 2021 15:58 UTC
30 points
5 comments30 min readLW link

What’s the backward-forward FLOP ratio for Neural Networks?

13 Dec 2021 8:54 UTC
17 points
8 comments10 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
111 points
9 comments15 min readLW link

Disen­tan­gling Per­spec­tives On Strat­egy-Steal­ing in AI Safety

shawnghu18 Dec 2021 20:13 UTC
20 points
1 comment11 min readLW link

De­mand­ing and De­sign­ing Aligned Cog­ni­tive Architectures

Koen.Holtman21 Dec 2021 17:32 UTC
8 points
5 comments5 min readLW link

Po­ten­tial gears level ex­pla­na­tions of smooth progress

ryan_greenblatt22 Dec 2021 18:05 UTC
4 points
2 comments2 min readLW link

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
142 points
4 comments3 min readLW link
(transformer-circuits.pub)

Gra­di­ent Hack­ing via Schel­ling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
33 points
4 comments4 min readLW link

Reader-gen­er­ated Essays

Henrik Karlsson3 Jan 2022 8:56 UTC
17 points
0 comments6 min readLW link
(escapingflatland.substack.com)

Brain Effi­ciency: Much More than You Wanted to Know

jacob_cannell6 Jan 2022 3:38 UTC
195 points
87 comments28 min readLW link

Un­der­stand­ing the two-head strat­egy for teach­ing ML to an­swer ques­tions honestly

Adam Scherlis11 Jan 2022 23:24 UTC
28 points
1 comment10 min readLW link

Plan B in AI Safety approach

avturchin13 Jan 2022 12:03 UTC
33 points
9 comments2 min readLW link

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC
65 points
14 comments13 min readLW link

How I’m think­ing about GPT-N

delton13717 Jan 2022 17:11 UTC
46 points
21 comments18 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
26 points
7 comments10 min readLW link

[Question] How fea­si­ble/​costly would it be to train a very large AI model on dis­tributed clusters of GPUs?

Anonymous25 Jan 2022 19:20 UTC
7 points
4 comments1 min readLW link

Causal­ity, Trans­for­ma­tive AI and al­ign­ment—part I

Marius Hobbhahn27 Jan 2022 16:18 UTC
13 points
11 comments8 min readLW link

2+2: On­tolog­i­cal Framework

Lyrialtus1 Feb 2022 1:07 UTC
−15 points
2 comments12 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
82 points
10 comments11 min readLW link

Paradigm-build­ing: Introduction

Cameron Berg8 Feb 2022 0:06 UTC
25 points
0 comments2 min readLW link

Paradigm-build­ing: The hi­er­ar­chi­cal ques­tion framework

Cameron Berg9 Feb 2022 16:47 UTC
11 points
16 comments3 min readLW link

Ques­tion 1: Pre­dicted ar­chi­tec­ture of AGI learn­ing al­gorithm(s)

Cameron Berg10 Feb 2022 17:22 UTC
12 points
1 comment7 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

Ques­tion 4: Im­ple­ment­ing the con­trol proposals

Cameron Berg13 Feb 2022 17:12 UTC
6 points
2 comments5 min readLW link

Ques­tion 5: The timeline hyperparameter

Cameron Berg14 Feb 2022 16:38 UTC
5 points
3 comments7 min readLW link

Paradigm-build­ing: Con­clu­sion and prac­ti­cal takeaways

Cameron Berg15 Feb 2022 16:11 UTC
2 points
1 comment2 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek Hebbar8 Feb 2022 12:00 UTC
23 points
1 comment15 min readLW link

Me­tac­u­lus launches con­test for es­says with quan­ti­ta­tive pre­dic­tions about AI

8 Feb 2022 16:07 UTC
25 points
2 comments1 min readLW link
(www.metaculus.com)

Hy­poth­e­sis: gra­di­ent de­scent prefers gen­eral circuits

Quintin Pope8 Feb 2022 21:12 UTC
40 points
26 comments11 min readLW link

Com­pute Trends Across Three eras of Ma­chine Learning

16 Feb 2022 14:18 UTC
91 points
13 comments2 min readLW link

[Question] Is the com­pe­ti­tion/​co­op­er­a­tion be­tween sym­bolic AI and statis­ti­cal AI (ML) about his­tor­i­cal ap­proach to re­search /​ en­g­ineer­ing, or is it more fun­da­men­tally about what in­tel­li­gent agents “are”?

Edward Hammond17 Feb 2022 23:11 UTC
1 point
1 comment2 min readLW link

HCH and Ad­ver­sar­ial Questions

David Udell19 Feb 2022 0:52 UTC
15 points
7 comments26 min readLW link

Thoughts on Danger­ous Learned Optimization

peterbarnett19 Feb 2022 10:46 UTC
4 points
2 comments4 min readLW link

Rel­a­tivized Defi­ni­tions as a Method to Sidestep the Löbian Obstacle

homotowat27 Feb 2022 6:37 UTC
27 points
4 comments7 min readLW link

What we know about ma­chine learn­ing’s repli­ca­tion crisis

Younes Kamel5 Mar 2022 23:55 UTC
35 points
4 comments6 min readLW link
(youneskamel.substack.com)

Pro­ject­ing com­pute trends in Ma­chine Learning

7 Mar 2022 15:32 UTC
59 points
5 comments6 min readLW link

[Sur­vey] Ex­pec­ta­tions of a Post-ASI Order

Lone Pine9 Mar 2022 19:17 UTC
5 points
0 comments1 min readLW link

A Longlist of The­o­ries of Im­pact for Interpretability

Neel Nanda11 Mar 2022 14:55 UTC
106 points
29 comments5 min readLW link

New GPT3 Im­pres­sive Ca­pa­bil­ities—In­struc­tGPT3 [1/​2]

simeon_c13 Mar 2022 10:58 UTC
71 points
10 comments7 min readLW link

Phase tran­si­tions and AGI

17 Mar 2022 17:22 UTC
44 points
19 comments9 min readLW link
(www.metaculus.com)

Can we simu­late hu­man evolu­tion to cre­ate a some­what al­igned AGI?

Thomas Kwa28 Mar 2022 22:55 UTC
21 points
7 comments7 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

4 Apr 2022 12:59 UTC
69 points
20 comments16 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.5 Apr 2022 21:23 UTC
11 points
1 comment3 min readLW link

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.6 Apr 2022 13:31 UTC
20 points
0 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas Finnveden6 Apr 2022 13:05 UTC
80 points
19 comments2 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.7 Apr 2022 12:25 UTC
8 points
2 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Play­ing with DALL·E 2

Dave Orr7 Apr 2022 18:49 UTC
165 points
116 comments6 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
3 points
0 comments2 min readLW link

Hyper­bolic takeoff

Ege Erdil9 Apr 2022 15:57 UTC
17 points
8 comments10 min readLW link
(www.metaculus.com)

Elicit: Lan­guage Models as Re­search Assistants

9 Apr 2022 14:56 UTC
70 points
7 comments13 min readLW link

Is it time to start think­ing about what AI Friendli­ness means?

Victor Novikov11 Apr 2022 9:32 UTC
18 points
6 comments3 min readLW link

What more com­pute does for brain-like mod­els: re­sponse to Rohin

Nathan Helm-Burger13 Apr 2022 3:40 UTC
22 points
14 comments11 min readLW link

Align­ment and Deep Learning

Aiyen17 Apr 2022 0:02 UTC
44 points
35 comments8 min readLW link

[$20K in Prizes] AI Safety Ar­gu­ments Competition

26 Apr 2022 16:13 UTC
74 points
543 comments3 min readLW link

SERI ML Align­ment The­ory Schol­ars Pro­gram 2022

27 Apr 2022 0:43 UTC
56 points
6 comments3 min readLW link

[Question] What is a train­ing “step” vs. “epi­sode” in ma­chine learn­ing?

Evan R. Murphy28 Apr 2022 21:53 UTC
9 points
4 comments1 min readLW link

Prize for Align­ment Re­search Tasks

29 Apr 2022 8:57 UTC
63 points
36 comments10 min readLW link

Quick Thoughts on A.I. Governance

Nicholas / Heather Kross30 Apr 2022 14:49 UTC
66 points
8 comments2 min readLW link
(www.thinkingmuchbetter.com)

What DALL-E 2 can and can­not do

Swimmer963 (Miranda Dixon-Luinenburg) 1 May 2022 23:51 UTC
351 points
305 comments9 min readLW link

Open Prob­lems in Nega­tive Side Effect Minimization

6 May 2022 9:37 UTC
12 points
7 comments17 min readLW link

[Linkpost] diffu­sion mag­ne­tizes man­i­folds (DALL-E 2 in­tu­ition build­ing)

Paul Bricman7 May 2022 11:01 UTC
1 point
0 comments1 min readLW link
(paulbricman.com)

Up­dat­ing Utility Functions

9 May 2022 9:44 UTC
36 points
7 comments8 min readLW link

Con­di­tions for math­e­mat­i­cal equiv­alence of Stochas­tic Gra­di­ent Des­cent and Nat­u­ral Selection

Oliver Sourbut9 May 2022 21:38 UTC
54 points
12 comments10 min readLW link

AI safety should be made more ac­cessible us­ing non text-based media

Massimog10 May 2022 3:14 UTC
2 points
4 comments4 min readLW link

The limits of AI safety via debate

Marius Hobbhahn10 May 2022 13:33 UTC
28 points
7 comments10 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

Gato as the Dawn of Early AGI

David Udell15 May 2022 6:52 UTC
84 points
29 comments12 min readLW link

Is AI Progress Im­pos­si­ble To Pre­dict?

alyssavance15 May 2022 18:30 UTC
276 points
38 comments2 min readLW link

Deep­Mind’s gen­er­al­ist AI, Gato: A non-tech­ni­cal explainer

16 May 2022 21:21 UTC
57 points
6 comments6 min readLW link

Gato’s Gen­er­al­i­sa­tion: Pre­dic­tions and Ex­per­i­ments I’d Like to See

Oliver Sourbut18 May 2022 7:15 UTC
43 points
3 comments10 min readLW link

Un­der­stand­ing Gato’s Su­per­vised Re­in­force­ment Learning

lorepieri18 May 2022 11:08 UTC
3 points
5 comments1 min readLW link
(lorenzopieri.com)

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

[Linkpost] A Chi­nese AI op­ti­mized for killing

RomanS3 Jun 2022 9:17 UTC
−2 points
4 comments1 min readLW link

Give the AI safe tools

Adam Jermyn3 Jun 2022 17:04 UTC
3 points
0 comments4 min readLW link

Towards a For­mal­i­sa­tion of Re­turns on Cog­ni­tive Rein­vest­ment (Part 1)

DragonGod4 Jun 2022 18:42 UTC
17 points
8 comments13 min readLW link

Give the model a model-builder

Adam Jermyn6 Jun 2022 12:21 UTC
3 points
0 comments5 min readLW link

AGI Safety FAQ /​ all-dumb-ques­tions-al­lowed thread

Aryeh Englander7 Jun 2022 5:47 UTC
221 points
515 comments4 min readLW link

Em­bod­i­ment is Indis­pens­able for AGI

P. G. Keerthana Gopalakrishnan7 Jun 2022 21:31 UTC
6 points
1 comment6 min readLW link
(keerthanapg.com)

You Only Get One Shot: an In­tu­ition Pump for Embed­ded Agency

Oliver Sourbut9 Jun 2022 21:38 UTC
22 points
4 comments2 min readLW link

Sum­mary of “AGI Ruin: A List of Lethal­ities”

Stephen McAleese10 Jun 2022 22:35 UTC
32 points
2 comments8 min readLW link

Poorly-Aimed Death Rays

Thane Ruthenis11 Jun 2022 18:29 UTC
43 points
5 comments4 min readLW link

ELK Pro­posal—Make the Re­porter care about the Pre­dic­tor’s beliefs

11 Jun 2022 22:53 UTC
8 points
0 comments6 min readLW link

Grokking “Semi-in­for­ma­tive pri­ors over AI timelines”

anson.ho12 Jun 2022 22:17 UTC
15 points
7 comments14 min readLW link

[Question] Favourite new AI pro­duc­tivity tools?

Gabe M15 Jun 2022 1:08 UTC
14 points
5 comments1 min readLW link

Con­tra Hofs­tadter on GPT-3 Nonsense

rictic15 Jun 2022 21:53 UTC
235 points
22 comments2 min readLW link

[Question] What if LaMDA is in­deed sen­tient /​ self-aware /​ worth hav­ing rights?

RomanS16 Jun 2022 9:10 UTC
22 points
13 comments1 min readLW link

Ten ex­per­i­ments in mod­u­lar­ity, which we’d like you to run!

16 Jun 2022 9:17 UTC
59 points
2 comments9 min readLW link

Align­ment re­search for “meta” purposes

acylhalide16 Jun 2022 14:03 UTC
15 points
0 comments1 min readLW link

[Question] AI mis­al­ign­ment risk from GPT-like sys­tems?

fiso6419 Jun 2022 17:35 UTC
10 points
8 comments1 min readLW link

Half-baked al­ign­ment idea: train­ing to generalize

Aaron Bergman19 Jun 2022 20:16 UTC
7 points
2 comments4 min readLW link

Get­ting from an un­al­igned AGI to an al­igned AGI?

Tor Økland Barstad21 Jun 2022 12:36 UTC
9 points
7 comments9 min readLW link

Miti­gat­ing the dam­age from un­al­igned ASI by co­op­er­at­ing with aliens that don’t ex­ist yet

MSRayne21 Jun 2022 16:12 UTC
−8 points
7 comments6 min readLW link

AI Train­ing Should Allow Opt-Out

alyssavance23 Jun 2022 1:33 UTC
76 points
13 comments6 min readLW link

Up­dated Defer­ence is not a strong ar­gu­ment against the util­ity un­cer­tainty ap­proach to alignment

Ivan Vendrov24 Jun 2022 19:32 UTC
20 points
8 comments4 min readLW link

SunPJ in Alenia

FlorianH25 Jun 2022 19:39 UTC
7 points
19 comments8 min readLW link
(plausiblestuff.com)

Con­di­tion­ing Gen­er­a­tive Models

Adam Jermyn25 Jun 2022 22:15 UTC
22 points
18 comments10 min readLW link

Train­ing Trace Pri­ors and Speed Priors

Adam Jermyn26 Jun 2022 18:07 UTC
17 points
0 comments3 min readLW link

De­liber­a­tion Every­where: Sim­ple Examples

Oliver Sourbut27 Jun 2022 17:26 UTC
14 points
0 comments15 min readLW link

De­liber­a­tion, Re­ac­tions, and Con­trol: Ten­ta­tive Defi­ni­tions and a Res­tate­ment of In­stru­men­tal Convergence

Oliver Sourbut27 Jun 2022 17:25 UTC
10 points
0 comments11 min readLW link

For­mal Philos­o­phy and Align­ment Pos­si­ble Projects

Whispermute30 Jun 2022 10:42 UTC
33 points
5 comments8 min readLW link

Refram­ing the AI Risk

Thane Ruthenis1 Jul 2022 18:44 UTC
26 points
7 comments6 min readLW link

Trends in GPU price-performance

1 Jul 2022 15:51 UTC
85 points
10 comments1 min readLW link
(epochai.org)

Fol­low along with Columbia EA’s Ad­vanced AI Safety Fel­low­ship!

RohanS2 Jul 2022 17:45 UTC
3 points
0 comments2 min readLW link
(forum.effectivealtruism.org)

Can we achieve AGI Align­ment by bal­anc­ing mul­ti­ple hu­man ob­jec­tives?

Ben Smith3 Jul 2022 2:51 UTC
11 points
1 comment4 min readLW link

We Need a Con­soli­dated List of Bad AI Align­ment Solutions

Double4 Jul 2022 6:54 UTC
9 points
14 comments1 min readLW link

A com­pressed take on re­cent disagreements

kman4 Jul 2022 4:39 UTC
33 points
9 comments1 min readLW link

My Most Likely Rea­son to Die Young is AI X-Risk

AISafetyIsNotLongtermist4 Jul 2022 17:08 UTC
61 points
24 comments4 min readLW link
(forum.effectivealtruism.org)

The cu­ri­ous case of Pretty Good hu­man in­ner/​outer alignment

PavleMiha5 Jul 2022 19:04 UTC
41 points
45 comments4 min readLW link

In­tro­duc­ing the Fund for Align­ment Re­search (We’re Hiring!)

6 Jul 2022 2:07 UTC
59 points
0 comments4 min readLW link

Outer vs in­ner mis­al­ign­ment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
43 points
4 comments9 min readLW link

Re­sponse to Blake Richards: AGI, gen­er­al­ity, al­ign­ment, & loss functions

Steven Byrnes12 Jul 2022 13:56 UTC
59 points
9 comments15 min readLW link

Goal Align­ment Is Ro­bust To the Sharp Left Turn

Thane Ruthenis13 Jul 2022 20:23 UTC
45 points
15 comments4 min readLW link

De­cep­tion?! I ain’t got time for that!

Paul Colognese18 Jul 2022 0:06 UTC
50 points
5 comments13 min readLW link

Four ques­tions I ask AI safety researchers

Akash17 Jul 2022 17:25 UTC
17 points
0 comments1 min readLW link

A dis­til­la­tion of Evan Hub­inger’s train­ing sto­ries (for SERI MATS)

Daphne_W18 Jul 2022 3:38 UTC
15 points
1 comment10 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
40 points
8 comments22 min readLW link

In­for­ma­tion the­o­retic model anal­y­sis may not lend much in­sight, but we may have been do­ing them wrong!

Garrett Baker24 Jul 2022 0:42 UTC
7 points
0 comments10 min readLW link

How to Diver­sify Con­cep­tual Align­ment: the Model Be­hind Refine

adamShimi20 Jul 2022 10:44 UTC
78 points
11 comments8 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTrout25 Jul 2022 0:03 UTC
252 points
97 comments10 min readLW link

What En­vi­ron­ment Prop­er­ties Select Agents For World-Model­ing?

Thane Ruthenis23 Jul 2022 19:27 UTC
24 points
1 comment12 min readLW link

AGI Safety Needs Peo­ple With All Skil­lsets!

Severin T. Seehrich25 Jul 2022 13:32 UTC
28 points
0 comments2 min readLW link

Con­jec­ture: In­ter­nal In­fo­haz­ard Policy

29 Jul 2022 19:07 UTC
119 points
6 comments19 min readLW link

Hu­mans Reflect­ing on HRH

leogao29 Jul 2022 21:56 UTC
20 points
4 comments2 min readLW link

[Question] Would “Man­hat­tan Pro­ject” style be benefi­cial or dele­te­ri­ous for AI Align­ment?

Valentin20264 Aug 2022 19:12 UTC
5 points
1 comment1 min readLW link

Con­ver­gence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
37 points
1 comment13 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
143 points
30 comments3 min readLW link

For­mal­iz­ing Alignment

Marv K10 Aug 2022 18:50 UTC
3 points
0 comments2 min readLW link

My sum­mary of the al­ign­ment problem

Peter Hroššo11 Aug 2022 19:42 UTC
16 points
3 comments2 min readLW link
(threadreaderapp.com)

Ar­tifi­cial in­tel­li­gence wireheading

Big Tony12 Aug 2022 3:06 UTC
3 points
2 comments1 min readLW link

In­fant AI Scenario

Nathan112312 Aug 2022 21:20 UTC
1 point
0 comments3 min readLW link

Gra­di­ent de­scent doesn’t se­lect for in­ner search

Ivan Vendrov13 Aug 2022 4:15 UTC
36 points
23 comments4 min readLW link

No short­cuts to knowl­edge: Why AI needs to ease up on scal­ing and learn how to code

Yldedly15 Aug 2022 8:42 UTC
4 points
0 comments1 min readLW link
(deoxyribose.github.io)

Mesa-op­ti­miza­tion for goals defined only within a train­ing en­vi­ron­ment is dangerous

Rubi J. Hudson17 Aug 2022 3:56 UTC
6 points
2 comments4 min readLW link

The longest train­ing run

17 Aug 2022 17:18 UTC
68 points
11 comments9 min readLW link
(epochai.org)

Matt Ygle­sias on AI Policy

Grant Demaree17 Aug 2022 23:57 UTC
25 points
1 comment1 min readLW link
(www.slowboring.com)

Epistemic Arte­facts of (con­cep­tual) AI al­ign­ment research

19 Aug 2022 17:18 UTC
30 points
1 comment5 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk2718217 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

Bench­mark­ing Pro­pos­als on Risk Scenarios

Paul Bricman20 Aug 2022 10:01 UTC
25 points
2 comments14 min readLW link

The ‘Bit­ter Les­son’ is Wrong

deepthoughtlife20 Aug 2022 16:15 UTC
−9 points
14 comments2 min readLW link

My Plan to Build Aligned Superintelligence

apollonianblues21 Aug 2022 13:16 UTC
18 points
7 comments8 min readLW link

Beliefs and Disagree­ments about Au­tomat­ing Align­ment Research

Ian McKenzie24 Aug 2022 18:37 UTC
92 points
4 comments7 min readLW link

Google AI in­te­grates PaLM with robotics: SayCan up­date [Linkpost]

Evan R. Murphy24 Aug 2022 20:54 UTC
25 points
0 comments1 min readLW link
(sites.research.google)

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
33 comments2 min readLW link

[Question] What would you ex­pect a mas­sive mul­ti­modal on­line fed­er­ated learner to be ca­pa­ble of?

Aryeh Englander27 Aug 2022 17:31 UTC
13 points
4 comments1 min readLW link

(My un­der­stand­ing of) What Every­one in Tech­ni­cal Align­ment is Do­ing and Why

29 Aug 2022 1:23 UTC
345 points
83 comments38 min readLW link

Break­ing down the train­ing/​de­ploy­ment dichotomy

Erik Jenner28 Aug 2022 21:45 UTC
29 points
4 comments3 min readLW link

Strat­egy For Con­di­tion­ing Gen­er­a­tive Models

1 Sep 2022 4:34 UTC
28 points
4 comments18 min readLW link

Gra­di­ent Hacker De­sign Prin­ci­ples From Biology

johnswentworth1 Sep 2022 19:03 UTC
52 points
13 comments3 min readLW link

No, hu­man brains are not (much) more effi­cient than computers

Jesse Hoogland6 Sep 2022 13:53 UTC
19 points
16 comments4 min readLW link
(www.jessehoogland.com)

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

Gen­er­a­tors Of Disagree­ment With AI Alignment

George3d67 Sep 2022 18:15 UTC
26 points
9 comments9 min readLW link
(www.epistem.ink)

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
43 points
3 comments14 min readLW link

We may be able to see sharp left turns coming

3 Sep 2022 2:55 UTC
50 points
26 comments1 min readLW link

Gate­keeper Vic­tory: AI Box Reflection

9 Sep 2022 21:38 UTC
4 points
5 comments9 min readLW link

Can you force a neu­ral net­work to keep gen­er­al­iz­ing?

Q Home12 Sep 2022 10:14 UTC
2 points
10 comments5 min readLW link

Align­ment via proso­cial brain algorithms

Cameron Berg12 Sep 2022 13:48 UTC
42 points
28 comments6 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasper12 Sep 2022 19:07 UTC
96 points
7 comments2 min readLW link
(arxiv.org)

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

[Question] Are Speed Su­per­in­tel­li­gences Fea­si­ble for Modern ML Tech­niques?

DragonGod14 Sep 2022 12:59 UTC
8 points
5 comments1 min readLW link

The Defen­der’s Ad­van­tage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

When does tech­ni­cal work to re­duce AGI con­flict make a differ­ence?: Introduction

14 Sep 2022 19:38 UTC
42 points
3 comments6 min readLW link

ACT-1: Trans­former for Actions

Daniel Kokotajlo14 Sep 2022 19:09 UTC
52 points
4 comments1 min readLW link
(www.adept.ai)

[Question] Fore­cast­ing thread: How does AI risk level vary based on timelines?

elifland14 Sep 2022 23:56 UTC
33 points
7 comments1 min readLW link

Gen­eral ad­vice for tran­si­tion­ing into The­o­ret­i­cal AI Safety

Martín Soto15 Sep 2022 5:23 UTC
9 points
0 comments10 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
48 points
12 comments13 min readLW link

Un­der­stand­ing Con­jec­ture: Notes from Con­nor Leahy interview

Akash15 Sep 2022 18:37 UTC
103 points
24 comments15 min readLW link

or­der­ing ca­pa­bil­ity thresholds

Tamsin Leake16 Sep 2022 16:36 UTC
27 points
0 comments4 min readLW link
(carado.moe)

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

Katja Grace on Slow­ing Down AI, AI Ex­pert Sur­veys And Es­ti­mat­ing AI Risk

Michaël Trazzi16 Sep 2022 17:45 UTC
40 points
2 comments3 min readLW link
(theinsideview.ai)

Sum­maries: Align­ment Fun­da­men­tals Curriculum

Leon Lang18 Sep 2022 13:08 UTC
43 points
3 comments1 min readLW link
(docs.google.com)

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John Nay18 Sep 2022 20:39 UTC
11 points
0 comments3 min readLW link
(forum.effectivealtruism.org)

Align­ment Org Cheat Sheet

20 Sep 2022 17:36 UTC
63 points
6 comments4 min readLW link

Public-fac­ing Cen­sor­ship Is Safety Theater, Caus­ing Rep­u­ta­tional Da­m­age

Yitz23 Sep 2022 5:08 UTC
144 points
42 comments6 min readLW link

Nearcast-based “de­ploy­ment prob­lem” analysis

HoldenKarnofsky21 Sep 2022 18:52 UTC
78 points
2 comments26 min readLW link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 3:48 UTC
34 points
4 comments1 min readLW link
(www.youtube.com)

Un­der­stand­ing In­fra-Bayesi­anism: A Begin­ner-Friendly Video Series

22 Sep 2022 13:25 UTC
114 points
6 comments2 min readLW link

In­ter­lude: But Who Op­ti­mizes The Op­ti­mizer?

Paul Bricman23 Sep 2022 15:30 UTC
15 points
0 comments10 min readLW link

[Question] What Do AI Safety Pitches Not Get About Your Field?

Aris22 Sep 2022 21:27 UTC
28 points
3 comments1 min readLW link

Let’s Com­pare Notes

Shoshannah Tekofsky22 Sep 2022 20:47 UTC
17 points
3 comments6 min readLW link

Brain-over-body bi­ases, and the em­bod­ied value prob­lem in AI alignment

geoffreymiller24 Sep 2022 22:24 UTC
10 points
6 comments25 min readLW link

Brief Notes on Transformers

Adam Jermyn26 Sep 2022 14:46 UTC
32 points
2 comments2 min readLW link

You are Un­der­es­ti­mat­ing The Like­li­hood That Con­ver­gent In­stru­men­tal Sub­goals Lead to Aligned AGI

Mark Neyer26 Sep 2022 14:22 UTC
3 points
6 comments3 min readLW link

7 traps that (we think) new al­ign­ment re­searchers of­ten fall into

27 Sep 2022 23:13 UTC
157 points
10 comments4 min readLW link

Threat-Re­sis­tant Bar­gain­ing Me­ga­post: In­tro­duc­ing the ROSE Value

Diffractor28 Sep 2022 1:20 UTC
89 points
11 comments53 min readLW link

Failure modes in a shard the­ory al­ign­ment plan

Thomas Kwa27 Sep 2022 22:34 UTC
24 points
2 comments7 min readLW link

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin Pope28 Sep 2022 16:02 UTC
47 points
2 comments10 min readLW link

[Question] What’s the ac­tual ev­i­dence that AI mar­ket­ing tools are chang­ing prefer­ences in a way that makes them eas­ier to pre­dict?

Emrik1 Oct 2022 15:21 UTC
10 points
7 comments1 min readLW link

[Question] Any fur­ther work on AI Safety Suc­cess Sto­ries?

Krieger2 Oct 2022 9:53 UTC
7 points
6 comments1 min readLW link

AI Timelines via Cu­mu­la­tive Op­ti­miza­tion Power: Less Long, More Short

jacob_cannell6 Oct 2022 0:21 UTC
111 points
32 comments6 min readLW link

con­fu­sion about al­ign­ment requirements

Tamsin Leake6 Oct 2022 10:32 UTC
28 points
10 comments3 min readLW link
(carado.moe)

Good on­tolo­gies in­duce com­mu­ta­tive diagrams

Erik Jenner9 Oct 2022 0:06 UTC
40 points
5 comments14 min readLW link

Un­con­trol­lable AI as an Ex­is­ten­tial Risk

Karl von Wendt9 Oct 2022 10:36 UTC
19 points
0 comments20 min readLW link

Ob­jects in Mir­ror Are Closer Than They Ap­pear...

Vestozia11 Oct 2022 4:34 UTC
2 points
7 comments9 min readLW link

Misal­ign­ment Harms Can Be Caused by Low In­tel­li­gence Systems

DialecticEel11 Oct 2022 13:39 UTC
11 points
3 comments1 min readLW link

Build­ing a trans­former from scratch—AI safety up-skil­ling challenge

Marius Hobbhahn12 Oct 2022 15:40 UTC
42 points
1 comment5 min readLW link

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

12 Oct 2022 21:25 UTC
49 points
11 comments4 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
35 points
7 comments4 min readLW link

Re­sponse to Katja Grace’s AI x-risk counterarguments

19 Oct 2022 1:17 UTC
75 points
18 comments15 min readLW link

[Question] What Does AI Align­ment Suc­cess Look Like?

Shmi20 Oct 2022 0:32 UTC
23 points
7 comments1 min readLW link

AI Re­search Pro­gram Pre­dic­tion Markets

tailcalled20 Oct 2022 13:42 UTC
38 points
10 comments1 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John Nay21 Oct 2022 2:03 UTC
3 points
18 comments54 min readLW link

Im­proved Se­cu­rity to Prevent Hacker-AI and Digi­tal Ghosts

Erland Wittkotter21 Oct 2022 10:11 UTC
4 points
3 comments12 min readLW link

What will the scaled up GATO look like? (Up­dated with ques­tions)

Amal 25 Oct 2022 12:44 UTC
33 points
20 comments1 min readLW link

In­tent al­ign­ment should not be the goal for AGI x-risk reduction

John Nay26 Oct 2022 1:24 UTC
−6 points
10 comments3 min readLW link

Re­sources that (I think) new al­ign­ment re­searchers should know about

Akash28 Oct 2022 22:13 UTC
69 points
8 comments4 min readLW link

Boundaries vs Frames

Scott Garrabrant31 Oct 2022 15:14 UTC
47 points
7 comments7 min readLW link

Ad­ver­sar­ial Poli­cies Beat Pro­fes­sional-Level Go AIs

sanxiyn3 Nov 2022 13:27 UTC
31 points
35 comments1 min readLW link
(goattack.alignmentfund.org)

The Sin­gu­lar Value De­com­po­si­tions of Trans­former Weight Ma­tri­ces are Highly Interpretable

28 Nov 2022 12:54 UTC
159 points
27 comments31 min readLW link

Sim­ple Way to Prevent Power-Seek­ing AI

research_prime_space7 Dec 2022 0:26 UTC
7 points
1 comment1 min readLW link

You can still fetch the coffee to­day if you’re dead tomorrow

davidad9 Dec 2022 14:06 UTC
58 points
15 comments5 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
22 points
2 comments11 min readLW link

Real­ism about rationality

Richard_Ngo16 Sep 2018 10:46 UTC
180 points
145 comments4 min readLW link3 reviews
(thinkingcomplete.blogspot.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC
205 points
60 comments15 min readLW link2 reviews

The Parable of Pre­dict-O-Matic

abramdemski15 Oct 2019 0:49 UTC
291 points
42 comments14 min readLW link2 reviews

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks18 Dec 2018 4:46 UTC
190 points
26 comments62 min readLW link1 review

An Ortho­dox Case Against Utility Functions

abramdemski7 Apr 2020 19:18 UTC
128 points
53 comments8 min readLW link2 reviews

“How con­ser­va­tive” should the par­tial max­imisers be?

Stuart_Armstrong13 Apr 2020 15:50 UTC
30 points
8 comments2 min readLW link

[AN #95]: A frame­work for think­ing about how to make AI go well

Rohin Shah15 Apr 2020 17:10 UTC
20 points
2 comments10 min readLW link
(mailchi.mp)

AI Align­ment Pod­cast: An Overview of Tech­ni­cal AI Align­ment in 2018 and 2019 with Buck Sh­legeris and Ro­hin Shah

Palus Astra16 Apr 2020 0:50 UTC
58 points
27 comments89 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
81 points
70 comments2 min readLW link1 review

Disen­tan­gling ar­gu­ments for the im­por­tance of AI safety

Richard_Ngo21 Jan 2019 12:41 UTC
129 points
23 comments8 min readLW link

In­te­grat­ing Hid­den Vari­ables Im­proves Approximation

johnswentworth16 Apr 2020 21:43 UTC
15 points
4 comments1 min readLW link

AI Ser­vices as a Re­search Paradigm

VojtaKovarik20 Apr 2020 13:00 UTC
30 points
12 comments4 min readLW link
(docs.google.com)

Databases of hu­man be­havi­our and prefer­ences?

Stuart_Armstrong21 Apr 2020 18:06 UTC
10 points
9 comments1 min readLW link

Critch on ca­reer ad­vice for ju­nior AI-x-risk-con­cerned researchers

Rob Bensinger12 May 2018 2:13 UTC
117 points
25 comments4 min readLW link

Refram­ing Impact

TurnTrout20 Sep 2019 19:03 UTC
90 points
15 comments3 min readLW link1 review

De­scrip­tion vs simu­lated prediction

Richard Korzekwa 22 Apr 2020 16:40 UTC
26 points
0 comments5 min readLW link
(aiimpacts.org)

Deep­Mind team on speci­fi­ca­tion gaming

JoshuaFox23 Apr 2020 8:01 UTC
30 points
2 comments1 min readLW link
(deepmind.com)

[Question] Does Agent-like Be­hav­ior Im­ply Agent-like Ar­chi­tec­ture?

Scott Garrabrant23 Aug 2019 2:01 UTC
54 points
7 comments1 min readLW link

Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work

7 Jun 2019 19:53 UTC
78 points
4 comments6 min readLW link

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
97 points
11 comments17 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
99 points
17 comments13 min readLW link

How the MtG Color Wheel Ex­plains AI Safety

Scott Garrabrant15 Feb 2019 23:42 UTC
57 points
4 comments6 min readLW link

[Question] How does Gra­di­ent Des­cent In­ter­act with Good­hart?

Scott Garrabrant2 Feb 2019 0:14 UTC
68 points
19 comments4 min readLW link

For­mal Open Prob­lem in De­ci­sion Theory

Scott Garrabrant29 Nov 2018 3:25 UTC
35 points
11 comments4 min readLW link

The Ubiquitous Con­verse Law­vere Problem

Scott Garrabrant29 Nov 2018 3:16 UTC
21 points
0 comments2 min readLW link

Embed­ded Curiosities

8 Nov 2018 14:19 UTC
88 points
1 comment2 min readLW link

Sub­sys­tem Alignment

6 Nov 2018 16:16 UTC
100 points
12 comments1 min readLW link

Ro­bust Delegation

4 Nov 2018 16:38 UTC
110 points
10 comments1 min readLW link

Embed­ded World-Models

2 Nov 2018 16:07 UTC
87 points
16 comments1 min readLW link

De­ci­sion Theory

31 Oct 2018 18:41 UTC
114 points
46 comments1 min readLW link

(A → B) → A

Scott Garrabrant11 Sep 2018 22:38 UTC
62 points
11 comments2 min readLW link

His­tory of the Devel­op­ment of Log­i­cal Induction

Scott Garrabrant29 Aug 2018 3:15 UTC
89 points
4 comments5 min readLW link

Op­ti­miza­tion Amplifies

Scott Garrabrant27 Jun 2018 1:51 UTC
98 points
12 comments4 min readLW link

What makes coun­ter­fac­tu­als com­pa­rable?

Chris_Leong24 Apr 2020 22:47 UTC
11 points
6 comments3 min readLW link

New Paper Ex­pand­ing on the Good­hart Taxonomy

Scott Garrabrant14 Mar 2018 9:01 UTC
17 points
4 comments1 min readLW link
(arxiv.org)

Sources of in­tu­itions and data on AGI

Scott Garrabrant31 Jan 2018 23:30 UTC
84 points
26 comments3 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
52 points
7 comments6 min readLW link

AI pre­dic­tion case study 5: Omo­hun­dro’s AI drives

Stuart_Armstrong15 Mar 2013 9:09 UTC
10 points
5 comments8 min readLW link

Toy model: con­ver­gent in­stru­men­tal goals

Stuart_Armstrong25 Feb 2016 14:03 UTC
15 points
2 comments4 min readLW link

AI-cre­ated pseudo-deontology

Stuart_Armstrong12 Feb 2015 21:11 UTC
10 points
35 comments1 min readLW link

Eth­i­cal Injunctions

Eliezer Yudkowsky20 Oct 2008 23:00 UTC
66 points
76 comments9 min readLW link

Mo­ti­vat­ing Ab­strac­tion-First De­ci­sion Theory

johnswentworth29 Apr 2020 17:47 UTC
42 points
16 comments5 min readLW link

[AN #97]: Are there his­tor­i­cal ex­am­ples of large, ro­bust dis­con­ti­nu­ities?

Rohin Shah29 Apr 2020 17:30 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

My Up­dat­ing Thoughts on AI policy

Ben Pace1 Mar 2020 7:06 UTC
20 points
1 comment9 min readLW link

Use­ful Does Not Mean Secure

Ben Pace30 Nov 2019 2:05 UTC
46 points
12 comments11 min readLW link

[Question] What is the al­ter­na­tive to in­tent al­ign­ment called?

Richard_Ngo30 Apr 2020 2:16 UTC
12 points
6 comments1 min readLW link

Op­ti­mis­ing So­ciety to Con­strain Risk of War from an Ar­tifi­cial Su­per­in­tel­li­gence

JohnCDraper30 Apr 2020 10:47 UTC
3 points
1 comment51 min readLW link

Stan­ford En­cy­clo­pe­dia of Philos­o­phy on AI ethics and superintelligence

Kaj_Sotala2 May 2020 7:35 UTC
43 points
19 comments7 min readLW link
(plato.stanford.edu)

[Question] How does iter­ated am­plifi­ca­tion ex­ceed hu­man abil­ities?

riceissa2 May 2020 23:44 UTC
19 points
9 comments2 min readLW link

How uniform is the neo­cor­tex?

zhukeepa4 May 2020 2:16 UTC
78 points
23 comments11 min readLW link1 review

Scott Garrabrant’s prob­lem on re­cov­er­ing Brouwer as a corol­lary of Lawvere

Rupert4 May 2020 10:01 UTC
26 points
2 comments2 min readLW link

“AI and Effi­ciency”, OA (44✕ im­prove­ment in CNNs since 2012)

gwern5 May 2020 16:32 UTC
47 points
0 comments1 min readLW link
(openai.com)

Com­pet­i­tive safety via gra­dated curricula

Richard_Ngo5 May 2020 18:11 UTC
38 points
5 comments5 min readLW link

Model­ing nat­u­ral­ized de­ci­sion prob­lems in lin­ear logic

jessicata6 May 2020 0:15 UTC
14 points
2 comments6 min readLW link
(unstableontology.com)

[AN #98]: Un­der­stand­ing neu­ral net train­ing by see­ing which gra­di­ents were helpful

Rohin Shah6 May 2020 17:10 UTC
22 points
3 comments9 min readLW link
(mailchi.mp)

[Question] Is AI safety re­search less par­alleliz­able than AI re­search?

Mati_Roy10 May 2020 20:43 UTC
9 points
5 comments1 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven Byrnes26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Wire­head­ing is in the eye of the beholder

Stuart_Armstrong30 Jan 2019 18:23 UTC
26 points
10 comments1 min readLW link

Wire­head­ing as a po­ten­tial prob­lem with the new im­pact measure

Stuart_Armstrong25 Sep 2018 14:15 UTC
25 points
20 comments4 min readLW link

Wire­head­ing and discontinuity

Michele Campolo18 Feb 2020 10:49 UTC
21 points
4 comments3 min readLW link

[AN #99]: Dou­bling times for the effi­ciency of AI algorithms

Rohin Shah13 May 2020 17:20 UTC
29 points
0 comments10 min readLW link
(mailchi.mp)

How should AIs up­date a prior over hu­man prefer­ences?

Stuart_Armstrong15 May 2020 13:14 UTC
17 points
9 comments2 min readLW link

Con­jec­ture Workshop

johnswentworth15 May 2020 22:41 UTC
34 points
2 comments2 min readLW link

Multi-agent safety

Richard_Ngo16 May 2020 1:59 UTC
31 points
8 comments5 min readLW link

The Mechanis­tic and Nor­ma­tive Struc­ture of Agency

Gordon Seidoh Worley18 May 2020 16:03 UTC
15 points
4 comments1 min readLW link
(philpapers.org)

“Star­wink” by Alicorn

Zack_M_Davis18 May 2020 8:17 UTC
44 points
1 comment1 min readLW link
(alicorn.elcenia.com)

[AN #100]: What might go wrong if you learn a re­ward func­tion while acting

Rohin Shah20 May 2020 17:30 UTC
33 points
2 comments12 min readLW link
(mailchi.mp)

Prob­a­bil­ities, weights, sums: pretty much the same for re­ward functions

Stuart_Armstrong20 May 2020 15:19 UTC
11 points
1 comment2 min readLW link

[Question] Source code size vs learned model size in ML and in hu­mans?

riceissa20 May 2020 8:47 UTC
11 points
6 comments1 min readLW link

Com­par­ing re­ward learn­ing/​re­ward tam­per­ing formalisms

Stuart_Armstrong21 May 2020 12:03 UTC
9 points
3 comments3 min readLW link

AGIs as collectives

Richard_Ngo22 May 2020 20:36 UTC
22 points
23 comments4 min readLW link

[AN #101]: Why we should rigor­ously mea­sure and fore­cast AI progress

Rohin Shah27 May 2020 17:20 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

AI Safety Dis­cus­sion Days

Linda Linsefors27 May 2020 16:54 UTC
13 points
1 comment3 min readLW link

Build­ing brain-in­spired AGI is in­finitely eas­ier than un­der­stand­ing the brain

Steven Byrnes2 Jun 2020 14:13 UTC
51 points
14 comments7 min readLW link

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

GPT-3: A Summary

leogao2 Jun 2020 18:14 UTC
20 points
0 comments1 min readLW link
(leogao.dev)

Inac­cessible information

paulfchristiano3 Jun 2020 5:10 UTC
84 points
17 comments14 min readLW link2 reviews
(ai-alignment.com)

[AN #102]: Meta learn­ing by GPT-3, and a list of full pro­pos­als for AI alignment

Rohin Shah3 Jun 2020 17:20 UTC
38 points
6 comments10 min readLW link
(mailchi.mp)

Feed­back is cen­tral to agency

Alex Flint1 Jun 2020 12:56 UTC
28 points
1 comment3 min readLW link

Think­ing About Su­per-Hu­man AI: An Ex­am­i­na­tion of Likely Paths and Ul­ti­mate Constitution

meanderingmoose4 Jun 2020 23:22 UTC
−3 points
0 comments7 min readLW link

Emer­gence and Con­trol: An ex­am­i­na­tion of our abil­ity to gov­ern the be­hav­ior of in­tel­li­gent systems

meanderingmoose5 Jun 2020 17:10 UTC
1 point
0 comments6 min readLW link

GAN Discrim­i­na­tors Don’t Gen­er­al­ize?

tryactions8 Jun 2020 20:36 UTC
18 points
7 comments2 min readLW link

More on dis­am­biguat­ing “dis­con­ti­nu­ity”

Aryeh Englander9 Jun 2020 15:16 UTC
16 points
1 comment3 min readLW link

[AN #103]: ARCHES: an agenda for ex­is­ten­tial safety, and com­bin­ing nat­u­ral lan­guage with deep RL

Rohin Shah10 Jun 2020 17:20 UTC
27 points
1 comment10 min readLW link
(mailchi.mp)

Dutch-Book­ing CDT: Re­vised Argument

abramdemski27 Oct 2020 4:31 UTC
50 points
22 comments16 min readLW link

[Question] List of pub­lic pre­dic­tions of what GPT-X can or can’t do?

Daniel Kokotajlo14 Jun 2020 14:25 UTC
20 points
9 comments1 min readLW link

Achiev­ing AI al­ign­ment through de­liber­ate un­cer­tainty in mul­ti­a­gent systems

Florian Dietz15 Jun 2020 12:19 UTC
3 points
10 comments7 min readLW link

Su­per­ex­po­nen­tial His­toric Growth, by David Roodman

Ben Pace15 Jun 2020 21:49 UTC
43 points
6 comments5 min readLW link
(www.openphilanthropy.org)

Re­lat­ing HCH and Log­i­cal Induction

abramdemski16 Jun 2020 22:08 UTC
47 points
4 comments5 min readLW link

Image GPT

Daniel Kokotajlo18 Jun 2020 11:41 UTC
29 points
27 comments1 min readLW link
(openai.com)

[AN #104]: The per­ils of in­ac­cessible in­for­ma­tion, and what we can learn about AI al­ign­ment from COVID

Rohin Shah18 Jun 2020 17:10 UTC
19 points
5 comments8 min readLW link
(mailchi.mp)

[Question] If AI is based on GPT, how to en­sure its safety?

avturchin18 Jun 2020 20:33 UTC
20 points
11 comments1 min readLW link

What’s Your Cog­ni­tive Al­gorithm?

Raemon18 Jun 2020 22:16 UTC
71 points
23 comments13 min readLW link

Rele­vant pre-AGI possibilities

Daniel Kokotajlo20 Jun 2020 10:52 UTC
38 points
7 comments19 min readLW link
(aiimpacts.org)

Plau­si­ble cases for HRAD work, and lo­cat­ing the crux in the “re­al­ism about ra­tio­nal­ity” debate

riceissa22 Jun 2020 1:10 UTC
85 points
15 comments10 min readLW link

The In­dex­ing Problem

johnswentworth22 Jun 2020 19:11 UTC
35 points
2 comments4 min readLW link

[Question] Re­quest­ing feed­back/​ad­vice: what Type The­ory to study for AI safety?

rvnnt23 Jun 2020 17:03 UTC
7 points
4 comments3 min readLW link

Lo­cal­ity of goals

adamShimi22 Jun 2020 21:56 UTC
16 points
8 comments6 min readLW link

[Question] What is “In­stru­men­tal Cor­rigi­bil­ity”?

joebernstein23 Jun 2020 20:24 UTC
4 points
1 comment1 min readLW link

Models, myths, dreams, and Cheshire cat grins

Stuart_Armstrong24 Jun 2020 10:50 UTC
21 points
7 comments2 min readLW link

[AN #105]: The eco­nomic tra­jec­tory of hu­man­ity, and what we might mean by optimization

Rohin Shah24 Jun 2020 17:30 UTC
24 points
3 comments11 min readLW link
(mailchi.mp)

There’s an Awe­some AI Ethics List and it’s a lit­tle thin

AABoyles25 Jun 2020 13:43 UTC
13 points
1 comment1 min readLW link
(github.com)

GPT-3 Fic­tion Samples

gwern25 Jun 2020 16:12 UTC
63 points
18 comments1 min readLW link
(www.gwern.net)

Walk­through: The Trans­former Ar­chi­tec­ture [Part 1/​2]

Matthew Barnett30 Jul 2019 13:54 UTC
35 points
0 comments6 min readLW link

Ro­bust­ness as a Path to AI Alignment

abramdemski10 Oct 2017 8:14 UTC
45 points
9 comments9 min readLW link

Rad­i­cal Prob­a­bil­ism [Tran­script]

26 Jun 2020 22:14 UTC
46 points
12 comments6 min readLW link

AI safety via mar­ket making

evhub26 Jun 2020 23:07 UTC
55 points
45 comments11 min readLW link

[Question] Have gen­eral de­com­posers been for­mal­ized?

Quinn27 Jun 2020 18:09 UTC
8 points
5 comments1 min readLW link

Gary Mar­cus vs Cor­ti­cal Uniformity

Steven Byrnes28 Jun 2020 18:18 UTC
18 points
0 comments8 min readLW link

Web AI dis­cus­sion Groups

Donald Hobson30 Jun 2020 11:22 UTC
11 points
0 comments2 min readLW link

Com­par­ing AI Align­ment Ap­proaches to Min­i­mize False Pos­i­tive Risk

Gordon Seidoh Worley30 Jun 2020 19:34 UTC
5 points
0 comments9 min readLW link

AvE: As­sis­tance via Empowerment

FactorialCode30 Jun 2020 22:07 UTC
12 points
1 comment1 min readLW link
(arxiv.org)

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

[AN #106]: Eval­u­at­ing gen­er­al­iza­tion abil­ity of learned re­ward models

Rohin Shah1 Jul 2020 17:20 UTC
14 points
2 comments11 min readLW link
(mailchi.mp)

The “AI De­bate” Debate

michaelcohen2 Jul 2020 10:16 UTC
20 points
20 comments3 min readLW link

Idea: Imi­ta­tion/​Value Learn­ing AIXI

Past Account3 Jul 2020 17:10 UTC
3 points
6 comments1 min readLW link

Split­ting De­bate up into Two Subsystems

Nandi3 Jul 2020 20:11 UTC
13 points
5 comments4 min readLW link

AI Un­safety via Non-Zero-Sum Debate

VojtaKovarik3 Jul 2020 22:03 UTC
25 points
10 comments5 min readLW link

Clas­sify­ing games like the Pri­soner’s Dilemma

philh4 Jul 2020 17:10 UTC
100 points
28 comments6 min readLW link1 review
(reasonableapproximation.net)

AI-Feyn­man as a bench­mark for what we should be aiming for

Faustus24 Jul 2020 9:24 UTC
8 points
1 comment2 min readLW link

Learn­ing the prior

paulfchristiano5 Jul 2020 21:00 UTC
79 points
29 comments8 min readLW link
(ai-alignment.com)

Bet­ter pri­ors as a safety problem

paulfchristiano5 Jul 2020 21:20 UTC
64 points
7 comments5 min readLW link
(ai-alignment.com)

[Question] How far is AGI?

Roko Jelavić5 Jul 2020 17:58 UTC
6 points
5 comments1 min readLW link

Clas­sify­ing speci­fi­ca­tion prob­lems as var­i­ants of Good­hart’s Law

Vika19 Aug 2019 20:40 UTC
70 points
5 comments5 min readLW link1 review

New safety re­search agenda: scal­able agent al­ign­ment via re­ward modeling

Vika20 Nov 2018 17:29 UTC
34 points
13 comments1 min readLW link
(medium.com)

De­sign­ing agent in­cen­tives to avoid side effects

11 Mar 2019 20:55 UTC
29 points
0 comments2 min readLW link
(medium.com)

Dis­cus­sion on the ma­chine learn­ing ap­proach to AI safety

Vika1 Nov 2018 20:54 UTC
26 points
3 comments4 min readLW link

Speci­fi­ca­tion gam­ing ex­am­ples in AI

Vika3 Apr 2018 12:30 UTC
43 points
9 comments1 min readLW link2 reviews

[Question] (an­swered: yes) Has any­one writ­ten up a con­sid­er­a­tion of Downs’s “Para­dox of Vot­ing” from the per­spec­tive of MIRI-ish de­ci­sion the­o­ries (UDT, FDT, or even just EDT)?

Jameson Quinn6 Jul 2020 18:26 UTC
10 points
24 comments1 min readLW link

New Deep­Mind AI Safety Re­search Blog

Vika27 Sep 2018 16:28 UTC
43 points
0 comments1 min readLW link
(medium.com)

Con­test: $1,000 for good ques­tions to ask to an Or­a­cle AI

Stuart_Armstrong31 Jul 2019 18:48 UTC
57 points
156 comments3 min readLW link

De­con­fus­ing Hu­man Values Re­search Agenda v1

Gordon Seidoh Worley23 Mar 2020 16:25 UTC
27 points
12 comments4 min readLW link

[Question] How “hon­est” is GPT-3?

abramdemski8 Jul 2020 19:38 UTC
72 points
18 comments5 min readLW link

What does it mean to ap­ply de­ci­sion the­ory?

abramdemski8 Jul 2020 20:31 UTC
51 points
5 comments8 min readLW link

AI Re­search Con­sid­er­a­tions for Hu­man Ex­is­ten­tial Safety (ARCHES)

habryka9 Jul 2020 2:49 UTC
60 points
8 comments1 min readLW link
(arxiv.org)

The Un­rea­son­able Effec­tive­ness of Deep Learning

Richard_Ngo30 Sep 2018 15:48 UTC
85 points
5 comments13 min readLW link
(thinkingcomplete.blogspot.com)

mAIry’s room: AI rea­son­ing to solve philo­soph­i­cal problems

Stuart_Armstrong5 Mar 2019 20:24 UTC
92 points
41 comments6 min readLW link2 reviews

Failures of an em­bod­ied AIXI

So8res15 Jun 2014 18:29 UTC
48 points
46 comments12 min readLW link

The Prob­lem with AIXI

Rob Bensinger18 Mar 2014 1:55 UTC
43 points
78 comments23 min readLW link

Ver­sions of AIXI can be ar­bi­trar­ily stupid

Stuart_Armstrong10 Aug 2015 13:23 UTC
29 points
59 comments1 min readLW link

Reflec­tive AIXI and Anthropics

Diffractor24 Sep 2018 2:15 UTC
17 points
13 comments8 min readLW link

AIXI and Ex­is­ten­tial Despair

paulfchristiano8 Dec 2011 20:03 UTC
23 points
38 comments6 min readLW link

How to make AIXI-tl in­ca­pable of learning

itaibn027 Jan 2014 0:05 UTC
7 points
5 comments2 min readLW link

Help re­quest: What is the Kol­mogorov com­plex­ity of com­putable ap­prox­i­ma­tions to AIXI?

AnnaSalamon5 Dec 2010 10:23 UTC
9 points
9 comments1 min readLW link

“AIXIjs: A Soft­ware Demo for Gen­eral Re­in­force­ment Learn­ing”, As­lanides 2017

gwern29 May 2017 21:09 UTC
7 points
1 comment1 min readLW link
(arxiv.org)

Can AIXI be trained to do any­thing a hu­man can?

Stuart_Armstrong20 Oct 2014 13:12 UTC
5 points
9 comments2 min readLW link

Shap­ing eco­nomic in­cen­tives for col­lab­o­ra­tive AGI

Kaj_Sotala29 Jun 2018 16:26 UTC
45 points
15 comments4 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_Sotala18 Mar 2018 10:30 UTC
19 points
4 comments2 min readLW link
(kajsotala.fi)

Some con­cep­tual high­lights from “Disjunc­tive Sce­nar­ios of Catas­trophic AI Risk”

Kaj_Sotala12 Feb 2018 12:30 UTC
33 points
4 comments6 min readLW link
(kajsotala.fi)

Mis­con­cep­tions about con­tin­u­ous takeoff

Matthew Barnett8 Oct 2019 21:31 UTC
79 points
38 comments4 min readLW link

Dist­in­guish­ing defi­ni­tions of takeoff

Matthew Barnett14 Feb 2020 0:16 UTC
60 points
6 comments6 min readLW link

Book re­view: Ar­tifi­cial In­tel­li­gence Safety and Security

PeterMcCluskey8 Dec 2018 3:47 UTC
27 points
3 comments8 min readLW link
(www.bayesianinvestor.com)

Why AI may not foom

John_Maxwell24 Mar 2013 8:11 UTC
29 points
81 comments12 min readLW link

Hu­mans Who Are Not Con­cen­trat­ing Are Not Gen­eral Intelligences

sarahconstantin25 Feb 2019 20:40 UTC
181 points
35 comments6 min readLW link1 review
(srconstantin.wordpress.com)

The Hacker Learns to Trust

Ben Pace22 Jun 2019 0:27 UTC
80 points
18 comments8 min readLW link
(medium.com)

Book Re­view: Hu­man Compatible

Scott Alexander31 Jan 2020 5:20 UTC
77 points
6 comments16 min readLW link
(slatestarcodex.com)

SSC Jour­nal Club: AI Timelines

Scott Alexander8 Jun 2017 19:00 UTC
12 points
15 comments8 min readLW link

Ar­gu­ments against my­opic training

Richard_Ngo9 Jul 2020 16:07 UTC
56 points
39 comments12 min readLW link

On mo­ti­va­tions for MIRI’s highly re­li­able agent de­sign research

jessicata29 Jan 2017 19:34 UTC
27 points
1 comment5 min readLW link

Why is the im­pact penalty time-in­con­sis­tent?

Stuart_Armstrong9 Jul 2020 17:26 UTC
16 points
1 comment2 min readLW link

My cur­rent take on the Paul-MIRI dis­agree­ment on al­ignabil­ity of messy AI

jessicata29 Jan 2017 20:52 UTC
21 points
0 comments10 min readLW link

Ben Go­ertzel: The Sin­gu­lar­ity In­sti­tute’s Scary Idea (and Why I Don’t Buy It)

Paul Crowley30 Oct 2010 9:31 UTC
42 points
442 comments1 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
54 points
45 comments8 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
54 points
9 comments4 min readLW link

A model I use when mak­ing plans to re­duce AI x-risk

Ben Pace19 Jan 2018 0:21 UTC
69 points
41 comments6 min readLW link

AI Re­searchers On AI Risk

Scott Alexander22 May 2015 11:16 UTC
18 points
0 comments16 min readLW link

Mini ad­vent cal­en­dar of Xrisks: Ar­tifi­cial Intelligence

Stuart_Armstrong7 Dec 2012 11:26 UTC
5 points
5 comments1 min readLW link

For FAI: Is “Molec­u­lar Nan­otech­nol­ogy” putting our best foot for­ward?

leplen22 Jun 2013 4:44 UTC
79 points
118 comments3 min readLW link

UFAI can­not be the Great Filter

Thrasymachus22 Dec 2012 11:26 UTC
59 points
92 comments3 min readLW link

Don’t Fear The Filter

Scott Alexander29 May 2014 0:45 UTC
11 points
18 comments6 min readLW link

The Great Filter is early, or AI is hard

Stuart_Armstrong29 Aug 2014 16:17 UTC
32 points
76 comments1 min readLW link

Talk: Key Is­sues In Near-Term AI Safety Research

Aryeh Englander10 Jul 2020 18:36 UTC
22 points
1 comment1 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

AlphaS­tar: Im­pres­sive for RL progress, not for AGI progress

orthonormal2 Nov 2019 1:50 UTC
113 points
58 comments2 min readLW link1 review

The Catas­trophic Con­ver­gence Conjecture

TurnTrout14 Feb 2020 21:16 UTC
44 points
15 comments8 min readLW link

[Question] How well can the GPT ar­chi­tec­ture solve the par­ity task?

FactorialCode11 Jul 2020 19:02 UTC
19 points
3 comments1 min readLW link

Sun­day July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stu­art_Armstrong

8 Jul 2020 0:27 UTC
19 points
2 comments1 min readLW link

[Link] Word-vec­tor based DL sys­tem achieves hu­man par­ity in ver­bal IQ tests

jacob_cannell13 Jun 2015 23:38 UTC
17 points
8 comments1 min readLW link

The Power of Intelligence

Eliezer Yudkowsky1 Jan 2007 20:00 UTC
66 points
4 comments4 min readLW link

Com­ments on CAIS

Richard_Ngo12 Jan 2019 15:20 UTC
76 points
14 comments7 min readLW link

[Question] What are CAIS’ bold­est near/​medium-term pre­dic­tions?

jacobjacob28 Mar 2019 13:14 UTC
31 points
17 comments1 min readLW link

Drexler on AI Risk

PeterMcCluskey1 Feb 2019 5:11 UTC
34 points
10 comments9 min readLW link
(www.bayesianinvestor.com)

Six AI Risk/​Strat­egy Ideas

Wei Dai27 Aug 2019 0:40 UTC
64 points
18 comments4 min readLW link1 review

New re­port: In­tel­li­gence Ex­plo­sion Microeconomics

Eliezer Yudkowsky29 Apr 2013 23:14 UTC
72 points
251 comments3 min readLW link

Book re­view: Hu­man Compatible

PeterMcCluskey19 Jan 2020 3:32 UTC
37 points
2 comments5 min readLW link
(www.bayesianinvestor.com)

Thoughts on “Hu­man-Com­pat­i­ble”

TurnTrout10 Oct 2019 5:24 UTC
63 points
35 comments5 min readLW link

Book Re­view: The AI Does Not Hate You

PeterMcCluskey28 Oct 2019 17:45 UTC
26 points
0 comments5 min readLW link
(www.bayesianinvestor.com)

[Link] Book Re­view: ‘The AI Does Not Hate You’ by Tom Chivers (Scott Aaron­son)

eigen7 Oct 2019 18:16 UTC
19 points
0 comments1 min readLW link

Book Re­view: Life 3.0: Be­ing Hu­man in the Age of Ar­tifi­cial Intelligence

J Thomas Moros18 Jan 2018 17:18 UTC
8 points
0 comments1 min readLW link
(ferocioustruth.com)

Book Re­view: Weapons of Math Destruction

Zvi4 Jun 2017 21:20 UTC
1 point
0 comments16 min readLW link

DARPA Digi­tal Tu­tor: Four Months to To­tal Tech­ni­cal Ex­per­tise?

JohnBuridan6 Jul 2020 23:34 UTC
200 points
19 comments7 min readLW link

Paper: Su­per­in­tel­li­gence as a Cause or Cure for Risks of Astro­nom­i­cal Suffering

Kaj_Sotala3 Jan 2018 14:39 UTC
1 point
6 comments1 min readLW link
(www.informatica.si)

Prevent­ing s-risks via in­dex­i­cal un­cer­tainty, acausal trade and dom­i­na­tion in the multiverse

avturchin27 Sep 2018 10:09 UTC
11 points
6 comments4 min readLW link

Pre­face to CLR’s Re­search Agenda on Co­op­er­a­tion, Con­flict, and TAI

JesseClifton13 Dec 2019 21:02 UTC
59 points
10 comments2 min readLW link

Sec­tions 1 & 2: In­tro­duc­tion, Strat­egy and Governance

JesseClifton17 Dec 2019 21:27 UTC
34 points
5 comments14 min readLW link

Sec­tions 3 & 4: Cred­i­bil­ity, Peace­ful Bar­gain­ing Mechanisms

JesseClifton17 Dec 2019 21:46 UTC
19 points
2 comments12 min readLW link

Sec­tions 5 & 6: Con­tem­po­rary Ar­chi­tec­tures, Hu­mans in the Loop

JesseClifton20 Dec 2019 3:52 UTC
27 points
4 comments10 min readLW link

Sec­tion 7: Foun­da­tions of Ra­tional Agency

JesseClifton22 Dec 2019 2:05 UTC
14 points
4 comments8 min readLW link

What counts as defec­tion?

TurnTrout12 Jul 2020 22:03 UTC
81 points
21 comments5 min readLW link1 review

The Com­mit­ment Races problem

Daniel Kokotajlo23 Aug 2019 1:58 UTC
122 points
39 comments5 min readLW link

Align­ment Newslet­ter #36

Rohin Shah12 Dec 2018 1:10 UTC
21 points
0 comments11 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #47

Rohin Shah4 Mar 2019 4:30 UTC
18 points
0 comments8 min readLW link
(mailchi.mp)

Un­der­stand­ing “Deep Dou­ble Des­cent”

evhub6 Dec 2019 0:00 UTC
135 points
51 comments5 min readLW link4 reviews

[LINK] Strong AI Startup Raises $15M

olalonde21 Aug 2012 20:47 UTC
24 points
13 comments1 min readLW link

An­nounc­ing the AI Align­ment Prize

cousin_it3 Nov 2017 15:47 UTC
95 points
78 comments1 min readLW link

I’m leav­ing AI al­ign­ment – you bet­ter stay

rmoehn12 Mar 2020 5:58 UTC
150 points
19 comments5 min readLW link

New pa­per: AGI Agent Safety by Iter­a­tively Im­prov­ing the Utility Function

Koen.Holtman15 Jul 2020 14:05 UTC
21 points
2 comments6 min readLW link

[Question] How should AI de­bate be judged?

abramdemski15 Jul 2020 22:20 UTC
49 points
27 comments6 min readLW link

Align­ment pro­pos­als and com­plex­ity classes

evhub16 Jul 2020 0:27 UTC
33 points
26 comments13 min readLW link

[AN #107]: The con­ver­gent in­stru­men­tal sub­goals of goal-di­rected agents

Rohin Shah16 Jul 2020 6:47 UTC
13 points
1 comment8 min readLW link
(mailchi.mp)

[AN #108]: Why we should scru­ti­nize ar­gu­ments for AI risk

Rohin Shah16 Jul 2020 6:47 UTC
19 points
6 comments12 min readLW link
(mailchi.mp)

En­vi­ron­ments as a bot­tle­neck in AGI development

Richard_Ngo17 Jul 2020 5:02 UTC
36 points
19 comments6 min readLW link

[Question] Can an agent use in­ter­ac­tive proofs to check the al­ign­ment of suc­ces­sors?

PabloAMC17 Jul 2020 19:07 UTC
7 points
2 comments1 min readLW link

Les­sons on AI Takeover from the conquistadors

17 Jul 2020 22:35 UTC
58 points
30 comments5 min readLW link

What Would I Do? Self-pre­dic­tion in Sim­ple Algorithms

Scott Garrabrant20 Jul 2020 4:27 UTC
54 points
13 comments5 min readLW link

Wri­teup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
94 points
18 comments33 min readLW link

Oper­a­tional­iz­ing Interpretability

lifelonglearner20 Jul 2020 5:22 UTC
20 points
0 comments4 min readLW link

Learn­ing Values in Practice

Stuart_Armstrong20 Jul 2020 18:38 UTC
24 points
0 comments5 min readLW link

Par­allels Between AI Safety by De­bate and Ev­i­dence Law

Cullen20 Jul 2020 22:52 UTC
10 points
1 comment2 min readLW link
(cullenokeefe.com)

The Redis­cov­ery of In­te­ri­or­ity in Ma­chine Learning

DanB21 Jul 2020 5:02 UTC
5 points
4 comments1 min readLW link
(danburfoot.net)

The “AI Dun­geons” Dragon Model is heav­ily path de­pen­dent (test­ing GPT-3 on ethics)

Rafael Harth21 Jul 2020 12:14 UTC
44 points
9 comments6 min readLW link

How good is hu­man­ity at co­or­di­na­tion?

Buck21 Jul 2020 20:01 UTC
78 points
44 comments3 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC
111 points
57 comments3 min readLW link

$1000 bounty for OpenAI to show whether GPT3 was “de­liber­ately” pre­tend­ing to be stupi­der than it is

jacobjacob21 Jul 2020 18:42 UTC
59 points
40 comments2 min readLW link
(twitter.com)

[Preprint] The Com­pu­ta­tional Limits of Deep Learning

Gordon Seidoh Worley21 Jul 2020 21:25 UTC
9 points
2 comments1 min readLW link
(arxiv.org)

[AN #109]: Teach­ing neu­ral nets to gen­er­al­ize the way hu­mans would

Rohin Shah22 Jul 2020 17:10 UTC
17 points
3 comments9 min readLW link
(mailchi.mp)

Re­search agenda for AI safety and a bet­ter civilization

agilecaveman22 Jul 2020 6:35 UTC
12 points
2 comments16 min readLW link

Weak HCH ac­cesses EXP

evhub22 Jul 2020 22:36 UTC
14 points
0 comments3 min readLW link

GPT-3 Gems

TurnTrout23 Jul 2020 0:46 UTC
33 points
10 comments48 min readLW link

Op­ti­miz­ing ar­bi­trary ex­pres­sions with a lin­ear num­ber of queries to a Log­i­cal In­duc­tion Or­a­cle (Car­toon Guide)

Donald Hobson23 Jul 2020 21:37 UTC
3 points
2 comments2 min readLW link

[Question] Con­struct a port­fo­lio to profit from AI progress.

sapphire25 Jul 2020 8:18 UTC
29 points
13 comments1 min readLW link

Think­ing soberly about the con­text and con­se­quences of Friendly AI

Mitchell_Porter16 Oct 2012 4:33 UTC
21 points
39 comments1 min readLW link

Goal re­ten­tion dis­cus­sion with Eliezer

Max Tegmark4 Sep 2014 22:23 UTC
93 points
26 comments6 min readLW link

[Question] Where do peo­ple dis­cuss do­ing things with GPT-3?

skybrian26 Jul 2020 14:31 UTC
2 points
7 comments1 min readLW link

You Can Prob­a­bly Am­plify GPT3 Directly

Past Account26 Jul 2020 21:58 UTC
34 points
14 comments6 min readLW link

[up­dated] how does gpt2's train­ing cor­pus cap­ture in­ter­net dis­cus­sion? not well

nostalgebraist27 Jul 2020 22:30 UTC
25 points
3 comments2 min readLW link
(nostalgebraist.tumblr.com)

Agen­tic Lan­guage Model Memes

FactorialCode1 Aug 2020 18:03 UTC
16 points
1 comment2 min readLW link

A com­mu­nity-cu­rated repos­i­tory of in­ter­est­ing GPT-3 stuff

Rudi C28 Jul 2020 14:16 UTC
8 points
0 comments1 min readLW link
(github.com)

[Question] Does the lot­tery ticket hy­poth­e­sis sug­gest the scal­ing hy­poth­e­sis?

Daniel Kokotajlo28 Jul 2020 19:52 UTC
14 points
17 comments1 min readLW link

[Question] To what ex­tent are the scal­ing prop­er­ties of Trans­former net­works ex­cep­tional?

abramdemski28 Jul 2020 20:06 UTC
30 points
1 comment1 min readLW link

[Question] What hap­pens to var­i­ance as neu­ral net­work train­ing is scaled? What does it im­ply about “lot­tery tick­ets”?

abramdemski28 Jul 2020 20:22 UTC
25 points
4 comments1 min readLW link

[Question] How will in­ter­net fo­rums like LW be able to defend against GPT-style spam?

ChristianKl28 Jul 2020 20:12 UTC
14 points
18 comments1 min readLW link

Pre­dic­tions for GPT-N

hippke29 Jul 2020 1:16 UTC
36 points
31 comments1 min readLW link

An­nounce­ment: AI al­ign­ment prize win­ners and next round

cousin_it15 Jan 2018 14:33 UTC
80 points
68 comments2 min readLW link

Jeff Hawk­ins on neu­ro­mor­phic AGI within 20 years

Steven Byrnes15 Jul 2019 19:16 UTC
167 points
24 comments12 min readLW link

Cas­cades, Cy­cles, In­sight...

Eliezer Yudkowsky24 Nov 2008 9:33 UTC
31 points
31 comments8 min readLW link

...Re­cur­sion, Magic

Eliezer Yudkowsky25 Nov 2008 9:10 UTC
27 points
28 comments5 min readLW link

Refer­ences & Re­sources for LessWrong

XiXiDu10 Oct 2010 14:54 UTC
153 points
106 comments20 min readLW link

[Question] A game de­signed to beat AI?

Long try17 Mar 2020 3:51 UTC
13 points
29 comments1 min readLW link

Truly Part Of You

Eliezer Yudkowsky21 Nov 2007 2:18 UTC
149 points
59 comments4 min readLW link

[AN #110]: Learn­ing fea­tures from hu­man feed­back to en­able re­ward learning

Rohin Shah29 Jul 2020 17:20 UTC
13 points
2 comments10 min readLW link
(mailchi.mp)

Struc­tured Tasks for Lan­guage Models

Past Account29 Jul 2020 14:17 UTC
5 points
0 comments1 min readLW link

En­gag­ing Se­ri­ously with Short Timelines

sapphire29 Jul 2020 19:21 UTC
43 points
23 comments3 min readLW link

What Failure Looks Like: Distill­ing the Discussion

Ben Pace29 Jul 2020 21:49 UTC
79 points
14 comments7 min readLW link

Learn­ing the prior and generalization

evhub29 Jul 2020 22:49 UTC
34 points
16 comments4 min readLW link

[Question] Is the work on AI al­ign­ment rele­vant to GPT?

Richard_Kennaway30 Jul 2020 12:23 UTC
20 points
5 comments1 min readLW link

Ver­ifi­ca­tion and Transparency

DanielFilan8 Aug 2019 1:50 UTC
34 points
6 comments2 min readLW link
(danielfilan.com)

Robin Han­son on Lump­iness of AI Services

DanielFilan17 Feb 2019 23:08 UTC
15 points
2 comments2 min readLW link
(www.overcomingbias.com)

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
34 points
0 comments11 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
61 points
28 comments1 min readLW link

Con­clu­sion to ‘Refram­ing Im­pact’

TurnTrout28 Feb 2020 16:05 UTC
39 points
17 comments2 min readLW link

Bayesian Evolv­ing-to-Extinction

abramdemski14 Feb 2020 23:55 UTC
38 points
13 comments5 min readLW link

Do Suffi­ciently Ad­vanced Agents Use Logic?

abramdemski13 Sep 2019 19:53 UTC
41 points
11 comments9 min readLW link

World State is the Wrong Ab­strac­tion for Impact

TurnTrout1 Oct 2019 21:03 UTC
62 points
19 comments2 min readLW link

At­tain­able Utility Preser­va­tion: Concepts

TurnTrout17 Feb 2020 5:20 UTC
38 points
20 comments1 min readLW link

At­tain­able Utility Preser­va­tion: Em­piri­cal Results

22 Feb 2020 0:38 UTC
61 points
8 comments10 min readLW link1 review

How Low Should Fruit Hang Be­fore We Pick It?

TurnTrout25 Feb 2020 2:08 UTC
28 points
9 comments12 min readLW link

At­tain­able Utility Preser­va­tion: Scal­ing to Superhuman

TurnTrout27 Feb 2020 0:52 UTC
28 points
21 comments8 min readLW link

Rea­sons for Ex­cite­ment about Im­pact of Im­pact Mea­sure Research

TurnTrout27 Feb 2020 21:42 UTC
33 points
8 comments4 min readLW link

Power as Easily Ex­ploitable Opportunities

TurnTrout1 Aug 2020 2:14 UTC
30 points
5 comments6 min readLW link

[Question] Would AGIs par­ent young AGIs?

Vishrut Arya2 Aug 2020 0:57 UTC
3 points
6 comments1 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

Non-Con­se­quen­tial­ist Co­op­er­a­tion?

abramdemski11 Jan 2019 9:15 UTC
48 points
15 comments7 min readLW link

Cu­ri­os­ity Killed the Cat and the Asymp­tot­i­cally Op­ti­mal Agent

michaelcohen20 Feb 2020 17:28 UTC
27 points
15 comments1 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Re­sponse to Oren Etz­ioni’s “How to know if ar­tifi­cial in­tel­li­gence is about to de­stroy civ­i­liza­tion”

Daniel Kokotajlo27 Feb 2020 18:10 UTC
27 points
5 comments8 min readLW link

Clar­ify­ing Power-Seek­ing and In­stru­men­tal Convergence

TurnTrout20 Dec 2019 19:59 UTC
42 points
7 comments3 min readLW link

How im­por­tant are MDPs for AGI (Safety)?

michaelcohen26 Mar 2020 20:32 UTC
14 points
8 comments2 min readLW link

Syn­the­siz­ing am­plifi­ca­tion and debate

evhub5 Feb 2020 22:53 UTC
33 points
10 comments4 min readLW link

is gpt-3 few-shot ready for real ap­pli­ca­tions?

nostalgebraist3 Aug 2020 19:50 UTC
31 points
5 comments9 min readLW link
(nostalgebraist.tumblr.com)

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearner4 Aug 2020 19:03 UTC
52 points
5 comments15 min readLW link

In­finite Data/​Com­pute Ar­gu­ments in Alignment

johnswentworth4 Aug 2020 20:21 UTC
49 points
6 comments2 min readLW link

Four Ways An Im­pact Mea­sure Could Help Alignment

Matthew Barnett8 Aug 2019 0:10 UTC
21 points
1 comment8 min readLW link

Un­der­stand­ing Re­cent Im­pact Measures

Matthew Barnett7 Aug 2019 4:57 UTC
16 points
6 comments7 min readLW link

A Sur­vey of Early Im­pact Measures

Matthew Barnett6 Aug 2019 1:22 UTC
23 points
0 comments8 min readLW link

Op­ti­miza­tion Reg­u­lariza­tion through Time Penalty

Linda Linsefors1 Jan 2019 13:05 UTC
11 points
4 comments3 min readLW link

Stable Poin­t­ers to Value III: Re­cur­sive Quantilization

abramdemski21 Jul 2018 8:06 UTC
19 points
4 comments4 min readLW link

Thoughts on Quantilizers

Stuart_Armstrong2 Jun 2017 16:24 UTC
2 points
0 comments2 min readLW link

Quan­tiliz­ers max­i­mize ex­pected util­ity sub­ject to a con­ser­va­tive cost constraint

jessicata28 Sep 2015 2:17 UTC
25 points
0 comments5 min readLW link

Quan­tilal con­trol for finite MDPs

Vanessa Kosoy12 Apr 2018 9:21 UTC
14 points
0 comments13 min readLW link

The limits of corrigibility

Stuart_Armstrong10 Apr 2018 10:49 UTC
27 points
9 comments4 min readLW link

Align­ment Newslet­ter #16: 07/​23/​18

Rohin Shah23 Jul 2018 16:20 UTC
42 points
0 comments12 min readLW link
(mailchi.mp)

Mea­sur­ing hard­ware overhang

hippke5 Aug 2020 19:59 UTC
106 points
14 comments4 min readLW link

[AN #111]: The Cir­cuits hy­pothe­ses for deep learning

Rohin Shah5 Aug 2020 17:40 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

Self-Fulfilling Prophe­cies Aren’t Always About Self-Awareness

John_Maxwell18 Nov 2019 23:11 UTC
14 points
7 comments4 min readLW link

The Good­hart Game

John_Maxwell18 Nov 2019 23:22 UTC
13 points
5 comments5 min readLW link

Why don’t sin­gu­lar­i­tar­i­ans bet on the cre­ation of AGI by buy­ing stocks?

John_Maxwell11 Mar 2020 16:27 UTC
43 points
20 comments4 min readLW link

The Dual­ist Pre­dict-O-Matic ($100 prize)

John_Maxwell17 Oct 2019 6:45 UTC
16 points
35 comments5 min readLW link

[Question] What AI safety prob­lems need solv­ing for safe AI re­search as­sis­tants?

John_Maxwell5 Nov 2019 2:09 UTC
14 points
13 comments1 min readLW link

Refin­ing the Evolu­tion­ary Anal­ogy to AI

lberglund7 Aug 2020 23:13 UTC
9 points
2 comments4 min readLW link

The Fu­sion Power Gen­er­a­tor Scenario

johnswentworth8 Aug 2020 18:31 UTC
136 points
29 comments3 min readLW link

[Question] How much is known about the “in­fer­ence rules” of log­i­cal in­duc­tion?

Eigil Rischel8 Aug 2020 10:45 UTC
11 points
7 comments1 min readLW link

If I were a well-in­ten­tioned AI… II: Act­ing in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

If I were a well-in­ten­tioned AI… III: Ex­tremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC
22 points
0 comments5 min readLW link

Towards a For­mal­i­sa­tion of Log­i­cal Counterfactuals

Bunthut8 Aug 2020 22:14 UTC
6 points
2 comments2 min readLW link

[Question] 10/​50/​90% chance of GPT-N Trans­for­ma­tive AI?

human_generated_text9 Aug 2020 0:10 UTC
24 points
8 comments1 min readLW link

[Question] Can we ex­pect more value from AI al­ign­ment than from an ASI with the goal of run­ning al­ter­nate tra­jec­to­ries of our uni­verse?

Maxime Riché9 Aug 2020 17:17 UTC
2 points
5 comments1 min readLW link

In defense of Or­a­cle (“Tool”) AI research

Steven Byrnes7 Aug 2019 19:14 UTC
21 points
11 comments4 min readLW link

How GPT-N will es­cape from its AI-box

hippke12 Aug 2020 19:34 UTC
7 points
9 comments1 min readLW link

Strong im­pli­ca­tion of prefer­ence uncertainty

Stuart_Armstrong12 Aug 2020 19:02 UTC
20 points
3 comments2 min readLW link

[AN #112]: Eng­ineer­ing a Safer World

Rohin Shah13 Aug 2020 17:20 UTC
25 points
2 comments12 min readLW link
(mailchi.mp)

Room and Board for Peo­ple Self-Learn­ing ML or Do­ing In­de­pen­dent ML Research

SamuelKnoche14 Aug 2020 17:19 UTC
7 points
1 comment1 min readLW link

Talk and Q&A—Dan Hendrycks—Paper: Align­ing AI With Shared Hu­man Values. On Dis­cord at Aug 28, 2020 8:00-10:00 AM GMT+8.

wassname14 Aug 2020 23:57 UTC
1 point
0 comments1 min readLW link

Search ver­sus design

Alex Flint16 Aug 2020 16:53 UTC
89 points
41 comments36 min readLW link1 review

Work on Se­cu­rity In­stead of Friendli­ness?

Wei Dai21 Jul 2012 18:28 UTC
51 points
107 comments2 min readLW link

Goal-Direct­ed­ness: What Suc­cess Looks Like

adamShimi16 Aug 2020 18:33 UTC
9 points
0 comments2 min readLW link

[Question] A way to beat su­per­ra­tional/​EDT agents?

Abhimanyu Pallavi Sudhir17 Aug 2020 14:33 UTC
5 points
13 comments1 min readLW link

Learn­ing hu­man prefer­ences: op­ti­mistic and pes­simistic scenarios

Stuart_Armstrong18 Aug 2020 13:05 UTC
27 points
6 comments6 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

Why we want un­bi­ased learn­ing processes

Stuart_Armstrong20 Feb 2018 14:48 UTC
13 points
3 comments3 min readLW link

In­tu­itive ex­am­ples of re­ward func­tion learn­ing?

Stuart_Armstrong6 Mar 2018 16:54 UTC
7 points
3 comments2 min readLW link

Open-Cat­e­gory Classification

TurnTrout28 Mar 2018 14:49 UTC
13 points
6 comments10 min readLW link

Look­ing for ad­ver­sar­ial col­lab­o­ra­tors to test our De­bate protocol

Beth Barnes19 Aug 2020 3:15 UTC
52 points
5 comments1 min readLW link

Walk­through of ‘For­mal­iz­ing Con­ver­gent In­stru­men­tal Goals’

TurnTrout26 Feb 2018 2:20 UTC
10 points
2 comments10 min readLW link

Am­bi­guity Detection

TurnTrout1 Mar 2018 4:23 UTC
11 points
9 comments4 min readLW link

Pe­nal­iz­ing Im­pact via At­tain­able Utility Preservation

TurnTrout28 Dec 2018 21:46 UTC
24 points
0 comments3 min readLW link
(arxiv.org)

What You See Isn’t Always What You Want

TurnTrout13 Sep 2019 4:17 UTC
30 points
12 comments3 min readLW link

[Question] In­stru­men­tal Oc­cam?

abramdemski31 Jan 2020 19:27 UTC
30 points
15 comments1 min readLW link

Com­pact vs. Wide Models

Vaniver16 Jul 2018 4:09 UTC
31 points
5 comments3 min readLW link

Alex Ir­pan: “My AI Timelines Have Sped Up”

Vaniver19 Aug 2020 16:23 UTC
43 points
20 comments1 min readLW link
(www.alexirpan.com)

[AN #113]: Check­ing the eth­i­cal in­tu­itions of large lan­guage models

Rohin Shah19 Aug 2020 17:10 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

AI safety as feather­less bipeds *with broad flat nails*

Stuart_Armstrong19 Aug 2020 10:22 UTC
37 points
1 comment1 min readLW link

Time Magaz­ine has an ar­ti­cle about the Sin­gu­lar­ity...

Raemon11 Feb 2011 2:20 UTC
40 points
13 comments1 min readLW link

How rapidly are GPUs im­prov­ing in price perfor­mance?

gallabytes25 Nov 2018 19:54 UTC
31 points
9 comments1 min readLW link
(mediangroup.org)

Our val­ues are un­der­defined, change­able, and manipulable

Stuart_Armstrong2 Nov 2017 11:09 UTC
25 points
6 comments3 min readLW link

[Question] What fund­ing sources ex­ist for tech­ni­cal AI safety re­search?

johnswentworth1 Oct 2019 15:30 UTC
26 points
5 comments1 min readLW link

Hu­mans can drive cars

Apprentice30 Jan 2014 11:55 UTC
53 points
89 comments2 min readLW link

A Less Wrong sin­gu­lar­ity ar­ti­cle?

Kaj_Sotala17 Nov 2009 14:15 UTC
31 points
215 comments1 min readLW link

The Bayesian Tyrant

abramdemski20 Aug 2020 0:08 UTC
132 points
20 comments6 min readLW link1 review

Con­cept Safety: Pro­duc­ing similar AI-hu­man con­cept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC
50 points
45 comments8 min readLW link

[LINK] What should a rea­son­able per­son be­lieve about the Sin­gu­lar­ity?

Kaj_Sotala13 Jan 2011 9:32 UTC
38 points
14 comments2 min readLW link

The many ways AIs be­have badly

Stuart_Armstrong24 Apr 2018 11:40 UTC
10 points
3 comments2 min readLW link

July 2020 gw­ern.net newsletter

gwern20 Aug 2020 16:39 UTC
29 points
0 comments1 min readLW link
(www.gwern.net)

Do what we mean vs. do what we say

Rohin Shah30 Aug 2018 22:03 UTC
34 points
14 comments1 min readLW link

[Question] What’s a De­com­pos­able Align­ment Topic?

Logan Riggs21 Aug 2020 22:57 UTC
26 points
16 comments1 min readLW link

Tools ver­sus agents

Stuart_Armstrong16 May 2012 13:00 UTC
47 points
39 comments5 min readLW link

An un­al­igned benchmark

paulfchristiano17 Nov 2018 15:51 UTC
31 points
0 comments9 min readLW link

Fol­low­ing hu­man norms

Rohin Shah20 Jan 2019 23:59 UTC
30 points
10 comments5 min readLW link

nos­talge­braist: Re­cur­sive Good­hart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC
53 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

[AN #114]: The­ory-in­spired safety solu­tions for pow­er­ful Bayesian RL agents

Rohin Shah26 Aug 2020 17:20 UTC
21 points
3 comments8 min readLW link
(mailchi.mp)

[Question] How hard would it be to change GPT-3 in a way that al­lows au­dio?

ChristianKl28 Aug 2020 14:42 UTC
8 points
5 comments1 min readLW link

Safe Scram­bling?

Hoagy29 Aug 2020 14:31 UTC
3 points
1 comment2 min readLW link

(Hu­mor) AI Align­ment Crit­i­cal Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC
24 points
2 comments1 min readLW link
(sl4.org)

What is am­bi­tious value learn­ing?

Rohin Shah1 Nov 2018 16:20 UTC
49 points
28 comments2 min readLW link

The easy goal in­fer­ence prob­lem is still hard

paulfchristiano3 Nov 2018 14:41 UTC
50 points
19 comments4 min readLW link

[AN #115]: AI safety re­search prob­lems in the AI-GA framework

Rohin Shah2 Sep 2020 17:10 UTC
19 points
16 comments6 min readLW link
(mailchi.mp)

Emo­tional valence vs RL re­ward: a video game analogy

Steven Byrnes3 Sep 2020 15:28 UTC
12 points
6 comments4 min readLW link

Us­ing GPT-N to Solve In­ter­pretabil­ity of Neu­ral Net­works: A Re­search Agenda

3 Sep 2020 18:27 UTC
67 points
12 comments2 min readLW link

“Learn­ing to Sum­ma­rize with Hu­man Feed­back”—OpenAI

[deleted]7 Sep 2020 17:59 UTC
57 points
3 comments1 min readLW link

[AN #116]: How to make ex­pla­na­tions of neu­rons compositional

Rohin Shah9 Sep 2020 17:20 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

Safer sand­box­ing via col­lec­tive separation

Richard_Ngo9 Sep 2020 19:49 UTC
24 points
6 comments4 min readLW link

[Question] Do mesa-op­ti­mizer risk ar­gu­ments rely on the train-test paradigm?

Ben Cottier10 Sep 2020 15:36 UTC
12 points
7 comments1 min readLW link

Safety via se­lec­tion for obedience

Richard_Ngo10 Sep 2020 10:04 UTC
31 points
1 comment5 min readLW link

How Much Com­pu­ta­tional Power Does It Take to Match the Hu­man Brain?

habryka12 Sep 2020 6:38 UTC
44 points
1 comment1 min readLW link
(www.openphilanthropy.org)

De­ci­sion The­ory is multifaceted

Michele Campolo13 Sep 2020 22:30 UTC
7 points
12 comments8 min readLW link

AI Safety Dis­cus­sion Day

Linda Linsefors15 Sep 2020 14:40 UTC
20 points
0 comments1 min readLW link

[AN #117]: How neu­ral nets would fare un­der the TEVV framework

Rohin Shah16 Sep 2020 17:20 UTC
27 points
0 comments7 min readLW link
(mailchi.mp)

Ap­ply­ing the Coun­ter­fac­tual Pri­soner’s Dilemma to Log­i­cal Uncertainty

Chris_Leong16 Sep 2020 10:34 UTC
9 points
5 comments2 min readLW link

Ar­tifi­cial In­tel­li­gence: A Modern Ap­proach (4th edi­tion) on the Align­ment Problem

Zack_M_Davis17 Sep 2020 2:23 UTC
72 points
12 comments5 min readLW link
(aima.cs.berkeley.edu)

The “Backchain­ing to Lo­cal Search” Tech­nique in AI Alignment

adamShimi18 Sep 2020 15:05 UTC
28 points
1 comment2 min readLW link

Draft re­port on AI timelines

Ajeya Cotra18 Sep 2020 23:47 UTC
207 points
56 comments1 min readLW link1 review

Why GPT wants to mesa-op­ti­mize & how we might change this

John_Maxwell19 Sep 2020 13:48 UTC
55 points
32 comments9 min readLW link

My (Mis)Ad­ven­tures With Al­gorith­mic Ma­chine Learning

AHartNtkn20 Sep 2020 5:31 UTC
16 points
4 comments41 min readLW link

[Question] What AI com­pa­nies would be most likely to have a pos­i­tive long-term im­pact on the world as a re­sult of in­vest­ing in them?

MikkW21 Sep 2020 23:41 UTC
8 points
2 comments2 min readLW link

An­thro­po­mor­phi­sa­tion vs value learn­ing: type 1 vs type 2 errors

Stuart_Armstrong22 Sep 2020 10:46 UTC
16 points
10 comments1 min readLW link

AI Ad­van­tages [Gems from the Wiki]

22 Sep 2020 22:44 UTC
22 points
7 comments2 min readLW link
(www.lesswrong.com)

A long re­ply to Ben Garfinkel on Scru­ti­niz­ing Clas­sic AI Risk Arguments

Søren Elverlin27 Sep 2020 17:51 UTC
17 points
6 comments1 min readLW link

De­hu­man­i­sa­tion *er­rors*

Stuart_Armstrong23 Sep 2020 9:51 UTC
13 points
0 comments1 min readLW link

[AN #118]: Risks, solu­tions, and pri­ori­ti­za­tion in a world with many AI systems

Rohin Shah23 Sep 2020 18:20 UTC
15 points
6 comments10 min readLW link
(mailchi.mp)

[Question] David Deutsch on Univer­sal Ex­plain­ers and AI

alanf24 Sep 2020 7:50 UTC
3 points
8 comments2 min readLW link

KL Diver­gence as Code Patch­ing Efficiency

Past Account27 Sep 2020 16:06 UTC
17 points
0 comments8 min readLW link

[Question] What to do with imi­ta­tion hu­mans, other than ask­ing them what the right thing to do is?

Charlie Steiner27 Sep 2020 21:51 UTC
10 points
6 comments1 min readLW link

[Question] What De­ci­sion The­ory is Im­plied By Pre­dic­tive Pro­cess­ing?

johnswentworth28 Sep 2020 17:20 UTC
55 points
17 comments1 min readLW link

AGI safety from first prin­ci­ples: Superintelligence

Richard_Ngo28 Sep 2020 19:53 UTC
80 points
6 comments9 min readLW link

AGI safety from first prin­ci­ples: Introduction

Richard_Ngo28 Sep 2020 19:53 UTC
109 points
18 comments2 min readLW link1 review

[Question] Ex­am­ples of self-gov­er­nance to re­duce tech­nol­ogy risk?

Jia29 Sep 2020 19:31 UTC
10 points
4 comments1 min readLW link

AGI safety from first prin­ci­ples: Goals and Agency

Richard_Ngo29 Sep 2020 19:06 UTC
70 points
15 comments15 min readLW link

“Un­su­per­vised” trans­la­tion as an (in­tent) al­ign­ment problem

paulfchristiano30 Sep 2020 0:50 UTC
61 points
15 comments4 min readLW link
(ai-alignment.com)

[AN #119]: AI safety when agents are shaped by en­vi­ron­ments, not rewards

Rohin Shah30 Sep 2020 17:10 UTC
11 points
0 comments11 min readLW link
(mailchi.mp)

AGI safety from first prin­ci­ples: Control

Richard_Ngo2 Oct 2020 21:51 UTC
61 points
4 comments9 min readLW link

AI race con­sid­er­a­tions in a re­port by the U.S. House Com­mit­tee on Armed Services

NunoSempere4 Oct 2020 12:11 UTC
42 points
4 comments13 min readLW link

[Question] Is there any work on in­cor­po­rat­ing aleatoric un­cer­tainty and/​or in­her­ent ran­dom­ness into AIXI?

David Scott Krueger (formerly: capybaralet)4 Oct 2020 8:10 UTC
9 points
7 comments1 min readLW link

AGI safety from first prin­ci­ples: Conclusion

Richard_Ngo4 Oct 2020 23:06 UTC
65 points
4 comments3 min readLW link

Univer­sal Eudaimonia

hg005 Oct 2020 13:45 UTC
19 points
6 comments2 min readLW link

The Align­ment Prob­lem: Ma­chine Learn­ing and Hu­man Values

Rohin Shah6 Oct 2020 17:41 UTC
120 points
7 comments6 min readLW link1 review
(www.amazon.com)

[AN #120]: Trac­ing the in­tel­lec­tual roots of AI and AI alignment

Rohin Shah7 Oct 2020 17:10 UTC
13 points
4 comments10 min readLW link
(mailchi.mp)

[Question] Brain­storm­ing pos­i­tive vi­sions of AI

jungofthewon7 Oct 2020 16:09 UTC
52 points
25 comments1 min readLW link

[Question] How can an AI demon­strate purely through chat that it is an AI, and not a hu­man?

hugh.mann7 Oct 2020 17:53 UTC
3 points
4 comments1 min readLW link

[Question] Why isn’t JS a pop­u­lar lan­guage for deep learn­ing?

Will Clark8 Oct 2020 14:36 UTC
12 points
21 comments1 min readLW link

[Question] If GPT-6 is hu­man-level AGI but costs $200 per page of out­put, what would hap­pen?

Daniel Kokotajlo9 Oct 2020 12:00 UTC
28 points
30 comments1 min readLW link

[Question] Shouldn’t there be a Chi­nese trans­la­tion of Hu­man Com­pat­i­ble?

mako yass9 Oct 2020 8:47 UTC
18 points
13 comments1 min readLW link

Ideal­ized Fac­tored Cognition

Rafael Harth30 Nov 2020 18:49 UTC
34 points
6 comments11 min readLW link

[Question] Re­views of the book ‘The Align­ment Prob­lem’

Mati_Roy11 Oct 2020 7:41 UTC
8 points
3 comments1 min readLW link

[Question] Re­views of TV show NeXt (about AI safety)

Mati_Roy11 Oct 2020 4:31 UTC
25 points
4 comments1 min readLW link

The Achilles Heel Hy­poth­e­sis for AI

scasper13 Oct 2020 14:35 UTC
20 points
6 comments1 min readLW link

Toy Prob­lem: De­tec­tive Story Alignment

johnswentworth13 Oct 2020 21:02 UTC
34 points
4 comments2 min readLW link

[Question] Does any­one worry about A.I. fo­rums like this where they re­in­force each other’s bi­ases/​ are led by big tech?

misabella1613 Oct 2020 15:14 UTC
4 points
3 comments1 min readLW link

[AN #121]: Fore­cast­ing trans­for­ma­tive AI timelines us­ing biolog­i­cal anchors

Rohin Shah14 Oct 2020 17:20 UTC
27 points
5 comments14 min readLW link
(mailchi.mp)

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
99 points
39 comments3 min readLW link2 reviews

Im­pact mea­sure­ment and value-neu­tral­ity verification

evhub15 Oct 2019 0:06 UTC
31 points
13 comments6 min readLW link

Outer al­ign­ment and imi­ta­tive amplification

evhub10 Jan 2020 0:26 UTC
24 points
11 comments9 min readLW link

Safe ex­plo­ra­tion and corrigibility

evhub28 Dec 2019 23:12 UTC
17 points
4 comments4 min readLW link

[Question] What are some non-purely-sam­pling ways to do deep RL?

evhub5 Dec 2019 0:09 UTC
15 points
9 comments2 min readLW link

More vari­a­tions on pseudo-alignment

evhub4 Nov 2019 23:24 UTC
26 points
8 comments3 min readLW link

Towards an em­piri­cal in­ves­ti­ga­tion of in­ner alignment

evhub23 Sep 2019 20:43 UTC
44 points
9 comments6 min readLW link

Are min­i­mal cir­cuits de­cep­tive?

evhub7 Sep 2019 18:11 UTC
66 points
11 comments8 min readLW link

Con­crete ex­per­i­ments in in­ner alignment

evhub6 Sep 2019 22:16 UTC
63 points
12 comments6 min readLW link

Towards a mechanis­tic un­der­stand­ing of corrigibility

evhub22 Aug 2019 23:20 UTC
44 points
26 comments6 min readLW link

A Con­crete Pro­posal for Ad­ver­sar­ial IDA

evhub26 Mar 2019 19:50 UTC
16 points
5 comments5 min readLW link

Nuances with as­crip­tion universality

evhub12 Feb 2019 23:38 UTC
20 points
1 comment2 min readLW link

Box in­ver­sion hypothesis

Jan Kulveit20 Oct 2020 16:20 UTC
59 points
4 comments3 min readLW link

[Question] Has any­one re­searched speci­fi­ca­tion gam­ing with biolog­i­cal an­i­mals?

David Scott Krueger (formerly: capybaralet)21 Oct 2020 0:20 UTC
9 points
3 comments1 min readLW link

Sun­day Oc­to­ber 25, 12:00PM (PT) — Scott Garrabrant on “Carte­sian Frames”

Ben Pace21 Oct 2020 3:27 UTC
48 points
3 comments2 min readLW link

[Question] Could we use recom­mender sys­tems to figure out hu­man val­ues?

Olga Babeeva20 Oct 2020 21:35 UTC
7 points
2 comments1 min readLW link

[Question] When was the term “AI al­ign­ment” coined?

David Scott Krueger (formerly: capybaralet)21 Oct 2020 18:27 UTC
11 points
8 comments1 min readLW link

[AN #122]: Ar­gu­ing for AGI-driven ex­is­ten­tial risk from first principles

Rohin Shah21 Oct 2020 17:10 UTC
28 points
0 comments9 min readLW link
(mailchi.mp)

[Question] What’s the differ­ence be­tween GAI and a gov­ern­ment?

DirectedEvolution21 Oct 2020 23:04 UTC
11 points
5 comments1 min readLW link

Mo­ral AI: Options

Manfred11 Jul 2015 21:46 UTC
14 points
6 comments4 min readLW link

Can few-shot learn­ing teach AI right from wrong?

Charlie Steiner20 Jul 2018 7:45 UTC
13 points
3 comments6 min readLW link

Some Com­ments on Stu­art Arm­strong’s “Re­search Agenda v0.9”

Charlie Steiner8 Jul 2019 19:03 UTC
21 points
12 comments4 min readLW link

The Ar­tifi­cial In­ten­tional Stance

Charlie Steiner27 Jul 2019 7:00 UTC
12 points
0 comments4 min readLW link

What’s the dream for giv­ing nat­u­ral lan­guage com­mands to AI?

Charlie Steiner8 Oct 2019 13:42 UTC
8 points
8 comments7 min readLW link

Su­per­vised learn­ing of out­puts in the brain

Steven Byrnes26 Oct 2020 14:32 UTC
27 points
9 comments10 min readLW link

Hu­mans are stun­ningly ra­tio­nal and stun­ningly irrational

Stuart_Armstrong23 Oct 2020 14:13 UTC
21 points
4 comments2 min readLW link

Re­ply to Je­bari and Lund­borg on Ar­tifi­cial Superintelligence

Richard_Ngo25 Oct 2020 13:50 UTC
31 points
4 comments5 min readLW link
(thinkingcomplete.blogspot.com)

Ad­di­tive Oper­a­tions on Carte­sian Frames

Scott Garrabrant26 Oct 2020 15:12 UTC
61 points
6 comments11 min readLW link

Se­cu­rity Mind­set and Take­off Speeds

DanielFilan27 Oct 2020 3:20 UTC
54 points
23 comments8 min readLW link
(danielfilan.com)

Biex­ten­sional Equivalence

Scott Garrabrant28 Oct 2020 14:07 UTC
43 points
13 comments10 min readLW link

Draft pa­pers for REALab and De­cou­pled Ap­proval on tampering

28 Oct 2020 16:01 UTC
47 points
2 comments1 min readLW link

[AN #123]: In­fer­ring what is valuable in or­der to al­ign recom­mender systems

Rohin Shah28 Oct 2020 17:00 UTC
20 points
1 comment8 min readLW link
(mailchi.mp)

“Scal­ing Laws for Au­tore­gres­sive Gen­er­a­tive Model­ing”, Henighan et al 2020 {OA}

gwern29 Oct 2020 1:45 UTC
26 points
11 comments1 min readLW link
(arxiv.org)

Con­trol­lables and Ob­serv­ables, Revisited

Scott Garrabrant29 Oct 2020 16:38 UTC
34 points
5 comments8 min readLW link

AI risk hub in Sin­ga­pore?

Daniel Kokotajlo29 Oct 2020 11:45 UTC
57 points
18 comments4 min readLW link

Func­tors and Coarse Worlds

Scott Garrabrant30 Oct 2020 15:19 UTC
50 points
4 comments8 min readLW link

[Question] Re­sponses to Chris­ti­ano on take­off speeds?

Richard_Ngo30 Oct 2020 15:16 UTC
29 points
8 comments1 min readLW link

/​r/​MLS­cal­ing: new sub­red­dit for NN scal­ing re­search/​discussion

gwern30 Oct 2020 20:50 UTC
20 points
0 comments1 min readLW link
(www.reddit.com)

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworth31 Oct 2020 20:18 UTC
61 points
38 comments5 min readLW link

Au­to­mated in­tel­li­gence is not AI

KatjaGrace1 Nov 2020 23:30 UTC
54 points
10 comments2 min readLW link
(meteuphoric.com)

Con­fu­ci­anism in AI Alignment

johnswentworth2 Nov 2020 21:16 UTC
33 points
28 comments6 min readLW link

[AN #124]: Prov­ably safe ex­plo­ra­tion through shielding

Rohin Shah4 Nov 2020 18:20 UTC
13 points
0 comments9 min readLW link
(mailchi.mp)

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

Sub-Sums and Sub-Tensors

Scott Garrabrant5 Nov 2020 18:06 UTC
34 points
4 comments8 min readLW link

Mul­ti­plica­tive Oper­a­tions on Carte­sian Frames

Scott Garrabrant3 Nov 2020 19:27 UTC
34 points
23 comments12 min readLW link

Subagents of Carte­sian Frames

Scott Garrabrant2 Nov 2020 22:02 UTC
48 points
5 comments8 min readLW link

[Question] What con­sid­er­a­tions in­fluence whether I have more in­fluence over short or long timelines?

Daniel Kokotajlo5 Nov 2020 19:56 UTC
27 points
30 comments1 min readLW link

Ad­di­tive and Mul­ti­plica­tive Subagents

Scott Garrabrant6 Nov 2020 14:26 UTC
20 points
7 comments12 min readLW link

Com­mit­ting, As­sum­ing, Ex­ter­nal­iz­ing, and Internalizing

Scott Garrabrant9 Nov 2020 16:59 UTC
31 points
25 comments10 min readLW link

Build­ing AGI Us­ing Lan­guage Models

leogao9 Nov 2020 16:33 UTC
11 points
1 comment1 min readLW link
(leogao.dev)

Why You Should Care About Goal-Directedness

adamShimi9 Nov 2020 12:48 UTC
37 points
15 comments9 min readLW link

Clar­ify­ing in­ner al­ign­ment terminology

evhub9 Nov 2020 20:40 UTC
98 points
17 comments3 min readLW link1 review

Eight Defi­ni­tions of Observability

Scott Garrabrant10 Nov 2020 23:37 UTC
34 points
26 comments12 min readLW link

[AN #125]: Neu­ral net­work scal­ing laws across mul­ti­ple modalities

Rohin Shah11 Nov 2020 18:20 UTC
25 points
7 comments9 min readLW link
(mailchi.mp)

Time in Carte­sian Frames

Scott Garrabrant11 Nov 2020 20:25 UTC
48 points
16 comments7 min readLW link

Learn­ing Nor­ma­tivity: A Re­search Agenda

abramdemski11 Nov 2020 21:59 UTC
76 points
18 comments19 min readLW link

[Question] Any work on hon­ey­pots (to de­tect treach­er­ous turn at­tempts)?

David Scott Krueger (formerly: capybaralet)12 Nov 2020 5:41 UTC
17 points
4 comments1 min readLW link

Misal­ign­ment and mi­suse: whose val­ues are man­i­fest?

KatjaGrace13 Nov 2020 10:10 UTC
42 points
7 comments2 min readLW link
(meteuphoric.com)

A Self-Embed­ded Prob­a­bil­is­tic Model

johnswentworth13 Nov 2020 20:36 UTC
30 points
2 comments5 min readLW link

TU Darm­stadt, Com­puter Science Master’s with a fo­cus on Ma­chine Learning

Master Programs ML/AI14 Nov 2020 15:50 UTC
6 points
0 comments8 min readLW link

EPF Lau­sanne, ML re­lated MSc programs

Master Programs ML/AI14 Nov 2020 15:51 UTC
3 points
0 comments4 min readLW link

ETH Zurich, ML re­lated MSc programs

Master Programs ML/AI14 Nov 2020 15:49 UTC
3 points
0 comments10 min readLW link

Univer­sity of Oxford, Master’s Statis­ti­cal Science

Master Programs ML/AI14 Nov 2020 15:51 UTC
3 points
0 comments3 min readLW link

Univer­sity of Ed­in­burgh, Master’s Ar­tifi­cial Intelligence

Master Programs ML/AI14 Nov 2020 15:49 UTC
4 points
0 comments12 min readLW link

Univer­sity of Am­s­ter­dam (UvA), Master’s Ar­tifi­cial Intelligence

Master Programs ML/AI14 Nov 2020 15:49 UTC
16 points
6 comments21 min readLW link

Univer­sity of Tübin­gen, Master’s Ma­chine Learning

Master Programs ML/AI14 Nov 2020 15:50 UTC
14 points
0 comments7 min readLW link

A guide to Iter­ated Am­plifi­ca­tion & Debate

Rafael Harth15 Nov 2020 17:14 UTC
68 points
10 comments15 min readLW link

Solomonoff In­duc­tion and Sleep­ing Beauty

ike17 Nov 2020 2:28 UTC
7 points
0 comments2 min readLW link

The Poin­t­ers Prob­lem: Hu­man Values Are A Func­tion Of Hu­mans’ La­tent Variables

johnswentworth18 Nov 2020 17:47 UTC
104 points
43 comments11 min readLW link2 reviews

The ethics of AI for the Rout­ledge En­cy­clo­pe­dia of Philosophy

Stuart_Armstrong18 Nov 2020 17:55 UTC
45 points
8 comments1 min readLW link

Per­sua­sion Tools: AI takeover with­out AGI or agency?

Daniel Kokotajlo20 Nov 2020 16:54 UTC
74 points
24 comments11 min readLW link1 review

UDT might not pay a Coun­ter­fac­tual Mugger

winwonce21 Nov 2020 23:27 UTC
5 points
18 comments2 min readLW link

Chang­ing the AI race pay­off matrix

Gurkenglas22 Nov 2020 22:25 UTC
7 points
2 comments1 min readLW link

Syn­tax, se­man­tics, and sym­bol ground­ing, simplified

Stuart_Armstrong23 Nov 2020 16:12 UTC
30 points
4 comments9 min readLW link

Com­men­tary on AGI Safety from First Principles

Richard_Ngo23 Nov 2020 21:37 UTC
80 points
4 comments54 min readLW link

[Question] Cri­tiques of the Agent Foun­da­tions agenda?

Jsevillamol24 Nov 2020 16:11 UTC
16 points
3 comments1 min readLW link

[Question] How should OpenAI com­mu­ni­cate about the com­mer­cial perfor­mances of the GPT-3 API?

Maxime Riché24 Nov 2020 8:34 UTC
2 points
0 comments1 min readLW link

[AN #126]: Avoid­ing wire­head­ing by de­cou­pling ac­tion feed­back from ac­tion effects

Rohin Shah26 Nov 2020 23:20 UTC
24 points
1 comment10 min readLW link
(mailchi.mp)

[Question] Is this a good way to bet on short timelines?

Daniel Kokotajlo28 Nov 2020 12:51 UTC
16 points
8 comments1 min readLW link

Pre­face to the Se­quence on Fac­tored Cognition

Rafael Harth30 Nov 2020 18:49 UTC
35 points
7 comments2 min readLW link

[Linkpost] AlphaFold: a solu­tion to a 50-year-old grand challenge in biology

adamShimi30 Nov 2020 17:33 UTC
54 points
22 comments1 min readLW link
(deepmind.com)

What is “pro­tein fold­ing”? A brief explanation

jasoncrawford1 Dec 2020 2:46 UTC
69 points
9 comments4 min readLW link
(rootsofprogress.org)

[Question] In a mul­ti­po­lar sce­nario, how do peo­ple ex­pect sys­tems to be trained to in­ter­act with sys­tems de­vel­oped by other labs?

JesseClifton1 Dec 2020 20:04 UTC
11 points
6 comments1 min readLW link

[AN #127]: Re­think­ing agency: Carte­sian frames as a for­mal­iza­tion of ways to carve up the world into an agent and its environment

Rohin Shah2 Dec 2020 18:20 UTC
46 points
0 comments13 min readLW link
(mailchi.mp)

Beyond 175 billion pa­ram­e­ters: Can we an­ti­ci­pate fu­ture GPT-X Ca­pa­bil­ities?

bakztfuture4 Dec 2020 23:42 UTC
−1 points
1 comment2 min readLW link

Thoughts on Robin Han­son’s AI Im­pacts interview

Steven Byrnes24 Nov 2019 1:40 UTC
25 points
3 comments7 min readLW link

[RXN#7] Rus­sian x-risks newslet­ter fall 2020

avturchin5 Dec 2020 16:28 UTC
12 points
0 comments3 min readLW link

The AI Safety Game (UPDATED)

Daniel Kokotajlo5 Dec 2020 10:27 UTC
44 points
9 comments3 min readLW link

Values Form a Shift­ing Land­scape (and why you might care)

VojtaKovarik5 Dec 2020 23:56 UTC
28 points
6 comments4 min readLW link

AI Prob­lems Shared by Non-AI Systems

VojtaKovarik5 Dec 2020 22:15 UTC
7 points
2 comments4 min readLW link

Chance that “AI safety ba­si­cally [doesn’t need] to be solved, we’ll just solve it by de­fault un­less we’re com­pletely com­pletely care­less”

8 Dec 2020 21:08 UTC
27 points
0 comments5 min readLW link

Min­i­mal Maps, Semi-De­ci­sions, and Neu­ral Representations

Past Account6 Dec 2020 15:15 UTC
30 points
2 comments4 min readLW link

Launch­ing the Fore­cast­ing AI Progress Tournament

Tamay7 Dec 2020 14:08 UTC
20 points
0 comments1 min readLW link
(www.metaculus.com)

[AN #128]: Pri­ori­tiz­ing re­search on AI ex­is­ten­tial safety based on its ap­pli­ca­tion to gov­er­nance demands

Rohin Shah9 Dec 2020 18:20 UTC
16 points
2 comments10 min readLW link
(mailchi.mp)

Sum­mary of AI Re­search Con­sid­er­a­tions for Hu­man Ex­is­ten­tial Safety (ARCHES)

peterbarnett9 Dec 2020 23:28 UTC
10 points
0 comments13 min readLW link

Clar­ify­ing Fac­tored Cognition

Rafael Harth13 Dec 2020 20:02 UTC
23 points
2 comments3 min readLW link

Ho­mo­gene­ity vs. het­ero­gene­ity in AI take­off scenarios

evhub16 Dec 2020 1:37 UTC
95 points
48 comments4 min readLW link

LBIT Proofs 8: Propo­si­tions 53-58

Diffractor16 Dec 2020 3:29 UTC
7 points
0 comments18 min readLW link

LBIT Proofs 6: Propo­si­tions 39-47

Diffractor16 Dec 2020 3:33 UTC
7 points
0 comments23 min readLW link

LBIT Proofs 5: Propo­si­tions 29-38

Diffractor16 Dec 2020 3:35 UTC
7 points
0 comments21 min readLW link

LBIT Proofs 3: Propo­si­tions 19-22

Diffractor16 Dec 2020 3:40 UTC
7 points
0 comments17 min readLW link

LBIT Proofs 2: Propo­si­tions 10-18

Diffractor16 Dec 2020 3:45 UTC
7 points
0 comments20 min readLW link

LBIT Proofs 1: Propo­si­tions 1-9

Diffractor16 Dec 2020 3:48 UTC
7 points
0 comments25 min readLW link

LBIT Proofs 4: Propo­si­tions 22-28

Diffractor16 Dec 2020 3:38 UTC
7 points
0 comments17 min readLW link

LBIT Proofs 7: Propo­si­tions 48-52

Diffractor16 Dec 2020 3:31 UTC
7 points
0 comments20 min readLW link

Less Ba­sic In­framea­sure Theory

Diffractor16 Dec 2020 3:52 UTC
22 points
1 comment61 min readLW link

[AN #129]: Ex­plain­ing dou­ble de­scent by mea­sur­ing bias and variance

Rohin Shah16 Dec 2020 18:10 UTC
14 points
1 comment7 min readLW link
(mailchi.mp)

Ma­chine learn­ing could be fun­da­men­tally unexplainable

George3d616 Dec 2020 13:32 UTC
26 points
15 comments15 min readLW link
(cerebralab.com)

Beta test GPT-3 based re­search assistant

jungofthewon16 Dec 2020 13:42 UTC
34 points
2 comments1 min readLW link

[Question] How long till In­verse AlphaFold?

Daniel Kokotajlo17 Dec 2020 19:56 UTC
41 points
18 comments1 min readLW link

Hier­ar­chi­cal plan­ning: con­text agents

Charlie Steiner19 Dec 2020 11:24 UTC
21 points
6 comments9 min readLW link

[Question] Is there a com­mu­nity al­igned with the idea of cre­at­ing species of AGI sys­tems for them to be­come our suc­ces­sors?

iamhefesto20 Dec 2020 19:06 UTC
−2 points
7 comments1 min readLW link

Intuition

Rafael Harth20 Dec 2020 21:49 UTC
26 points
1 comment6 min readLW link

2020 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks21 Dec 2020 15:27 UTC
137 points
14 comments68 min readLW link

TAI Safety Biblio­graphic Database

JessRiedel22 Dec 2020 17:42 UTC
70 points
10 comments17 min readLW link

An­nounc­ing AXRP, the AI X-risk Re­search Podcast

DanielFilan23 Dec 2020 20:00 UTC
54 points
6 comments1 min readLW link
(danielfilan.com)

[AN #130]: A new AI x-risk pod­cast, and re­views of the field

Rohin Shah24 Dec 2020 18:20 UTC
8 points
0 comments7 min readLW link
(mailchi.mp)

Can we model tech­nolog­i­cal sin­gu­lar­ity as the phase tran­si­tion?

Valentin202626 Dec 2020 3:20 UTC
4 points
3 comments4 min readLW link

AGI Align­ment Should Solve Cor­po­rate Alignment

magfrump27 Dec 2020 2:23 UTC
19 points
6 comments6 min readLW link

Against GDP as a met­ric for timelines and take­off speeds

Daniel Kokotajlo29 Dec 2020 17:42 UTC
131 points
16 comments14 min readLW link1 review

AXRP Epi­sode 3 - Ne­go­tiable Re­in­force­ment Learn­ing with An­drew Critch

DanielFilan29 Dec 2020 20:45 UTC
26 points
0 comments27 min readLW link

AXRP Epi­sode 1 - Ad­ver­sar­ial Poli­cies with Adam Gleave

DanielFilan29 Dec 2020 20:41 UTC
12 points
5 comments33 min readLW link

AXRP Epi­sode 2 - Learn­ing Hu­man Bi­ases with Ro­hin Shah

DanielFilan29 Dec 2020 20:43 UTC
13 points
0 comments35 min readLW link

Dario Amodei leaves OpenAI

Daniel Kokotajlo29 Dec 2020 19:31 UTC
69 points
12 comments1 min readLW link

[Question] What Are Some Alter­na­tive Ap­proaches to Un­der­stand­ing Agency/​In­tel­li­gence?

interstice29 Dec 2020 23:21 UTC
15 points
12 comments1 min readLW link

Why Neu­ral Net­works Gen­er­al­ise, and Why They Are (Kind of) Bayesian

Joar Skalse29 Dec 2020 13:33 UTC
67 points
58 comments1 min readLW link1 review

De­bate Minus Fac­tored Cognition

abramdemski29 Dec 2020 22:59 UTC
37 points
42 comments11 min readLW link

[AN #131]: For­mal­iz­ing the ar­gu­ment of ig­nored at­tributes in a util­ity function

Rohin Shah31 Dec 2020 18:20 UTC
13 points
4 comments9 min readLW link
(mailchi.mp)

Reflec­tions on Larks’ 2020 AI al­ign­ment liter­a­ture review

Alex Flint1 Jan 2021 22:53 UTC
79 points
8 comments6 min readLW link

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

The Na­tional Defense Autho­riza­tion Act Con­tains AI Provisions

ryan_b5 Jan 2021 15:51 UTC
30 points
24 comments1 min readLW link

The Poin­t­ers Prob­lem: Clar­ifi­ca­tions/​Variations

abramdemski5 Jan 2021 17:29 UTC
50 points
14 comments18 min readLW link

[AN #132]: Com­plex and sub­tly in­cor­rect ar­gu­ments as an ob­sta­cle to debate

Rohin Shah6 Jan 2021 18:20 UTC
19 points
1 comment19 min readLW link
(mailchi.mp)

Out-of-body rea­son­ing (OOBR)

Jon Zero9 Jan 2021 16:10 UTC
5 points
0 comments4 min readLW link

Re­view of Soft Take­off Can Still Lead to DSA

Daniel Kokotajlo10 Jan 2021 18:10 UTC
75 points
15 comments6 min readLW link

Re­view of ‘De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More’

TurnTrout12 Jan 2021 3:57 UTC
40 points
1 comment2 min readLW link

[AN #133]: Build­ing ma­chines that can co­op­er­ate (with hu­mans, in­sti­tu­tions, or other ma­chines)

Rohin Shah13 Jan 2021 18:10 UTC
14 points
0 comments9 min readLW link
(mailchi.mp)

An Ex­plo­ra­tory Toy AI Take­off Model

niplav13 Jan 2021 18:13 UTC
10 points
3 comments12 min readLW link

Some re­cent sur­vey pa­pers on (mostly near-term) AI safety, se­cu­rity, and assurance

Aryeh Englander13 Jan 2021 21:50 UTC
11 points
0 comments3 min readLW link

Thoughts on Ia­son Gabriel’s Ar­tifi­cial In­tel­li­gence, Values, and Alignment

Alex Flint14 Jan 2021 12:58 UTC
35 points
14 comments4 min readLW link

Why I’m ex­cited about Debate

Richard_Ngo15 Jan 2021 23:37 UTC
73 points
12 comments7 min readLW link

Ex­cerpt from Ar­bital Solomonoff in­duc­tion dialogue

Richard_Ngo17 Jan 2021 3:49 UTC
36 points
6 comments5 min readLW link
(arbital.com)

Short sum­mary of mAIry’s room

Stuart_Armstrong18 Jan 2021 18:11 UTC
26 points
2 comments4 min readLW link

DALL-E does sym­bol grounding

p.b.17 Jan 2021 21:20 UTC
6 points
0 comments1 min readLW link

Some thoughts on risks from nar­row, non-agen­tic AI

Richard_Ngo19 Jan 2021 0:04 UTC
35 points
21 comments16 min readLW link

Against the Back­ward Ap­proach to Goal-Directedness

adamShimi19 Jan 2021 18:46 UTC
19 points
6 comments4 min readLW link

[AN #134]: Un­der­speci­fi­ca­tion as a cause of frag­ility to dis­tri­bu­tion shift

Rohin Shah21 Jan 2021 18:10 UTC
13 points
0 comments7 min readLW link
(mailchi.mp)

Coun­ter­fac­tual con­trol incentives

Stuart_Armstrong21 Jan 2021 16:54 UTC
21 points
10 comments9 min readLW link

Policy re­stric­tions and Se­cret keep­ing AI

Donald Hobson24 Jan 2021 20:59 UTC
6 points
3 comments3 min readLW link

FC fi­nal: Can Fac­tored Cog­ni­tion schemes scale?

Rafael Harth24 Jan 2021 22:18 UTC
15 points
0 comments17 min readLW link

[AN #135]: Five prop­er­ties of goal-di­rected systems

Rohin Shah27 Jan 2021 18:10 UTC
33 points
0 comments8 min readLW link
(mailchi.mp)

AMA on EA Fo­rum: Ajeya Co­tra, re­searcher at Open Phil

Ajeya Cotra29 Jan 2021 23:05 UTC
23 points
0 comments1 min readLW link
(forum.effectivealtruism.org)

Play with neu­ral net

KatjaGrace30 Jan 2021 10:50 UTC
17 points
0 comments1 min readLW link
(worldspiritsockpuppet.com)

A Cri­tique of Non-Obstruction

Joe Collman3 Feb 2021 8:45 UTC
13 points
10 comments4 min readLW link

Dist­in­guish­ing claims about train­ing vs deployment

Richard_Ngo3 Feb 2021 11:30 UTC
61 points
30 comments9 min readLW link

Graph­i­cal World Models, Coun­ter­fac­tu­als, and Ma­chine Learn­ing Agents

Koen.Holtman17 Feb 2021 11:07 UTC
6 points
2 comments10 min readLW link

OpenAI: “Scal­ing Laws for Trans­fer”, Her­nan­dez et al.

Lukas Finnveden4 Feb 2021 12:49 UTC
13 points
3 comments1 min readLW link
(arxiv.org)

Evolu­tions Build­ing Evolu­tions: Lay­ers of Gen­er­ate and Test

plex5 Feb 2021 18:21 UTC
11 points
1 comment6 min readLW link

Episte­mol­ogy of HCH

adamShimi9 Feb 2021 11:46 UTC
16 points
2 comments10 min readLW link

[Question] Math­e­mat­i­cal Models of Progress?

abramdemski16 Feb 2021 0:21 UTC
28 points
8 comments2 min readLW link

[Question] Sugges­tions of posts on the AF to review

adamShimi16 Feb 2021 12:40 UTC
56 points
20 comments1 min readLW link

Disen­tan­gling Cor­rigi­bil­ity: 2015-2021

Koen.Holtman16 Feb 2021 18:01 UTC
17 points
20 comments9 min readLW link

Carte­sian frames as gen­er­al­ised models

Stuart_Armstrong16 Feb 2021 16:09 UTC
20 points
0 comments5 min readLW link

[AN #138]: Why AI gov­er­nance should find prob­lems rather than just solv­ing them

Rohin Shah17 Feb 2021 18:50 UTC
12 points
0 comments9 min readLW link
(mailchi.mp)

Safely con­trol­ling the AGI agent re­ward function

Koen.Holtman17 Feb 2021 14:47 UTC
7 points
0 comments5 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
41 points
10 comments86 min readLW link

Utility Max­i­miza­tion = De­scrip­tion Length Minimization

johnswentworth18 Feb 2021 18:04 UTC
183 points
40 comments5 min readLW link

Google’s Eth­i­cal AI team and AI Safety

magfrump20 Feb 2021 9:42 UTC
12 points
16 comments7 min readLW link

AI Safety Begin­ners Meetup (Euro­pean Time)

Linda Linsefors20 Feb 2021 13:20 UTC
8 points
2 comments1 min readLW link

Min­i­mal Map Constraints

Past Account21 Feb 2021 17:49 UTC
6 points
0 comments3 min readLW link

[AN #139]: How the sim­plic­ity of re­al­ity ex­plains the suc­cess of neu­ral nets

Rohin Shah24 Feb 2021 18:30 UTC
26 points
6 comments12 min readLW link
(mailchi.mp)

My Thoughts on the Ap­per­cep­tion Engine

J Bostock25 Feb 2021 19:43 UTC
4 points
1 comment3 min readLW link

The Case for Pri­vacy Optimism

bmgarfinkel10 Mar 2020 20:30 UTC
43 points
1 comment32 min readLW link
(benmgarfinkel.wordpress.com)

[Question] How might cryp­tocur­ren­cies af­fect AGI timelines?

Dawn Drescher28 Feb 2021 19:16 UTC
13 points
40 comments2 min readLW link

Fun with +12 OOMs of Compute

Daniel Kokotajlo1 Mar 2021 13:30 UTC
212 points
78 comments12 min readLW link1 review

Links for Feb 2021

ike1 Mar 2021 5:13 UTC
6 points
0 comments6 min readLW link
(misinfounderload.substack.com)

In­tro­duc­tion to Re­in­force­ment Learning

Dr. Birdbrain28 Feb 2021 23:03 UTC
4 points
1 comment3 min readLW link

Cu­ri­os­ity about Align­ing Values

esweet3 Mar 2021 0:22 UTC
3 points
7 comments1 min readLW link

How does bee learn­ing com­pare with ma­chine learn­ing?

eleni4 Mar 2021 1:59 UTC
62 points
15 comments24 min readLW link

Some re­cent in­ter­views with AI/​math lu­mi­nar­ies.

fowlertm4 Mar 2021 1:26 UTC
2 points
0 comments1 min readLW link

A Semitech­ni­cal In­tro­duc­tory Dialogue on Solomonoff Induction

Eliezer Yudkowsky4 Mar 2021 17:27 UTC
127 points
34 comments54 min readLW link

Con­nect­ing the good reg­u­la­tor the­o­rem with se­man­tics and sym­bol grounding

Stuart_Armstrong4 Mar 2021 14:35 UTC
11 points
0 comments2 min readLW link

[AN #140]: The­o­ret­i­cal mod­els that pre­dict scal­ing laws

Rohin Shah4 Mar 2021 18:10 UTC
45 points
0 comments10 min readLW link
(mailchi.mp)

Take­aways from the In­tel­li­gence Ris­ing RPG

5 Mar 2021 10:27 UTC
50 points
8 comments12 min readLW link

GPT-3 and the fu­ture of knowl­edge work

fowlertm5 Mar 2021 17:40 UTC
16 points
0 comments2 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya Cotra5 Mar 2021 22:29 UTC
187 points
74 comments38 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
136 points
13 comments26 min readLW link

[Question] What are the biggest cur­rent im­pacts of AI?

Sam Clarke7 Mar 2021 21:44 UTC
15 points
5 comments1 min readLW link

CLR’s re­cent work on multi-agent systems

JesseClifton9 Mar 2021 2:28 UTC
54 points
1 comment13 min readLW link

De-con­fus­ing my­self about Pas­cal’s Mug­ging and New­comb’s Problem

DirectedEvolution9 Mar 2021 20:45 UTC
7 points
1 comment3 min readLW link

Open Prob­lems with Myopia

10 Mar 2021 18:38 UTC
57 points
16 comments8 min readLW link

[AN #141]: The case for prac­tic­ing al­ign­ment work on GPT-3 and other large models

Rohin Shah10 Mar 2021 18:30 UTC
27 points
4 comments8 min readLW link
(mailchi.mp)

[Link] Whit­tle­stone et al., The So­cietal Im­pli­ca­tions of Deep Re­in­force­ment Learning

Aryeh Englander10 Mar 2021 18:13 UTC
11 points
1 comment1 min readLW link
(jair.org)

Four Mo­ti­va­tions for Learn­ing Normativity

abramdemski11 Mar 2021 20:13 UTC
42 points
7 comments5 min readLW link

[Question] What’s a good way to test ba­sic ma­chine learn­ing code?

Kenny11 Mar 2021 21:27 UTC
5 points
9 comments1 min readLW link

[Video] In­tel­li­gence and Stu­pidity: The Orthog­o­nal­ity Thesis

plex13 Mar 2021 0:32 UTC
5 points
1 comment1 min readLW link
(www.youtube.com)

AI x-risk re­duc­tion: why I chose academia over industry

David Scott Krueger (formerly: capybaralet)14 Mar 2021 17:25 UTC
56 points
14 comments3 min readLW link

[Question] Par­tial-Con­scious­ness as se­man­tic/​sym­bolic rep­re­sen­ta­tional lan­guage model trained on NN

Joe Kwon16 Mar 2021 18:51 UTC
2 points
3 comments1 min readLW link

[AN #142]: The quest to un­der­stand a net­work well enough to reim­ple­ment it by hand

Rohin Shah17 Mar 2021 17:10 UTC
34 points
4 comments8 min readLW link
(mailchi.mp)

In­ter­mit­tent Distil­la­tions #1

Mark Xu17 Mar 2021 5:15 UTC
25 points
1 comment10 min readLW link

HCH Spec­u­la­tion Post #2A

Charlie Steiner17 Mar 2021 13:26 UTC
42 points
7 comments9 min readLW link

The Age of Imag­i­na­tive Machines

Yuli_Ban18 Mar 2021 0:35 UTC
10 points
1 comment11 min readLW link

Gen­er­al­iz­ing POWER to multi-agent games

22 Mar 2021 2:41 UTC
52 points
17 comments7 min readLW link

My re­search methodology

paulfchristiano22 Mar 2021 21:20 UTC
148 points
36 comments16 min readLW link
(ai-alignment.com)

“In­fra-Bayesi­anism with Vanessa Kosoy” – Watch/​Dis­cuss Party

Ben Pace22 Mar 2021 23:44 UTC
27 points
45 comments1 min readLW link

Prefer­ences and bi­ases, the in­for­ma­tion argument

Stuart_Armstrong23 Mar 2021 12:44 UTC
14 points
5 comments1 min readLW link

[AN #143]: How to make em­bed­ded agents that rea­son prob­a­bil­is­ti­cally about their environments

Rohin Shah24 Mar 2021 17:20 UTC
13 points
3 comments8 min readLW link
(mailchi.mp)

Toy model of prefer­ence, bias, and ex­tra information

Stuart_Armstrong24 Mar 2021 10:14 UTC
9 points
0 comments4 min readLW link

On lan­guage mod­el­ing and fu­ture ab­stract rea­son­ing research

alexlyzhov25 Mar 2021 17:43 UTC
3 points
1 comment1 min readLW link
(docs.google.com)

In­framea­sures and Do­main Theory

Diffractor28 Mar 2021 9:19 UTC
27 points
3 comments33 min readLW link

In­fra-Do­main Proofs 2

Diffractor28 Mar 2021 9:15 UTC
13 points
0 comments21 min readLW link

In­fra-Do­main proofs 1

Diffractor28 Mar 2021 9:16 UTC
13 points
0 comments23 min readLW link

Sce­nar­ios and Warn­ing Signs for Ajeya’s Ag­gres­sive, Con­ser­va­tive, and Best Guess AI Timelines

Kevin Liu29 Mar 2021 1:38 UTC
25 points
1 comment9 min readLW link
(kliu.io)

[Question] How do we pre­pare for fi­nal crunch time?

Eli Tyre30 Mar 2021 5:47 UTC
116 points
30 comments8 min readLW link1 review

[Question] TAI?

Logan Zoellner30 Mar 2021 12:41 UTC
12 points
8 comments1 min readLW link

A use for Clas­si­cal AI—Ex­pert Systems

Glpusna31 Mar 2021 2:37 UTC
1 point
2 comments2 min readLW link

What Mul­tipo­lar Failure Looks Like, and Ro­bust Agent-Ag­nos­tic Pro­cesses (RAAPs)

Andrew_Critch31 Mar 2021 23:50 UTC
203 points
60 comments22 min readLW link

AI and the Prob­a­bil­ity of Conflict

tonyoconnor1 Apr 2021 7:00 UTC
8 points
10 comments8 min readLW link

“AI and Com­pute” trend isn’t pre­dic­tive of what is happening

alexlyzhov2 Apr 2021 0:44 UTC
133 points
15 comments1 min readLW link

[AN #144]: How lan­guage mod­els can also be fine­tuned for non-lan­guage tasks

Rohin Shah2 Apr 2021 17:20 UTC
19 points
0 comments6 min readLW link
(mailchi.mp)

2012 Robin Han­son com­ment on “In­tel­li­gence Ex­plo­sion: Ev­i­dence and Im­port”

Rob Bensinger2 Apr 2021 16:26 UTC
28 points
4 comments3 min readLW link

My take on Michael Littman on “The HCI of HAI”

Alex Flint2 Apr 2021 19:51 UTC
59 points
4 comments7 min readLW link

[Question] How do scal­ing laws work for fine-tun­ing?

Daniel Kokotajlo4 Apr 2021 12:18 UTC
24 points
10 comments1 min readLW link

Avert­ing suffer­ing with sen­tience throt­tlers (pro­posal)

Quinn5 Apr 2021 10:54 UTC
8 points
7 comments3 min readLW link

Reflec­tive Bayesianism

abramdemski6 Apr 2021 19:48 UTC
58 points
27 comments13 min readLW link

[Question] What will GPT-4 be in­ca­pable of?

Michaël Trazzi6 Apr 2021 19:57 UTC
34 points
32 comments1 min readLW link

I Trained a Neu­ral Net­work to Play Helltaker

lsusr7 Apr 2021 8:24 UTC
29 points
5 comments3 min readLW link

[AN #145]: Our three year an­niver­sary!

Rohin Shah9 Apr 2021 17:48 UTC
19 points
0 comments8 min readLW link
(mailchi.mp)

Align­ment Newslet­ter Three Year Retrospective

Rohin Shah7 Apr 2021 14:39 UTC
55 points
0 comments5 min readLW link

Which coun­ter­fac­tu­als should an AI fol­low?

Stuart_Armstrong7 Apr 2021 16:47 UTC
19 points
5 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
60 points
7 comments26 min readLW link

The Ja­panese Quiz: a Thought Ex­per­i­ment of Statis­ti­cal Epistemology

DanB8 Apr 2021 17:37 UTC
11 points
0 comments9 min readLW link

A pos­si­ble prefer­ence algorithm

Stuart_Armstrong8 Apr 2021 18:25 UTC
22 points
0 comments4 min readLW link

If you don’t de­sign for ex­trap­o­la­tion, you’ll ex­trap­o­late poorly—pos­si­bly fatally

Stuart_Armstrong8 Apr 2021 18:10 UTC
17 points
0 comments4 min readLW link

AXRP Epi­sode 6 - De­bate and Imi­ta­tive Gen­er­al­iza­tion with Beth Barnes

DanielFilan8 Apr 2021 21:20 UTC
24 points
3 comments59 min readLW link

My Cur­rent Take on Counterfactuals

abramdemski9 Apr 2021 17:51 UTC
53 points
57 comments25 min readLW link

Opinions on In­ter­pretable Ma­chine Learn­ing and 70 Sum­maries of Re­cent Papers

9 Apr 2021 19:19 UTC
139 points
16 comments102 min readLW link

Why un­rig­gable *al­most* im­plies uninfluenceable

Stuart_Armstrong9 Apr 2021 17:07 UTC
11 points
0 comments4 min readLW link

In­ter­mit­tent Distil­la­tions #2

Mark Xu14 Apr 2021 6:47 UTC
32 points
4 comments9 min readLW link

Test Cases for Im­pact Reg­u­lari­sa­tion Methods

DanielFilan6 Feb 2019 21:50 UTC
58 points
5 comments12 min readLW link
(danielfilan.com)

Su­per­ra­tional Agents Kelly Bet In­fluence!

abramdemski16 Apr 2021 22:08 UTC
41 points
5 comments5 min readLW link

Defin­ing “op­ti­mizer”

Chantiel17 Apr 2021 15:38 UTC
9 points
6 comments1 min readLW link

Alex Flint on “A soft­ware en­g­ineer’s per­spec­tive on log­i­cal in­duc­tion”

Raemon17 Apr 2021 6:56 UTC
21 points
8 comments1 min readLW link

[Question] Pa­ram­e­ter count of ML sys­tems through time?

Jsevillamol19 Apr 2021 12:54 UTC
31 points
4 comments1 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemski20 Apr 2021 22:18 UTC
80 points
22 comments9 min readLW link

Where are in­ten­tions to be found?

Alex Flint21 Apr 2021 0:51 UTC
44 points
12 comments9 min readLW link

[AN #147]: An overview of the in­ter­pretabil­ity landscape

Rohin Shah21 Apr 2021 17:10 UTC
14 points
2 comments7 min readLW link
(mailchi.mp)

NTK/​GP Models of Neu­ral Nets Can’t Learn Features

interstice22 Apr 2021 3:01 UTC
31 points
33 comments3 min readLW link

[Question] Is there any­thing that can stop AGI de­vel­op­ment in the near term?

Wulky Wilkinsen22 Apr 2021 20:37 UTC
5 points
5 comments1 min readLW link

Prob­a­bil­ity the­ory and log­i­cal in­duc­tion as lenses

Alex Flint23 Apr 2021 2:41 UTC
43 points
7 comments6 min readLW link

Nat­u­ral­ism and AI alignment

Michele Campolo24 Apr 2021 16:16 UTC
5 points
12 comments8 min readLW link

Mal­i­cious non-state ac­tors and AI safety

keti25 Apr 2021 3:21 UTC
2 points
13 comments2 min readLW link

An­nounc­ing the Align­ment Re­search Center

paulfchristiano26 Apr 2021 23:30 UTC
177 points
6 comments1 min readLW link
(ai-alignment.com)

[Linkpost] Treach­er­ous turns in the wild

Mark Xu26 Apr 2021 22:51 UTC
31 points
6 comments1 min readLW link
(lukemuehlhauser.com)

FAQ: Ad­vice for AI Align­ment Researchers

Rohin Shah26 Apr 2021 18:59 UTC
67 points
2 comments1 min readLW link
(rohinshah.com)

Pit­falls of the agent model

Alex Flint27 Apr 2021 22:19 UTC
19 points
4 comments20 min readLW link

[AN #148]: An­a­lyz­ing gen­er­al­iza­tion across more axes than just ac­cu­racy or loss

Rohin Shah28 Apr 2021 18:30 UTC
24 points
5 comments11 min readLW link
(mailchi.mp)

AMA: Paul Chris­ti­ano, al­ign­ment researcher

paulfchristiano28 Apr 2021 18:55 UTC
117 points
198 comments1 min readLW link

25 Min Talk on Me­taEth­i­cal.AI with Ques­tions from Stu­art Armstrong

June Ku29 Apr 2021 15:38 UTC
21 points
7 comments1 min readLW link

Low-stakes alignment

paulfchristiano30 Apr 2021 0:10 UTC
70 points
9 comments7 min readLW link1 review
(ai-alignment.com)

[Weekly Event] Align­ment Re­searcher Coffee Time (in Walled Gar­den)

adamShimi2 May 2021 12:59 UTC
37 points
0 comments1 min readLW link

Pars­ing Abram on Gra­da­tions of In­ner Align­ment Obstacles

Alex Flint4 May 2021 17:44 UTC
22 points
4 comments6 min readLW link

Mun­dane solu­tions to ex­otic problems

paulfchristiano4 May 2021 18:20 UTC
56 points
8 comments5 min readLW link
(ai-alignment.com)

April 15, 2040

Nisan4 May 2021 21:18 UTC
97 points
19 comments2 min readLW link

[AN #149]: The newslet­ter’s ed­i­to­rial policy

Rohin Shah5 May 2021 17:10 UTC
19 points
3 comments8 min readLW link
(mailchi.mp)

Pars­ing Chris Min­gard on Neu­ral Networks

Alex Flint6 May 2021 22:16 UTC
67 points
27 comments6 min readLW link

Life and ex­pand­ing steer­able consequences

Alex Flint7 May 2021 18:33 UTC
46 points
3 comments4 min readLW link

Do­main The­ory and the Pri­soner’s Dilemma: FairBot

Gurkenglas7 May 2021 7:33 UTC
14 points
5 comments2 min readLW link

Pre-Train­ing + Fine-Tun­ing Fa­vors Deception

Mark Xu8 May 2021 18:36 UTC
27 points
2 comments3 min readLW link

[Event] Weekly Align­ment Re­search Coffee Time (05/​10)

adamShimi9 May 2021 11:05 UTC
16 points
2 comments1 min readLW link

[Question] Is driv­ing worth the risk?

Adam Zerner11 May 2021 5:04 UTC
26 points
29 comments7 min readLW link

Yam­polskiy on AI Risk Skepticism

Gordon Seidoh Worley11 May 2021 14:50 UTC
15 points
5 comments1 min readLW link
(www.researchgate.net)

Hu­man pri­ors, fea­tures and mod­els, lan­guages, and Sol­monoff induction

Stuart_Armstrong10 May 2021 10:55 UTC
16 points
2 comments4 min readLW link

[AN #150]: The sub­types of Co­op­er­a­tive AI research

Rohin Shah12 May 2021 17:20 UTC
15 points
0 comments6 min readLW link
(mailchi.mp)

Un­der­stand­ing the Lot­tery Ticket Hy­poth­e­sis

Alex Flint14 May 2021 0:25 UTC
50 points
9 comments8 min readLW link

Con­cern­ing not get­ting lost

Alex Flint14 May 2021 19:38 UTC
50 points
9 comments4 min readLW link

[Event] Weekly Align­ment Re­search Coffee Time (05/​17)

adamShimi15 May 2021 22:07 UTC
7 points
0 comments1 min readLW link

Op­ti­miz­ers: To Define or not to Define

J Bostock16 May 2021 19:55 UTC
4 points
0 comments4 min readLW link

In­ter­mit­tent Distil­la­tions #3

Mark Xu15 May 2021 7:13 UTC
19 points
1 comment11 min readLW link

AXRP Epi­sode 7 - Side Effects with Vic­to­ria Krakovna

DanielFilan14 May 2021 3:50 UTC
34 points
6 comments43 min readLW link

Sav­ing Time

Scott Garrabrant18 May 2021 20:11 UTC
131 points
19 comments4 min readLW link

[Question] Are there any meth­ods for NNs or other ML sys­tems to get in­for­ma­tion from knock­out-like or as­say-like ex­per­i­ments?

J Bostock18 May 2021 21:33 UTC
2 points
1 comment1 min readLW link

SGD’s Bias

johnswentworth18 May 2021 23:19 UTC
60 points
16 comments3 min readLW link

This Sun­day, 12PM PT: Scott Garrabrant on “Finite Fac­tored Sets”

Raemon19 May 2021 1:48 UTC
33 points
4 comments1 min readLW link

[AN #151]: How spar­sity in the fi­nal layer makes a neu­ral net debuggable

Rohin Shah19 May 2021 17:20 UTC
19 points
0 comments6 min readLW link
(mailchi.mp)

The Vari­a­tional Char­ac­ter­i­za­tion of KL-Diver­gence, Er­ror Catas­tro­phes, and Generalization

Past Account20 May 2021 20:57 UTC
38 points
5 comments3 min readLW link

Or­a­cles, In­form­ers, and Controllers

ozziegooen25 May 2021 14:16 UTC
15 points
2 comments3 min readLW link

Knowl­edge is not just map/​ter­ri­tory resemblance

Alex Flint25 May 2021 17:58 UTC
28 points
4 comments3 min readLW link

MDP mod­els are de­ter­mined by the agent ar­chi­tec­ture and the en­vi­ron­men­tal dynamics

TurnTrout26 May 2021 0:14 UTC
23 points
34 comments3 min readLW link

[Question] List of good AI safety pro­ject ideas?

Aryeh Englander26 May 2021 22:36 UTC
24 points
8 comments1 min readLW link

AXRP Epi­sode 7.5 - Fore­cast­ing Trans­for­ma­tive AI from Biolog­i­cal An­chors with Ajeya Cotra

DanielFilan28 May 2021 0:20 UTC
24 points
1 comment67 min readLW link

Pre­dict re­sponses to the “ex­is­ten­tial risk from AI” survey

Rob Bensinger28 May 2021 1:32 UTC
44 points
6 comments2 min readLW link

Teach­ing ML to an­swer ques­tions hon­estly in­stead of pre­dict­ing hu­man answers

paulfchristiano28 May 2021 17:30 UTC
53 points
18 comments16 min readLW link
(ai-alignment.com)

The blue-min­imis­ing robot and model splintering

Stuart_Armstrong28 May 2021 15:09 UTC
13 points
4 comments3 min readLW link1 review

[Question] Use of GPT-3 for iden­ti­fy­ing Phish­ing and other email based at­tacks?

jmh29 May 2021 17:11 UTC
6 points
0 comments1 min readLW link

[Event] Weekly Align­ment Re­search Coffee Time

adamShimi29 May 2021 13:26 UTC
12 points
5 comments1 min readLW link

What is the most effec­tive way to donate to AGI XRisk miti­ga­tion?

JoshuaFox30 May 2021 11:08 UTC
44 points
11 comments1 min readLW link

“Ex­is­ten­tial risk from AI” sur­vey results

Rob Bensinger1 Jun 2021 20:02 UTC
56 points
8 comments11 min readLW link

April 2021 Gw­ern.net newsletter

gwern3 Jun 2021 15:13 UTC
20 points
0 comments1 min readLW link
(www.gwern.net)

The un­der­ly­ing model of a morphism

Stuart_Armstrong4 Jun 2021 22:29 UTC
10 points
0 comments5 min readLW link

We need a stan­dard set of com­mu­nity ad­vice for how to fi­nan­cially pre­pare for AGI

GeneSmith7 Jun 2021 7:24 UTC
50 points
53 comments5 min readLW link

Some AI Gover­nance Re­search Ideas

7 Jun 2021 14:40 UTC
29 points
2 comments2 min readLW link

Big pic­ture of pha­sic dopamine

Steven Byrnes8 Jun 2021 13:07 UTC
59 points
18 comments36 min readLW link

Bayeswatch 6: Mechwarrior

lsusr7 Jun 2021 20:20 UTC
47 points
8 comments2 min readLW link

Spec­u­la­tions against GPT-n writ­ing al­ign­ment papers

Donald Hobson7 Jun 2021 21:13 UTC
31 points
6 comments2 min readLW link

The re­verse Good­hart problem

Stuart_Armstrong8 Jun 2021 15:48 UTC
16 points
22 comments1 min readLW link

Against intelligence

George3d6 8 Jun 2021 13:03 UTC
12 points
17 comments10 min readLW link
(cerebralab.com)

Danger­ous op­ti­mi­sa­tion in­cludes var­i­ance minimisation

Stuart_Armstrong8 Jun 2021 11:34 UTC
32 points
5 comments2 min readLW link

Sur­vey on AI ex­is­ten­tial risk scenarios

8 Jun 2021 17:12 UTC
60 points
11 comments7 min readLW link

AXRP Epi­sode 8 - As­sis­tance Games with Dy­lan Had­field-Menell

DanielFilan8 Jun 2021 23:20 UTC
22 points
1 comment71 min readLW link

“De­ci­sion Trans­former” (Tool AIs are se­cret Agent AIs)

gwern9 Jun 2021 1:06 UTC
37 points
4 comments1 min readLW link
(sites.google.com)

Evan Hub­inger on Ho­mo­gene­ity in Take­off Speeds, Learned Op­ti­miza­tion and Interpretability

Michaël Trazzi8 Jun 2021 19:20 UTC
28 points
0 comments55 min readLW link

A naive al­ign­ment strat­egy and op­ti­mism about generalization

paulfchristiano10 Jun 2021 0:10 UTC
44 points
4 comments3 min readLW link
(ai-alignment.com)

Knowl­edge is not just mu­tual information

Alex Flint10 Jun 2021 1:01 UTC
27 points
6 comments4 min readLW link

The Ap­pren­tice Experiment

johnswentworth10 Jun 2021 3:29 UTC
148 points
11 comments4 min readLW link

[Question] ML is now au­tomat­ing parts of chip R&D. How big a deal is this?

Daniel Kokotajlo10 Jun 2021 9:51 UTC
45 points
17 comments1 min readLW link

Oh No My AI (Filk)

Gordon Seidoh Worley11 Jun 2021 15:05 UTC
42 points
7 comments1 min readLW link

May 2021 Gw­ern.net newsletter

gwern11 Jun 2021 14:13 UTC
31 points
0 comments1 min readLW link
(www.gwern.net)

[Question] What other prob­lems would a suc­cess­ful AI safety al­gorithm solve?

DirectedEvolution13 Jun 2021 21:07 UTC
12 points
4 comments1 min readLW link

Avoid­ing the in­stru­men­tal policy by hid­ing in­for­ma­tion about humans

paulfchristiano13 Jun 2021 20:00 UTC
31 points
2 comments2 min readLW link

An­swer­ing ques­tions hon­estly given world-model mismatches

paulfchristiano13 Jun 2021 18:00 UTC
34 points
2 comments16 min readLW link
(ai-alignment.com)

Vignettes Work­shop (AI Im­pacts)

Daniel Kokotajlo15 Jun 2021 12:05 UTC
47 points
3 comments1 min readLW link

Three Paths to Ex­is­ten­tial Risk from AI

harsimony16 Jun 2021 1:37 UTC
1 point
2 comments1 min readLW link
(harsimony.wordpress.com)

[AN #152]: How we’ve over­es­ti­mated few-shot learn­ing capabilities

Rohin Shah16 Jun 2021 17:20 UTC
22 points
6 comments8 min readLW link
(mailchi.mp)

AI-Based Code Gen­er­a­tion Us­ing GPT-J-6B

Tomás B.16 Jun 2021 15:05 UTC
21 points
15 comments1 min readLW link
(minimaxir.com)

In­suffi­cient Values

16 Jun 2021 14:33 UTC
29 points
15 comments5 min readLW link

[Question] Pros and cons of work­ing on near-term tech­ni­cal AI safety and assurance

Aryeh Englander17 Jun 2021 20:17 UTC
11 points
1 comment2 min readLW link

Non-poi­sonous cake: an­thropic up­dates are normal

Stuart_Armstrong18 Jun 2021 14:51 UTC
27 points
11 comments2 min readLW link

Knowl­edge is not just pre­cip­i­ta­tion of action

Alex Flint18 Jun 2021 23:26 UTC
21 points
6 comments7 min readLW link

I’m no longer sure that I buy dutch book ar­gu­ments and this makes me skep­ti­cal of the “util­ity func­tion” abstraction

Eli Tyre22 Jun 2021 3:53 UTC
45 points
29 comments4 min readLW link

Fre­quent ar­gu­ments about alignment

John Schulman23 Jun 2021 0:46 UTC
95 points
16 comments5 min readLW link

Em­piri­cal Ob­ser­va­tions of Ob­jec­tive Ro­bust­ness Failures

23 Jun 2021 23:23 UTC
63 points
5 comments9 min readLW link

[AN #153]: Ex­per­i­ments that demon­strate failures of ob­jec­tive robustness

Rohin Shah26 Jun 2021 17:10 UTC
25 points
1 comment8 min readLW link
(mailchi.mp)

An­throp­ics and Embed­ded Agency

dadadarren26 Jun 2021 1:45 UTC
7 points
2 comments2 min readLW link

Deep limi­ta­tions? Ex­am­in­ing ex­pert dis­agree­ment over deep learning

Richard_Ngo27 Jun 2021 0:55 UTC
17 points
5 comments1 min readLW link
(link.springer.com)

Finite Fac­tored Sets: LW tran­script with run­ning commentary

27 Jun 2021 16:02 UTC
30 points
0 comments51 min readLW link

Brute force search­ing for alignment

Donald Hobson27 Jun 2021 21:54 UTC
23 points
3 comments2 min readLW link

How teams went about their re­search at AI Safety Camp edi­tion 5

Remmelt28 Jun 2021 15:15 UTC
24 points
0 comments6 min readLW link

Search by abstraction

p.b.29 Jun 2021 20:56 UTC
4 points
0 comments1 min readLW link

[Question] Is there a “co­her­ent de­ci­sions im­ply con­sis­tent util­ities”-style ar­gu­ment for non-lex­i­co­graphic prefer­ences?

Tetraspace29 Jun 2021 19:14 UTC
3 points
20 comments1 min readLW link

Try­ing to ap­prox­i­mate Statis­ti­cal Models as Scor­ing Tables

Jsevillamol29 Jun 2021 17:20 UTC
18 points
2 comments9 min readLW link

Do in­co­her­ent en­tities have stronger rea­son to be­come more co­her­ent than less?

KatjaGrace30 Jun 2021 5:50 UTC
46 points
5 comments4 min readLW link
(worldspiritsockpuppet.com)

[AN #154]: What eco­nomic growth the­ory has to say about trans­for­ma­tive AI

Rohin Shah30 Jun 2021 17:20 UTC
12 points
0 comments9 min readLW link
(mailchi.mp)

Progress on Causal In­fluence Diagrams

tom4everitt30 Jun 2021 15:34 UTC
71 points
6 comments9 min readLW link

Could Ad­vanced AI Drive Ex­plo­sive Eco­nomic Growth?

Matthew Barnett30 Jun 2021 22:17 UTC
15 points
4 comments2 min readLW link
(www.openphilanthropy.org)

Ex­per­i­men­tally eval­u­at­ing whether hon­esty generalizes

paulfchristiano1 Jul 2021 17:47 UTC
99 points
23 comments9 min readLW link

Should VS Would and New­comb’s Paradox

dadadarren3 Jul 2021 23:45 UTC
5 points
36 comments2 min readLW link

Mauhn Re­leases AI Safety Documentation

Berg Severens3 Jul 2021 21:23 UTC
4 points
0 comments1 min readLW link

An­thropic Effects in Es­ti­mat­ing Evolu­tion Difficulty

Mark Xu5 Jul 2021 4:02 UTC
12 points
2 comments3 min readLW link

A sim­ple ex­am­ple of con­di­tional or­thog­o­nal­ity in finite fac­tored sets

DanielFilan6 Jul 2021 0:36 UTC
43 points
3 comments5 min readLW link
(danielfilan.com)

[Question] Is keep­ing AI “in the box” dur­ing train­ing enough?

tgb6 Jul 2021 15:17 UTC
7 points
10 comments1 min readLW link

A sec­ond ex­am­ple of con­di­tional or­thog­o­nal­ity in finite fac­tored sets

DanielFilan7 Jul 2021 1:40 UTC
46 points
0 comments2 min readLW link
(danielfilan.com)

Agency and the un­re­li­able au­tonomous car

Alex Flint7 Jul 2021 14:58 UTC
29 points
24 comments10 min readLW link

How much chess en­g­ine progress is about adapt­ing to big­ger com­put­ers?

paulfchristiano7 Jul 2021 22:35 UTC
114 points
23 comments6 min readLW link

BASALT: A Bench­mark for Learn­ing from Hu­man Feedback

Rohin Shah8 Jul 2021 17:40 UTC
56 points
20 comments2 min readLW link
(bair.berkeley.edu)

[AN #155]: A Minecraft bench­mark for al­gorithms that learn with­out re­ward functions

Rohin Shah8 Jul 2021 17:20 UTC
21 points
5 comments7 min readLW link
(mailchi.mp)

Look­ing for Col­lab­o­ra­tors for an AGI Re­search Project

Rafael Cosman8 Jul 2021 17:01 UTC
3 points
5 comments3 min readLW link

Jack­pot! An AI Vignette

bgold8 Jul 2021 20:32 UTC
13 points
0 comments2 min readLW link

In­ter­mit­tent Distil­la­tions #4: Semi­con­duc­tors, Eco­nomics, In­tel­li­gence, and Tech­nolog­i­cal Progress.

Mark Xu8 Jul 2021 22:14 UTC
81 points
9 comments10 min readLW link

Finite Fac­tored Sets: Con­di­tional Orthogonality

Scott Garrabrant9 Jul 2021 6:01 UTC
27 points
2 comments7 min readLW link

The ac­cu­mu­la­tion of knowl­edge: liter­a­ture review

Alex Flint10 Jul 2021 18:36 UTC
29 points
3 comments7 min readLW link

The in­escapa­bil­ity of knowledge

Alex Flint11 Jul 2021 22:59 UTC
28 points
17 comments5 min readLW link

[Link] Musk’s non-miss­ing mood

jimrandomh12 Jul 2021 22:09 UTC
70 points
21 comments1 min readLW link
(lukemuehlhauser.com)

[Question] What will the twen­ties look like if AGI is 30 years away?

Daniel Kokotajlo13 Jul 2021 8:14 UTC
29 points
18 comments1 min readLW link

An­swer­ing ques­tions hon­estly in­stead of pre­dict­ing hu­man an­swers: lots of prob­lems and some solutions

evhub13 Jul 2021 18:49 UTC
53 points
25 comments31 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
17 points
1 comment13 min readLW link

A closer look at chess scal­ings (into the past)

hippke15 Jul 2021 8:13 UTC
49 points
14 comments4 min readLW link

AlphaFold 2 pa­per re­leased: “Highly ac­cu­rate pro­tein struc­ture pre­dic­tion with AlphaFold”, Jumper et al 2021

gwern15 Jul 2021 19:27 UTC
39 points
10 comments1 min readLW link
(www.nature.com)

Bench­mark­ing an old chess en­g­ine on new hardware

hippke16 Jul 2021 7:58 UTC
71 points
3 comments5 min readLW link

[AN #156]: The scal­ing hy­poth­e­sis: a plan for build­ing AGI

Rohin Shah16 Jul 2021 17:10 UTC
44 points
20 comments8 min readLW link
(mailchi.mp)

Bayesi­anism ver­sus con­ser­vatism ver­sus Goodhart

Stuart_Armstrong16 Jul 2021 23:39 UTC
15 points
1 comment6 min readLW link

(2009) Shane Legg—Fund­ing safe AGI

Tomás B.17 Jul 2021 16:46 UTC
36 points
2 comments1 min readLW link
(www.vetta.org)

[Question] Equiv­a­lent of In­for­ma­tion The­ory but for Com­pu­ta­tion?

J Bostock17 Jul 2021 9:38 UTC
5 points
27 comments1 min readLW link

A Models-cen­tric Ap­proach to Cor­rigible Alignment

J Bostock17 Jul 2021 17:27 UTC
2 points
0 comments6 min readLW link

A model of de­ci­sion-mak­ing in the brain (the short ver­sion)

Steven Byrnes18 Jul 2021 14:39 UTC
20 points
0 comments3 min readLW link

[Question] Any tax­onomies of con­scious ex­pe­rience?

JohnDavidBustard18 Jul 2021 18:28 UTC
7 points
10 comments1 min readLW link

[Question] Work on Bayesian fit­ting of AI trends of perfor­mance?

Jsevillamol19 Jul 2021 18:45 UTC
3 points
0 comments1 min readLW link

Some thoughts on David Rood­man’s GWP model and its re­la­tion to AI timelines

Tom Davidson19 Jul 2021 22:59 UTC
30 points
1 comment8 min readLW link

In search of benev­olence (or: what should you get Clippy for Christ­mas?)

Joe Carlsmith20 Jul 2021 1:12 UTC
20 points
0 comments33 min readLW link

En­tropic bound­ary con­di­tions to­wards safe ar­tifi­cial superintelligence

Santiago Nunez-Corrales20 Jul 2021 22:15 UTC
3 points
0 comments2 min readLW link
(www.tandfonline.com)

Re­ward splin­ter­ing for AI design

Stuart_Armstrong21 Jul 2021 16:13 UTC
30 points
1 comment8 min readLW link

Re-Define In­tent Align­ment?

abramdemski22 Jul 2021 19:00 UTC
27 points
33 comments4 min readLW link

[AN #157]: Mea­sur­ing mis­al­ign­ment in the tech­nol­ogy un­der­ly­ing Copilot

Rohin Shah23 Jul 2021 17:20 UTC
28 points
18 comments7 min readLW link
(mailchi.mp)

Ex­am­ples of hu­man-level AI run­ning un­al­igned.

df fd23 Jul 2021 8:49 UTC
−3 points
0 comments2 min readLW link
(sortale.substack.com)

AXRP Epi­sode 10 - AI’s Fu­ture and Im­pacts with Katja Grace

DanielFilan23 Jul 2021 22:10 UTC
34 points
2 comments76 min readLW link

Wanted: Foom-scared al­ign­ment re­search partner

Icarus Gallagher26 Jul 2021 19:23 UTC
40 points
5 comments1 min readLW link

Re­fac­tor­ing Align­ment (at­tempt #2)

abramdemski26 Jul 2021 20:12 UTC
46 points
17 comments8 min readLW link

[Question] How much com­pute was used to train Deep­Mind’s gen­er­ally ca­pa­ble agents?

Daniel Kokotajlo29 Jul 2021 11:34 UTC
32 points
11 comments1 min readLW link

[Question] Did they or didn’t they learn tool use?

Daniel Kokotajlo29 Jul 2021 13:26 UTC
16 points
8 comments1 min readLW link

[AN #158]: Should we be op­ti­mistic about gen­er­al­iza­tion?

Rohin Shah29 Jul 2021 17:20 UTC
19 points
0 comments8 min readLW link
(mailchi.mp)

[Question] Very Un­nat­u­ral Tasks?

Orfeas31 Jul 2021 21:22 UTC
4 points
5 comments1 min readLW link

[Question] Is iter­ated am­plifi­ca­tion re­ally more pow­er­ful than imi­ta­tion?

Chantiel2 Aug 2021 23:20 UTC
5 points
0 comments2 min readLW link

What does GPT-3 un­der­stand? Sym­bol ground­ing and Chi­nese rooms

Stuart_Armstrong3 Aug 2021 13:14 UTC
40 points
15 comments12 min readLW link

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
57 points
10 comments47 min readLW link

Value load­ing in the hu­man brain: a worked example

Steven Byrnes4 Aug 2021 17:20 UTC
45 points
2 comments8 min readLW link

[AN #159]: Build­ing agents that know how to ex­per­i­ment, by train­ing on pro­ce­du­rally gen­er­ated games

Rohin Shah4 Aug 2021 17:10 UTC
18 points
4 comments14 min readLW link
(mailchi.mp)

[Question] How many pa­ram­e­ters do self-driv­ing-car neu­ral nets have?

Daniel Kokotajlo6 Aug 2021 11:24 UTC
9 points
3 comments1 min readLW link

Rage Against The MOOChine

Borasko7 Aug 2021 17:57 UTC
20 points
12 comments7 min readLW link

Ap­pli­ca­tions for De­con­fus­ing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC
36 points
3 comments5 min readLW link1 review

In­stru­men­tal Con­ver­gence: Power as Rademacher Complexity

Past Account12 Aug 2021 16:02 UTC
6 points
0 comments3 min readLW link

A new defi­ni­tion of “op­ti­mizer”

Chantiel9 Aug 2021 13:42 UTC
5 points
0 comments7 min readLW link

Goal-Direct­ed­ness and Be­hav­ior, Redux

adamShimi9 Aug 2021 14:26 UTC
14 points
4 comments2 min readLW link

Au­tomat­ing Au­dit­ing: An am­bi­tious con­crete tech­ni­cal re­search proposal

evhub11 Aug 2021 20:32 UTC
77 points
9 comments14 min readLW link1 review

Some crite­ria for sand­wich­ing projects

dmz12 Aug 2021 3:40 UTC
18 points
1 comment4 min readLW link

Power-seek­ing for suc­ces­sive choices

adamShimi12 Aug 2021 20:37 UTC
11 points
9 comments4 min readLW link

[AN #160]: Build­ing AIs that learn and think like people

Rohin Shah13 Aug 2021 17:10 UTC
28 points
6 comments10 min readLW link
(mailchi.mp)

[Question] How would the Scal­ing Hy­poth­e­sis change things?

Aryeh Englander13 Aug 2021 15:42 UTC
4 points
4 comments1 min readLW link

A re­view of “Agents and De­vices”

adamShimi13 Aug 2021 8:42 UTC
10 points
0 comments4 min readLW link

Ap­proaches to gra­di­ent hacking

adamShimi14 Aug 2021 15:16 UTC
16 points
8 comments8 min readLW link

[Question] What are some open ex­po­si­tion prob­lems in AI?

Sai Sasank Y16 Aug 2021 15:05 UTC
4 points
2 comments1 min readLW link

Think­ing about AI re­la­tion­ally

TekhneMakre16 Aug 2021 22:03 UTC
5 points
0 comments2 min readLW link

Finite Fac­tored Sets: Polyno­mi­als and Probability

Scott Garrabrant17 Aug 2021 21:53 UTC
21 points
2 comments8 min readLW link

How Deep­Mind’s Gen­er­ally Ca­pable Agents Were Trained

1a3orn20 Aug 2021 18:52 UTC
87 points
6 comments19 min readLW link

[AN #161]: Creat­ing gen­er­al­iz­able re­ward func­tions for mul­ti­ple tasks by learn­ing a model of func­tional similarity

Rohin Shah20 Aug 2021 17:20 UTC
15 points
0 comments9 min readLW link
(mailchi.mp)

Im­pli­ca­tion of AI timelines on plan­ning and solutions

JJ Hepburn21 Aug 2021 5:12 UTC
18 points
5 comments2 min readLW link

Au­tore­gres­sive Propaganda

lsusr22 Aug 2021 2:18 UTC
25 points
3 comments3 min readLW link

AI Risk for Epistemic Minimalists

Alex Flint22 Aug 2021 15:39 UTC
57 points
12 comments13 min readLW link1 review

The Codex Skep­tic FAQ

Michaël Trazzi24 Aug 2021 16:01 UTC
49 points
24 comments2 min readLW link

How to turn money into AI safety?

Charlie Steiner25 Aug 2021 10:49 UTC
66 points
26 comments8 min readLW link

In­tro­duc­tion to Re­duc­ing Goodhart

Charlie Steiner26 Aug 2021 18:38 UTC
40 points
10 comments4 min readLW link

Could you have stopped Ch­er­nobyl?

Carlos Ramirez27 Aug 2021 1:48 UTC
29 points
17 comments8 min readLW link

[AN #162]: Foun­da­tion mod­els: a paradigm shift within AI

Rohin Shah27 Aug 2021 17:20 UTC
21 points
0 comments8 min readLW link
(mailchi.mp)

A short in­tro­duc­tion to ma­chine learning

Richard_Ngo30 Aug 2021 14:31 UTC
67 points
0 comments8 min readLW link

[Question] What could small scale dis­asters from AI look like?

CharlesD31 Aug 2021 15:52 UTC
14 points
8 comments1 min readLW link

NIST AI Risk Man­age­ment Frame­work re­quest for in­for­ma­tion (RFI)

Aryeh Englander1 Sep 2021 0:15 UTC
15 points
0 comments2 min readLW link

Re­ward splin­ter­ing as re­verse of interpretability

Stuart_Armstrong31 Aug 2021 22:27 UTC
10 points
0 comments1 min readLW link

What are bi­ases, any­way? Mul­ti­ple type signatures

Stuart_Armstrong31 Aug 2021 21:16 UTC
11 points
0 comments3 min readLW link

Finite Fac­tored Sets: Applications

Scott Garrabrant31 Aug 2021 21:19 UTC
27 points
1 comment10 min readLW link

Finite Fac­tored Sets: In­fer­ring Time

Scott Garrabrant31 Aug 2021 21:18 UTC
17 points
5 comments4 min readLW link

US Mili­tary Global In­for­ma­tion Dom­i­nance Experiments

NunoSempere1 Sep 2021 13:34 UTC
25 points
0 comments4 min readLW link
(www.defense.gov)

Com­pe­tent Preferences

Charlie Steiner2 Sep 2021 14:26 UTC
27 points
2 comments6 min readLW link

For­mal­iz­ing Ob­jec­tions against Sur­ro­gate Goals

VojtaKovarik2 Sep 2021 16:24 UTC
5 points
23 comments20 min readLW link

[Question] Is there a name for the the­ory that “There will be fast take­off in real-world ca­pa­bil­ities be­cause al­most ev­ery­thing is AGI-com­plete”?

David Scott Krueger (formerly: capybaralet)2 Sep 2021 23:00 UTC
31 points
8 comments1 min readLW link

Thoughts on gra­di­ent hacking

Richard_Ngo3 Sep 2021 13:02 UTC
33 points
12 comments4 min readLW link

Why the tech­nolog­i­cal sin­gu­lar­ity by AGI may never happen

hippke3 Sep 2021 14:19 UTC
5 points
14 comments1 min readLW link

All Pos­si­ble Views About Hu­man­ity’s Fu­ture Are Wild

HoldenKarnofsky3 Sep 2021 20:19 UTC
140 points
40 comments8 min readLW link1 review

The Most Im­por­tant Cen­tury: Se­quence Introduction

HoldenKarnofsky3 Sep 2021 20:19 UTC
68 points
5 comments4 min readLW link1 review

[Question] Are there sub­stan­tial re­search efforts to­wards al­ign­ing nar­row AIs?

Rossin4 Sep 2021 18:40 UTC
11 points
4 comments2 min readLW link

Multi-Agent In­verse Re­in­force­ment Learn­ing: Subop­ti­mal De­mon­stra­tions and Alter­na­tive Solu­tion Concepts

sage_bergerson7 Sep 2021 16:11 UTC
5 points
0 comments1 min readLW link

Bayeswatch 7: Wildfire

lsusr8 Sep 2021 5:35 UTC
47 points
6 comments3 min readLW link

[AN #163]: Us­ing finite fac­tored sets for causal and tem­po­ral inference

Rohin Shah8 Sep 2021 17:20 UTC
38 points
0 comments10 min readLW link
(mailchi.mp)

Gra­di­ent de­scent is not just more effi­cient ge­netic algorithms

leogao8 Sep 2021 16:23 UTC
54 points
14 comments1 min readLW link

Sam Alt­man Q&A Notes—Aftermath

p.b.8 Sep 2021 8:20 UTC
45 points
35 comments2 min readLW link

[Question] Does blockchain tech­nol­ogy offer po­ten­tial solu­tions to some AI al­ign­ment prob­lems?

pilord9 Sep 2021 16:51 UTC
−4 points
8 comments2 min readLW link

Countably Fac­tored Spaces

Diffractor9 Sep 2021 4:24 UTC
47 points
3 comments18 min readLW link

The al­ign­ment prob­lem in differ­ent ca­pa­bil­ity regimes

Buck9 Sep 2021 19:46 UTC
87 points
12 comments5 min readLW link

GPT-X, DALL-E, and our Mul­ti­modal Fu­ture [video se­ries]

bakztfuture9 Sep 2021 23:05 UTC
0 points
1 comment1 min readLW link
(youtube.com)

Bayeswatch 8: Antimatter

lsusr10 Sep 2021 5:01 UTC
29 points
6 comments3 min readLW link

Mea­sure­ment, Op­ti­miza­tion, and Take-off Speed

jsteinhardt10 Sep 2021 19:30 UTC
47 points
4 comments13 min readLW link

Bayeswatch 9: Zombies

lsusr11 Sep 2021 5:57 UTC
41 points
15 comments3 min readLW link

[Question] Is MIRI’s read­ing list up to date?

Aryeh Englander11 Sep 2021 18:56 UTC
25 points
5 comments1 min readLW link

Soldiers, Scouts, and Al­ba­trosses.

Jan12 Sep 2021 10:36 UTC
5 points
0 comments1 min readLW link
(universalprior.substack.com)

GPT-Aug­mented Blogging

lsusr14 Sep 2021 11:55 UTC
52 points
18 comments13 min readLW link

[AN #164]: How well can lan­guage mod­els write code?

Rohin Shah15 Sep 2021 17:20 UTC
13 points
7 comments9 min readLW link
(mailchi.mp)

I wanted to in­ter­view Eliezer Yud­kowsky but he’s busy so I simu­lated him instead

lsusr16 Sep 2021 7:34 UTC
110 points
33 comments5 min readLW link

Eco­nomic AI Safety

jsteinhardt16 Sep 2021 20:50 UTC
35 points
3 comments5 min readLW link

Jit­ters No Ev­i­dence of Stu­pidity in RL

1a3orn16 Sep 2021 22:43 UTC
82 points
18 comments3 min readLW link

Im­mo­bile AI makes a move: anti-wire­head­ing, on­tol­ogy change, and model splintering

Stuart_Armstrong17 Sep 2021 15:24 UTC
32 points
3 comments2 min readLW link

Great Power Conflict

Zach Stein-Perlman17 Sep 2021 15:00 UTC
11 points
7 comments4 min readLW link

The the­ory-prac­tice gap

Buck17 Sep 2021 22:51 UTC
133 points
14 comments6 min readLW link

[Book Re­view] “The Align­ment Prob­lem” by Brian Christian

lsusr20 Sep 2021 6:36 UTC
70 points
16 comments6 min readLW link

AI, learn to be con­ser­va­tive, then learn to be less so: re­duc­ing side-effects, learn­ing pre­served fea­tures, and go­ing be­yond conservatism

Stuart_Armstrong20 Sep 2021 11:56 UTC
14 points
4 comments3 min readLW link

Sig­moids be­hav­ing badly: arXiv paper

Stuart_Armstrong20 Sep 2021 10:29 UTC
24 points
1 comment1 min readLW link

[Question] How much should you be will­ing to pay for an AGI?

Logan Zoellner20 Sep 2021 11:51 UTC
11 points
5 comments1 min readLW link

An­nounc­ing the Vi­talik Bu­terin Fel­low­ships in AI Ex­is­ten­tial Safety!

DanielFilan21 Sep 2021 0:33 UTC
64 points
2 comments1 min readLW link
(grants.futureoflife.org)

Red­wood Re­search’s cur­rent project

Buck21 Sep 2021 23:30 UTC
143 points
29 comments15 min readLW link

[Question] What are good mod­els of col­lu­sion in AI?

EconomicModel22 Sep 2021 15:16 UTC
7 points
1 comment1 min readLW link

[AN #165]: When large mod­els are more likely to lie

Rohin Shah22 Sep 2021 17:30 UTC
23 points
0 comments8 min readLW link
(mailchi.mp)

Neu­ral net /​ de­ci­sion tree hy­brids: a po­ten­tial path to­ward bridg­ing the in­ter­pretabil­ity gap

Nathan Helm-Burger23 Sep 2021 0:38 UTC
21 points
2 comments12 min readLW link

What is Com­pute? - Trans­for­ma­tive AI and Com­pute [1/​4]

lennart23 Sep 2021 16:25 UTC
24 points
8 comments19 min readLW link

Fore­cast­ing Trans­for­ma­tive AI, Part 1: What Kind of AI?

HoldenKarnofsky24 Sep 2021 0:46 UTC
17 points
17 comments9 min readLW link

Path­ways: Google’s AGI

Lê Nguyên Hoang25 Sep 2021 7:02 UTC
44 points
5 comments1 min readLW link

Cog­ni­tive Bi­ases in Large Lan­guage Models

Jan25 Sep 2021 20:59 UTC
17 points
3 comments12 min readLW link
(universalprior.substack.com)

Trans­for­ma­tive AI and Com­pute [Sum­mary]

lennart26 Sep 2021 11:41 UTC
13 points
0 comments9 min readLW link

Beyond fire alarms: free­ing the groupstruck

KatjaGrace26 Sep 2021 9:30 UTC
81 points
15 comments54 min readLW link
(worldspiritsockpuppet.com)

[Question] Any write­ups on GPT agency?

Ozyrus26 Sep 2021 22:55 UTC
4 points
6 comments1 min readLW link

AI take­off story: a con­tinu­a­tion of progress by other means

Edouard Harris27 Sep 2021 15:55 UTC
75 points
13 comments10 min readLW link

A Con­fused Chemist’s Re­view of AlphaFold 2

J Bostock27 Sep 2021 11:10 UTC
23 points
4 comments5 min readLW link

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam Clarke28 Sep 2021 16:55 UTC
20 points
10 comments1 min readLW link

Brain-in­spired AGI and the “life­time an­chor”

Steven Byrnes29 Sep 2021 13:09 UTC
64 points
16 comments13 min readLW link

[Question] What Heuris­tics Do You Use to Think About Align­ment Topics?

Logan Riggs29 Sep 2021 2:31 UTC
5 points
3 comments1 min readLW link

Bayeswatch 10: Spyware

lsusr29 Sep 2021 7:01 UTC
97 points
7 comments4 min readLW link

Un­solved ML Safety Problems

jsteinhardt29 Sep 2021 16:00 UTC
58 points
2 comments3 min readLW link
(bounded-regret.ghost.io)

Some Ex­ist­ing Selec­tion Theorems

johnswentworth30 Sep 2021 16:13 UTC
48 points
2 comments4 min readLW link

Fore­cast­ing Com­pute—Trans­for­ma­tive AI and Com­pute [2/​4]

lennart2 Oct 2021 15:54 UTC
17 points
0 comments19 min readLW link

Nu­clear Es­pi­onage and AI Governance

Guive4 Oct 2021 23:04 UTC
26 points
5 comments24 min readLW link

Model­ling and Un­der­stand­ing SGD

J Bostock5 Oct 2021 13:41 UTC
8 points
0 comments3 min readLW link

Force neu­ral nets to use mod­els, then de­tect these

Stuart_Armstrong5 Oct 2021 11:31 UTC
17 points
8 comments2 min readLW link

[Question] Is GPT-3 already sam­ple-effi­cient?

Daniel Kokotajlo6 Oct 2021 13:38 UTC
36 points
32 comments1 min readLW link

Prefer­ences from (real and hy­po­thet­i­cal) psy­chol­ogy papers

Stuart_Armstrong6 Oct 2021 9:06 UTC
15 points
0 comments2 min readLW link

Au­to­mated Fact Check­ing: A Look at the Field

Hoagy6 Oct 2021 23:52 UTC
12 points
0 comments8 min readLW link

Safety-ca­pa­bil­ities trade­off di­als are in­evitable in AGI

Steven Byrnes7 Oct 2021 19:03 UTC
57 points
4 comments3 min readLW link

Bayeswatch 11: Parabellum

lsusr9 Oct 2021 7:08 UTC
32 points
12 comments2 min readLW link

Steel­man ar­gu­ments against the idea that AGI is in­evitable and will ar­rive soon

RomanS9 Oct 2021 6:22 UTC
19 points
13 comments4 min readLW link

In­tel­li­gence or Evolu­tion?

Ramana Kumar9 Oct 2021 17:14 UTC
50 points
15 comments3 min readLW link

Bayeswatch 12: The Sin­gu­lar­ity War

lsusr10 Oct 2021 1:04 UTC
32 points
6 comments2 min readLW link

The Ex­trap­o­la­tion Problem

lsusr10 Oct 2021 5:11 UTC
25 points
8 comments2 min readLW link

The eval­u­a­tion func­tion of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC
13 points
5 comments3 min readLW link

On Solv­ing Prob­lems Be­fore They Ap­pear: The Weird Episte­molo­gies of Alignment

adamShimi11 Oct 2021 8:20 UTC
97 points
11 comments15 min readLW link

Bayeswatch 13: Spaceship

lsusr12 Oct 2021 21:35 UTC
51 points
4 comments1 min readLW link

Com­pute Gover­nance and Con­clu­sions—Trans­for­ma­tive AI and Com­pute [3/​4]

lennart14 Oct 2021 8:23 UTC
13 points
0 comments5 min readLW link

Clas­si­cal sym­bol ground­ing and causal graphs

Stuart_Armstrong14 Oct 2021 18:04 UTC
22 points
2 comments5 min readLW link

NLP Po­si­tion Paper: When Com­bat­ting Hype, Pro­ceed with Caution

Sam Bowman15 Oct 2021 20:57 UTC
46 points
15 comments1 min readLW link

[Question] Memetic haz­ards of AGI ar­chi­tec­ture posts

Ozyrus16 Oct 2021 16:10 UTC
9 points
12 comments1 min readLW link

[Pre­dic­tion] We are in an Al­gorith­mic Over­hang, Part 2

lsusr17 Oct 2021 7:48 UTC
20 points
29 comments2 min readLW link

Epistemic Strate­gies of Selec­tion Theorems

adamShimi18 Oct 2021 8:57 UTC
32 points
1 comment12 min readLW link

On The Risks of Emer­gent Be­hav­ior in Foun­da­tion Models

jsteinhardt18 Oct 2021 20:00 UTC
30 points
0 comments3 min readLW link
(bounded-regret.ghost.io)

Beyond the hu­man train­ing dis­tri­bu­tion: would the AI CEO cre­ate al­most-ille­gal ted­dies?

Stuart_Armstrong18 Oct 2021 21:10 UTC
36 points
2 comments3 min readLW link

[AN #167]: Con­crete ML safety prob­lems and their rele­vance to x-risk

Rohin Shah20 Oct 2021 17:10 UTC
19 points
4 comments9 min readLW link
(mailchi.mp)

Bor­ing ma­chine learn­ing is where it’s at

George3d6 20 Oct 2021 11:23 UTC
28 points
16 comments3 min readLW link
(cerebralab.com)

AGI Safety Fun­da­men­tals cur­ricu­lum and application

Richard_Ngo20 Oct 2021 21:44 UTC
67 points
0 comments8 min readLW link
(docs.google.com)

Epistemic Strate­gies of Safety-Ca­pa­bil­ities Tradeoffs

adamShimi22 Oct 2021 8:22 UTC
5 points
0 comments6 min readLW link

Gen­eral al­ign­ment plus hu­man val­ues, or al­ign­ment via hu­man val­ues?

Stuart_Armstrong22 Oct 2021 10:11 UTC
45 points
27 comments3 min readLW link

Naive self-su­per­vised ap­proaches to truth­ful AI

ryan_greenblatt23 Oct 2021 13:03 UTC
9 points
4 comments2 min readLW link

My ML Scal­ing bibliography

gwern23 Oct 2021 14:41 UTC
35 points
9 comments1 min readLW link
(www.gwern.net)

Selfish­ness, prefer­ence falsifi­ca­tion, and AI alignment

jessicata28 Oct 2021 0:16 UTC
52 points
29 comments13 min readLW link
(unstableontology.com)

[AN #168]: Four tech­ni­cal top­ics for which Open Phil is so­lic­it­ing grant proposals

Rohin Shah28 Oct 2021 17:20 UTC
15 points
0 comments9 min readLW link
(mailchi.mp)

Fore­cast­ing progress in lan­guage models

28 Oct 2021 20:40 UTC
54 points
5 comments11 min readLW link
(www.metaculus.com)

Re­quest for pro­pos­als for pro­jects in AI al­ign­ment that work with deep learn­ing systems

29 Oct 2021 7:26 UTC
87 points
0 comments5 min readLW link

Interpretability

29 Oct 2021 7:28 UTC
59 points
13 comments12 min readLW link

Truth­ful and hon­est AI

29 Oct 2021 7:28 UTC
41 points
1 comment13 min readLW link

Mea­sur­ing and fore­cast­ing risks

29 Oct 2021 7:27 UTC
20 points
0 comments12 min readLW link

Tech­niques for en­hanc­ing hu­man feedback

29 Oct 2021 7:27 UTC
22 points
0 comments2 min readLW link

Stu­art Rus­sell and Me­lanie Mitchell on Munk Debates

Alex Flint29 Oct 2021 19:13 UTC
29 points
3 comments3 min readLW link

True Sto­ries of Al­gorith­mic Improvement

johnswentworth29 Oct 2021 20:57 UTC
91 points
7 comments5 min readLW link

Must true AI sleep?

YimbyGeorge30 Oct 2021 16:47 UTC
0 points
1 comment1 min readLW link

Nate Soares on the Ul­ti­mate New­comb’s Problem

Rob Bensinger31 Oct 2021 19:42 UTC
56 points
20 comments1 min readLW link

Models Model­ing Models

Charlie Steiner2 Nov 2021 7:08 UTC
20 points
5 comments10 min readLW link

[Question] What’s the differ­ence be­tween newer Atari-play­ing AI and the older Deep­mind one (from 2014)?

Raemon2 Nov 2021 23:36 UTC
27 points
8 comments1 min readLW link

Ap­ply to the ML for Align­ment Boot­camp (MLAB) in Berkeley [Jan 3 - Jan 22]

3 Nov 2021 18:22 UTC
95 points
4 comments1 min readLW link

[Ex­ter­nal Event] 2022 IEEE In­ter­na­tional Con­fer­ence on As­sured Au­ton­omy (ICAA) - sub­mis­sion dead­line extended

Aryeh Englander5 Nov 2021 15:29 UTC
13 points
0 comments3 min readLW link

Y2K: Suc­cess­ful Prac­tice for AI Alignment

Darmani5 Nov 2021 6:09 UTC
47 points
5 comments6 min readLW link

Some Re­marks on Reg­u­la­tor The­o­rems No One Asked For

Past Account5 Nov 2021 19:33 UTC
19 points
1 comment4 min readLW link

How should we com­pare neu­ral net­work rep­re­sen­ta­tions?

jsteinhardt5 Nov 2021 22:10 UTC
24 points
0 comments3 min readLW link
(bounded-regret.ghost.io)

Drug ad­dicts and de­cep­tively al­igned agents—a com­par­a­tive analysis

Jan5 Nov 2021 21:42 UTC
41 points
2 comments12 min readLW link
(universalprior.substack.com)

Com­ments on OpenPhil’s In­ter­pretabil­ity RFP

paulfchristiano5 Nov 2021 22:36 UTC
84 points
5 comments7 min readLW link

How do we be­come con­fi­dent in the safety of a ma­chine learn­ing sys­tem?

evhub8 Nov 2021 22:49 UTC
92 points
2 comments32 min readLW link

[Question] What ex­actly is GPT-3's base ob­jec­tive?

Daniel Kokotajlo10 Nov 2021 0:57 UTC
60 points
15 comments2 min readLW link

Re­lax­ation-Based Search, From Every­day Life To Un­fa­mil­iar Territory

johnswentworth10 Nov 2021 21:47 UTC
57 points
3 comments8 min readLW link

Us­ing blin­ders to help you see things for what they are

Adam Zerner11 Nov 2021 7:07 UTC
13 points
2 comments2 min readLW link

AGI is at least as far away as Nu­clear Fu­sion.

Logan Zoellner11 Nov 2021 21:33 UTC
0 points
8 comments1 min readLW link

Mea­sur­ing and Fore­cast­ing Risks from AI

jsteinhardt12 Nov 2021 2:30 UTC
24 points
0 comments3 min readLW link
(bounded-regret.ghost.io)

Why I’m ex­cited about Red­wood Re­search’s cur­rent project

paulfchristiano12 Nov 2021 19:26 UTC
112 points
6 comments7 min readLW link

A Defense of Func­tional De­ci­sion Theory

Heighn12 Nov 2021 20:59 UTC
21 points
120 comments10 min readLW link

Com­ments on Car­l­smith’s “Is power-seek­ing AI an ex­is­ten­tial risk?”

So8res13 Nov 2021 4:29 UTC
137 points
13 comments40 min readLW link

[Question] What’s the like­li­hood of only sub ex­po­nen­tial growth for AGI?

M. Y. Zuo13 Nov 2021 22:46 UTC
5 points
22 comments1 min readLW link

My cur­rent un­cer­tain­ties re­gard­ing AI, al­ign­ment, and the end of the world

dominicq14 Nov 2021 14:08 UTC
2 points
3 comments2 min readLW link

My un­der­stand­ing of the al­ign­ment problem

danieldewey15 Nov 2021 18:13 UTC
43 points
3 comments3 min readLW link

“Sum­ma­riz­ing Books with Hu­man Feed­back” (re­cur­sive GPT-3)

gwern15 Nov 2021 17:41 UTC
24 points
4 comments1 min readLW link
(openai.com)

Quan­tilizer ≡ Op­ti­mizer with a Bounded Amount of Output

itaibn0 16 Nov 2021 1:03 UTC
10 points
4 comments2 min readLW link

Two Stupid AI Align­ment Ideas

aphyer16 Nov 2021 16:13 UTC
24 points
3 comments4 min readLW link

[Question] What are the mu­tual benefits of AGI-hu­man col­lab­o­ra­tion that would oth­er­wise be un­ob­tain­able?

M. Y. Zuo17 Nov 2021 3:09 UTC
1 point
4 comments1 min readLW link

Ap­pli­ca­tions for AI Safety Camp 2022 Now Open!

adamShimi17 Nov 2021 21:42 UTC
47 points
3 comments1 min readLW link

Ngo and Yud­kowsky on AI ca­pa­bil­ity gains

18 Nov 2021 22:19 UTC
129 points
61 comments39 min readLW link

“Ac­qui­si­tion of Chess Knowl­edge in AlphaZero”: prob­ing AZ over time

jsd18 Nov 2021 23:24 UTC
11 points
9 comments1 min readLW link
(arxiv.org)

How To Get Into In­de­pen­dent Re­search On Align­ment/​Agency

johnswentworth19 Nov 2021 0:00 UTC
314 points
33 comments13 min readLW link

Good­hart: Endgame

Charlie Steiner19 Nov 2021 1:26 UTC
23 points
3 comments8 min readLW link

More de­tailed pro­posal for mea­sur­ing al­ign­ment of cur­rent models

Beth Barnes20 Nov 2021 0:03 UTC
31 points
0 comments8 min readLW link

From lan­guage to ethics by au­to­mated reasoning

Michele Campolo21 Nov 2021 15:16 UTC
4 points
4 comments6 min readLW link

Mo­rally un­der­defined situ­a­tions can be deadly

Stuart_Armstrong22 Nov 2021 14:48 UTC
17 points
8 comments2 min readLW link

Yud­kowsky and Chris­ti­ano dis­cuss “Take­off Speeds”

Eliezer Yudkowsky22 Nov 2021 19:35 UTC
191 points
181 comments60 min readLW link1 review

Po­ten­tial Align­ment men­tal tool: Keep­ing track of the types

Donald Hobson22 Nov 2021 20:05 UTC
28 points
1 comment2 min readLW link

For­mal­iz­ing Policy-Mod­ifi­ca­tion Corrigibility

TurnTrout3 Dec 2021 1:31 UTC
23 points
6 comments6 min readLW link

[AN #169]: Col­lab­o­rat­ing with hu­mans with­out hu­man data

Rohin Shah24 Nov 2021 18:30 UTC
33 points
0 comments8 min readLW link
(mailchi.mp)

Chris­ti­ano, Co­tra, and Yud­kowsky on AI progress

25 Nov 2021 16:45 UTC
117 points
95 comments68 min readLW link

Lat­a­cora might be of in­ter­est to some AI Safety organizations

NunoSempere25 Nov 2021 23:57 UTC
14 points
10 comments1 min readLW link
(www.latacora.com)

Solve Cor­rigi­bil­ity Week

Logan Riggs28 Nov 2021 17:00 UTC
39 points
21 comments1 min readLW link

TTS au­dio of “Ngo and Yud­kowsky on al­ign­ment difficulty”

Quintin Pope28 Nov 2021 18:11 UTC
4 points
3 comments1 min readLW link

Red­wood Re­search is hiring for sev­eral roles

29 Nov 2021 0:16 UTC
44 points
0 comments1 min readLW link

Com­pute Re­search Ques­tions and Met­rics—Trans­for­ma­tive AI and Com­pute [4/​4]

lennart28 Nov 2021 22:49 UTC
6 points
0 comments16 min readLW link

Com­ments on Allan Dafoe on AI Governance

Alex Flint29 Nov 2021 16:16 UTC
13 points
0 comments7 min readLW link

Soares, Tal­linn, and Yud­kowsky dis­cuss AGI cognition

29 Nov 2021 19:26 UTC
118 points
35 comments40 min readLW link

Self-study­ing to de­velop an in­side-view model of AI al­ign­ment; co-studiers wel­come!

Vael Gates30 Nov 2021 9:25 UTC
13 points
0 comments4 min readLW link

Ma­chine Agents, Hy­brid Su­per­in­tel­li­gences, and The Loss of Hu­man Con­trol (Chap­ter 1)

Justin Bullock30 Nov 2021 17:35 UTC
4 points
0 comments8 min readLW link

AXRP Epi­sode 12 - AI Ex­is­ten­tial Risk with Paul Christiano

DanielFilan2 Dec 2021 2:20 UTC
36 points
0 comments125 min readLW link

Mo­ral­ity is Scary

Wei Dai2 Dec 2021 6:35 UTC
175 points
125 comments4 min readLW link

Syd­ney AI Safety Fellowship

Chris_Leong2 Dec 2021 7:34 UTC
22 points
0 comments2 min readLW link

$100/​$50 re­wards for good references

Stuart_Armstrong3 Dec 2021 16:55 UTC
20 points
5 comments1 min readLW link

[Question] Does the Struc­ture of an al­gorithm mat­ter for AI Risk and/​or con­scious­ness?

Logan Zoellner3 Dec 2021 18:31 UTC
7 points
5 comments1 min readLW link

[Linkpost] A Gen­eral Lan­guage As­sis­tant as a Lab­o­ra­tory for Alignment

Quintin Pope3 Dec 2021 19:42 UTC
37 points
2 comments2 min readLW link

Agency: What it is and why it matters

Daniel Kokotajlo4 Dec 2021 21:32 UTC
25 points
2 comments2 min readLW link

[Question] Are limited-hori­zon agents a good heuris­tic for the off-switch prob­lem?

Yonadav Shavit5 Dec 2021 19:27 UTC
5 points
19 comments1 min readLW link

In­tro­duc­tion to in­ac­cessible information

Ryan Kidd9 Dec 2021 1:28 UTC
27 points
6 comments8 min readLW link

More Chris­ti­ano, Co­tra, and Yud­kowsky on AI progress

6 Dec 2021 20:33 UTC
85 points
30 comments40 min readLW link

Ex­ter­mi­nat­ing hu­mans might be on the to-do list of a Friendly AI

RomanS7 Dec 2021 14:15 UTC
5 points
8 comments2 min readLW link

In­ter­views on Im­prov­ing the AI Safety Pipeline

Chris_Leong7 Dec 2021 12:03 UTC
55 points
16 comments17 min readLW link

Let’s buy out Cyc, for use in AGI in­ter­pretabil­ity sys­tems?

Steven Byrnes7 Dec 2021 20:46 UTC
47 points
10 comments2 min readLW link

[AN #170]: An­a­lyz­ing the ar­gu­ment for risk from power-seek­ing AI

Rohin Shah8 Dec 2021 18:10 UTC
21 points
1 comment7 min readLW link
(mailchi.mp)

[MLSN #2]: Ad­ver­sar­ial Training

Dan_H9 Dec 2021 17:16 UTC
26 points
0 comments3 min readLW link

Su­per­vised learn­ing and self-mod­el­ing: What’s “su­per­hu­man?”

Charlie Steiner9 Dec 2021 12:44 UTC
12 points
1 comment8 min readLW link

Some ab­stract, non-tech­ni­cal rea­sons to be non-max­i­mally-pes­simistic about AI alignment

Rob Bensinger12 Dec 2021 2:08 UTC
66 points
37 comments7 min readLW link

Trans­form­ing my­opic op­ti­miza­tion to or­di­nary op­ti­miza­tion—Do we want to seek con­ver­gence for my­opic op­ti­miza­tion prob­lems?

tailcalled11 Dec 2021 20:38 UTC
12 points
1 comment5 min readLW link

Redwood’s Technique-Focused Epistemic Strategy

adamShimi12 Dec 2021 16:36 UTC
48 points
1 comment7 min readLW link

[Question] [Resolved] Who else prefers “AI alignment” to “AI safety?”

Evan_Gaensbauer13 Dec 2021 0:35 UTC
5 points
8 comments1 min readLW link

Hard-Coding Neural Computation

MadHatter13 Dec 2021 4:35 UTC
32 points
8 comments27 min readLW link

Solving Interpretability Week

Logan Riggs13 Dec 2021 17:09 UTC
11 points
5 comments1 min readLW link

Understanding and controlling auto-induced distributional shift

L Rudolf L13 Dec 2021 14:59 UTC
26 points
3 comments16 min readLW link

Language Model Alignment Research Internships

Ethan Perez13 Dec 2021 19:53 UTC
68 points
1 comment1 min readLW link

Enabling More Feedback for AI Safety Researchers

frances_lorenz13 Dec 2021 20:10 UTC
17 points
0 comments3 min readLW link

ARC’s first technical report: Eliciting Latent Knowledge

14 Dec 2021 20:09 UTC
212 points
88 comments1 min readLW link
(docs.google.com)

Interlude: Agents as Automobiles

Daniel Kokotajlo14 Dec 2021 18:49 UTC
25 points
6 comments5 min readLW link

ARC is hiring!

14 Dec 2021 20:09 UTC
62 points
2 comments1 min readLW link

Ngo’s view on alignment difficulty

14 Dec 2021 21:34 UTC
63 points
7 comments17 min readLW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC
30 points
8 comments19 min readLW link

Elicitation for Modeling Transformative AI Risks

Davidmanheim16 Dec 2021 15:24 UTC
30 points
2 comments9 min readLW link

Some motivations to gradient hack

peterbarnett17 Dec 2021 3:06 UTC
8 points
0 comments6 min readLW link

Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship

adamShimi18 Dec 2021 15:23 UTC
51 points
4 comments10 min readLW link

[Question] Important ML systems from before 2012?

Jsevillamol18 Dec 2021 12:12 UTC
12 points
5 comments1 min readLW link

[Extended Deadline: Jan 23rd] Announcing the PIBBSS Summer Research Fellowship

Nora_Ammann18 Dec 2021 16:56 UTC
6 points
1 comment1 min readLW link

Exploring Decision Theories With Counterfactuals and Dynamic Agent Self-Pointers

JoshuaOSHickman18 Dec 2021 21:50 UTC
2 points
0 comments4 min readLW link

Don’t Influence the Influencers!

lhc19 Dec 2021 9:02 UTC
14 points
2 comments10 min readLW link

SGD Understood through Probability Current

J Bostock19 Dec 2021 23:26 UTC
23 points
1 comment5 min readLW link

Worst-case thinking in AI alignment

Buck23 Dec 2021 1:29 UTC
139 points
15 comments6 min readLW link

2021 AI Alignment Literature Review and Charity Comparison

Larks23 Dec 2021 14:06 UTC
164 points
26 comments73 min readLW link

Reply to Eliezer on Biological Anchors

HoldenKarnofsky23 Dec 2021 16:15 UTC
146 points
46 comments15 min readLW link

Risks from AI persuasion

Beth Barnes24 Dec 2021 1:48 UTC
68 points
15 comments31 min readLW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum24 Dec 2021 18:05 UTC
16 points
2 comments3 min readLW link

Mechanistic Interpretability for the MLP Layers (rough early thoughts)

MadHatter24 Dec 2021 7:24 UTC
11 points
2 comments1 min readLW link
(www.youtube.com)

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
50 points
4 comments28 min readLW link

Reinforcement Learning Study Group

Kay Kozaronek26 Dec 2021 23:11 UTC
20 points
9 comments1 min readLW link

AI Fire Alarm Scenarios

PeterMcCluskey28 Dec 2021 2:20 UTC
10 points
0 comments6 min readLW link
(www.bayesianinvestor.com)

Reverse-engineering using interpretability

Beth Barnes29 Dec 2021 23:21 UTC
21 points
1 comment5 min readLW link

Counterexamples to some ELK proposals

paulfchristiano31 Dec 2021 17:05 UTC
50 points
10 comments7 min readLW link

We Choose To Align AI

johnswentworth1 Jan 2022 20:06 UTC
259 points
15 comments3 min readLW link

Why don’t we just, like, try and build safe AGI?

Sun1 Jan 2022 23:24 UTC
0 points
4 comments1 min readLW link

[Question] Tag for AI alignment?

Alex_Altair2 Jan 2022 18:55 UTC
7 points
6 comments1 min readLW link

How an alien theory of mind might be unlearnable

Stuart_Armstrong3 Jan 2022 11:16 UTC
26 points
35 comments5 min readLW link

Shadows Of The Coming Race (1879)

Capybasilisk3 Jan 2022 15:55 UTC
49 points
4 comments7 min readLW link

Apply for research internships at ARC!

paulfchristiano3 Jan 2022 20:26 UTC
61 points
0 comments1 min readLW link

Promising posts on AF that have fallen through the cracks

Evan R. Murphy4 Jan 2022 15:39 UTC
33 points
6 comments2 min readLW link

You can’t understand human agency without understanding amoeba agency

Shmi6 Jan 2022 4:42 UTC
19 points
36 comments1 min readLW link

Satisf-AI: A Route to Reducing Risks From AI

harsimony6 Jan 2022 2:34 UTC
4 points
1 comment4 min readLW link
(harsimony.wordpress.com)

Importance of foresight evaluations within ELK

Jonathan Uesato6 Jan 2022 15:34 UTC
25 points
1 comment10 min readLW link

Goal-directedness: my baseline beliefs

Morgan_Rogers8 Jan 2022 13:09 UTC
21 points
3 comments3 min readLW link

The Unreasonable Feasibility Of Playing Chess Under The Influence

Jan12 Jan 2022 23:09 UTC
29 points
17 comments13 min readLW link
(universalprior.substack.com)

New year, new research agenda post

Charlie Steiner12 Jan 2022 17:58 UTC
29 points
4 comments16 min readLW link

Value extrapolation partially resolves symbol grounding

Stuart_Armstrong12 Jan 2022 16:30 UTC
24 points
10 comments1 min readLW link

2020 Review Article

Vaniver14 Jan 2022 4:58 UTC
74 points
3 comments7 min readLW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan14 Jan 2022 11:58 UTC
33 points
10 comments14 min readLW link
(universalprior.substack.com)

PIBBSS Fellowship: Bounty for Referrals & Deadline Extension

Anna Gajdova17 Jan 2022 16:23 UTC
7 points
0 comments1 min readLW link

Different way classifiers can be diverse

Stuart_Armstrong17 Jan 2022 16:30 UTC
10 points
5 comments2 min readLW link

Scalar reward is not enough for aligned AGI

Peter Vamplew17 Jan 2022 21:02 UTC
15 points
3 comments11 min readLW link

Challenges with Breaking into MIRI-Style Research

Chris_Leong17 Jan 2022 9:23 UTC
72 points
15 comments3 min readLW link

Thought Experiments Provide a Third Anchor

jsteinhardt18 Jan 2022 16:00 UTC
44 points
20 comments4 min readLW link
(bounded-regret.ghost.io)

Anchor Weights for ML

jsteinhardt20 Jan 2022 16:20 UTC
17 points
2 comments2 min readLW link
(bounded-regret.ghost.io)

Estimating training compute of Deep Learning models

20 Jan 2022 16:12 UTC
37 points
4 comments1 min readLW link

Sharing Powerful AI Models

apc21 Jan 2022 11:57 UTC
6 points
4 comments1 min readLW link

[AN #171]: Disagreements between alignment “optimists” and “pessimists”

Rohin Shah21 Jan 2022 18:30 UTC
32 points
1 comment7 min readLW link
(mailchi.mp)

A one-question Turing test for GPT-3

22 Jan 2022 18:17 UTC
84 points
23 comments5 min readLW link

ML Systems Will Have Weird Failure Modes

jsteinhardt26 Jan 2022 1:40 UTC
54 points
8 comments6 min readLW link
(bounded-regret.ghost.io)

Search Is All You Need

blake808625 Jan 2022 23:13 UTC
33 points
13 comments3 min readLW link

Aligned AI Needs Slack

Shmi26 Jan 2022 9:29 UTC
23 points
10 comments1 min readLW link

Empirical Findings Generalize Surprisingly Far

jsteinhardt1 Feb 2022 22:30 UTC
46 points
0 comments6 min readLW link
(bounded-regret.ghost.io)

OpenAI Solves (Some) Formal Math Olympiad Problems

Michaël Trazzi2 Feb 2022 21:49 UTC
77 points
26 comments2 min readLW link

Observed patterns around major technological advancements

Richard Korzekwa 3 Feb 2022 0:30 UTC
45 points
15 comments11 min readLW link
(aiimpacts.org)

Paperclippers, s-risks, hope

superads914 Feb 2022 19:03 UTC
13 points
17 comments1 min readLW link

AI Writeup Part 1

SNl4 Feb 2022 21:16 UTC
8 points
1 comment18 min readLW link

Alignment versus AI Alignment

Alex Flint4 Feb 2022 22:59 UTC
87 points
15 comments22 min readLW link

Capability Phase Transition Examples

gwern8 Feb 2022 3:32 UTC
39 points
1 comment1 min readLW link
(www.reddit.com)

A broad basin of attraction around human values?

Wei Dai12 Apr 2022 5:15 UTC
105 points
16 comments2 min readLW link

Appendix: More Is Different In Other Domains

jsteinhardt8 Feb 2022 16:00 UTC
12 points
1 comment4 min readLW link
(bounded-regret.ghost.io)

[In­tro to brain-like-AGI safety] 2. “Learn­ing from scratch” in the brain

Steven Byrnes2 Feb 2022 13:22 UTC
43 points
12 comments25 min readLW link

Bet­ter im­pos­si­bil­ity re­sult for un­bounded utilities

paulfchristiano9 Feb 2022 6:10 UTC
29 points
24 comments5 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogao10 Feb 2022 6:56 UTC
30 points
3 comments1 min readLW link
(eaidata.bmk.sh)

In­fer­ring util­ity func­tions from lo­cally non-tran­si­tive preferences

Jan10 Feb 2022 10:33 UTC
28 points
15 comments8 min readLW link
(universalprior.substack.com)

A sum­mary of al­ign­ing nar­rowly su­per­hu­man models

gugu10 Feb 2022 18:26 UTC
8 points
0 comments8 min readLW link

Idea: build al­ign­ment dataset for very ca­pa­ble models

Quintin Pope12 Feb 2022 19:30 UTC
9 points
2 comments3 min readLW link

Goal-di­rect­ed­ness: ex­plor­ing explanations

Morgan_Rogers14 Feb 2022 16:20 UTC
13 points
3 comments18 min readLW link

Is ELK enough? Di­a­mond, Ma­trix and Child AI

adamShimi15 Feb 2022 2:29 UTC
17 points
10 comments4 min readLW link

What Does The Nat­u­ral Ab­strac­tion Frame­work Say About ELK?

johnswentworth15 Feb 2022 2:27 UTC
34 points
0 comments6 min readLW link

Some Hacky ELK Ideas

johnswentworth15 Feb 2022 2:27 UTC
34 points
8 comments5 min readLW link

How harm­ful are im­prove­ments in AI? + Poll

15 Feb 2022 18:16 UTC
15 points
4 comments8 min readLW link

Be­com­ing Stronger as Episte­mol­o­gist: Introduction

adamShimi15 Feb 2022 6:15 UTC
29 points
2 comments4 min readLW link

REPL’s: a type sig­na­ture for agents

scottviteri15 Feb 2022 22:57 UTC
23 points
5 comments2 min readLW link

REPL’s and ELK

scottviteri17 Feb 2022 1:14 UTC
9 points
4 comments1 min readLW link

[Link] Eric Sch­midt’s new AI2050 Fund

Aryeh Englander16 Feb 2022 21:21 UTC
32 points
3 comments2 min readLW link

Align­ment re­searchers, how use­ful is ex­tra com­pute for you?

Lauro Langosco19 Feb 2022 15:35 UTC
7 points
4 comments1 min readLW link

[Question] 2 (naive?) ideas for alignment

Jonathan Moregård20 Feb 2022 19:01 UTC
3 points
1 comment1 min readLW link

The Big Pic­ture Of Align­ment (Talk Part 1)

johnswentworth21 Feb 2022 5:49 UTC
98 points
35 comments1 min readLW link
(www.youtube.com)

[Question] Fa­vorite /​ most ob­scure re­search on un­der­stand­ing DNNs?

Vivek Hebbar21 Feb 2022 5:49 UTC
16 points
1 comment1 min readLW link

Two Challenges for ELK

derek shiller21 Feb 2022 5:49 UTC
7 points
0 comments4 min readLW link

[Question] Do any AI al­ign­ment orgs hire re­motely?

RobertM21 Feb 2022 22:33 UTC
24 points
9 comments2 min readLW link

More GPT-3 and sym­bol grounding

Stuart_Armstrong23 Feb 2022 18:30 UTC
21 points
7 comments3 min readLW link

Trans­former in­duc­tive bi­ases & RASP

Vivek Hebbar24 Feb 2022 0:42 UTC
15 points
4 comments1 min readLW link
(proceedings.mlr.press)

A com­ment on Ajeya Co­tra’s draft re­port on AI timelines

Matthew Barnett24 Feb 2022 0:41 UTC
69 points
13 comments7 min readLW link

The Big Pic­ture Of Align­ment (Talk Part 2)

johnswentworth25 Feb 2022 2:53 UTC
33 points
12 comments1 min readLW link
(www.youtube.com)

Trust-max­i­miz­ing AGI

25 Feb 2022 15:13 UTC
7 points
26 comments9 min readLW link
(universalprior.substack.com)

IMO challenge bet with Eliezer

paulfchristiano26 Feb 2022 4:50 UTC
162 points
25 comments3 min readLW link

New Speaker Series on AI Align­ment Start­ing March 3

Zechen Zhang26 Feb 2022 19:31 UTC
7 points
1 comment1 min readLW link

How I Formed My Own Views About AI Safety

Neel Nanda27 Feb 2022 18:50 UTC
64 points
6 comments13 min readLW link
(www.neelnanda.io)

Shah and Yud­kowsky on al­ign­ment failures

28 Feb 2022 19:18 UTC
83 points
38 comments91 min readLW link

ELK Thought Dump

abramdemski28 Feb 2022 18:46 UTC
58 points
18 comments17 min readLW link

Late 2021 MIRI Con­ver­sa­tions: AMA /​ Discussion

Rob Bensinger28 Feb 2022 20:03 UTC
119 points
208 comments1 min readLW link

[Question] What are the causal­ity effects of an agents pres­ence in a re­in­force­ment learn­ing environment

Jonas Kgomo1 Mar 2022 21:57 UTC
0 points
2 comments1 min readLW link

Mus­ings on the Speed Prior

evhub2 Mar 2022 4:04 UTC
19 points
4 comments10 min readLW link

AI Perfor­mance on Hu­man Tasks

Asher Ellis3 Mar 2022 20:13 UTC
58 points
3 comments21 min readLW link

In­tro­duc­ing my­self: Henry Lie­ber­man, MIT CSAIL, why­cantwe.org

Henry A Lieberman3 Mar 2022 23:42 UTC
−2 points
9 comments1 min readLW link

Pre­serv­ing and con­tin­u­ing al­ign­ment re­search through a se­vere global catastrophe

A_donor6 Mar 2022 18:43 UTC
36 points
11 comments5 min readLW link

Why work at AI Im­pacts?

Katja6 Mar 2022 22:10 UTC
50 points
7 comments13 min readLW link
(aiimpacts.org)

Per­sonal imi­ta­tion software

Flaglandbase7 Mar 2022 7:55 UTC
6 points
6 comments1 min readLW link

[MLSN #3]: NeurIPS Safety Paper Roundup

Dan H8 Mar 2022 15:17 UTC
45 points
0 comments4 min readLW link

ELK prize results

9 Mar 2022 0:01 UTC
130 points
50 comments21 min readLW link

[Question] Non-co­er­cive mo­ti­va­tion for al­ign­ment re­search?

Jonathan Moregård8 Mar 2022 20:50 UTC
1 point
0 comments1 min readLW link

On pre­sent­ing the case for AI risk

Aryeh Englander9 Mar 2022 1:41 UTC
54 points
18 comments4 min readLW link

Ask AI com­pa­nies about what they are do­ing for AI safety?

mic9 Mar 2022 15:14 UTC
50 points
0 comments2 min readLW link

Deriv­ing Our World From Small Datasets

Capybasilisk9 Mar 2022 0:34 UTC
5 points
4 comments2 min readLW link

Value ex­trap­o­la­tion, con­cept ex­trap­o­la­tion, model splintering

Stuart_Armstrong8 Mar 2022 22:50 UTC
14 points
1 comment2 min readLW link

The Proof of Doom

johnlawrenceaspden9 Mar 2022 19:37 UTC
27 points
18 comments3 min readLW link

A Rephras­ing Of and Foot­note To An Embed­ded Agency Proposal

JoshuaOSHickman9 Mar 2022 18:13 UTC
5 points
0 comments5 min readLW link

ELK Sub—Note-tak­ing in in­ter­nal rollouts

Hoagy9 Mar 2022 17:23 UTC
6 points
0 comments5 min readLW link

[Question] Are there any im­pos­si­bil­ity the­o­rems for strong and safe AI?

David Johnston11 Mar 2022 1:41 UTC
5 points
3 comments1 min readLW link

Com­pute Trends — Com­par­i­son to OpenAI’s AI and Compute

12 Mar 2022 18:09 UTC
23 points
3 comments3 min readLW link

ELK con­test sub­mis­sion: route un­der­stand­ing through the hu­man ontology

14 Mar 2022 21:42 UTC
21 points
2 comments2 min readLW link

Dual use of ar­tifi­cial-in­tel­li­gence-pow­ered drug discovery

Vaniver15 Mar 2022 2:52 UTC
91 points
15 comments1 min readLW link
(www.nature.com)

[In­tro to brain-like-AGI safety] 8. Take­aways from neuro 1/​2: On AGI development

Steven Byrnes16 Mar 2022 13:59 UTC
41 points
2 comments15 min readLW link

Some (po­ten­tially) fund­able AI Safety Ideas

Logan Riggs16 Mar 2022 12:48 UTC
21 points
5 comments5 min readLW link

What do paradigm shifts look like?

leogao16 Mar 2022 19:17 UTC
15 points
2 comments1 min readLW link

[Question] What is the equiv­a­lent of the “do” op­er­a­tor for finite fac­tored sets?

Chris van Merwijk17 Mar 2022 8:05 UTC
8 points
2 comments1 min readLW link

[Question] What to do af­ter in­vent­ing AGI?

elephantcrew18 Mar 2022 22:30 UTC
9 points
4 comments1 min readLW link

Goal-di­rect­ed­ness: im­perfect rea­son­ing, limited knowl­edge and in­ac­cu­rate beliefs

Morgan_Rogers19 Mar 2022 17:28 UTC
4 points
1 comment21 min readLW link

Wargam­ing AGI Development

ryan_b19 Mar 2022 17:59 UTC
36 points
13 comments5 min readLW link

Ex­plor­ing Finite Fac­tored Sets with some toy examples

Thomas Kehrenberg19 Mar 2022 22:08 UTC
36 points
1 comment9 min readLW link
(tm.kehrenberg.net)

Nat­u­ral Value Learning

Chris van Merwijk20 Mar 2022 12:44 UTC
7 points
10 comments4 min readLW link

Why will an AGI be ra­tio­nal?

azsantosk21 Mar 2022 21:54 UTC
4 points
8 comments2 min readLW link

We can­not di­rectly choose an AGI’s util­ity function

azsantosk21 Mar 2022 22:08 UTC
12 points
18 comments3 min readLW link

Progress Re­port 1: in­ter­pretabil­ity ex­per­i­ments & learn­ing, test­ing com­pres­sion hypotheses

Nathan Helm-Burger22 Mar 2022 20:12 UTC
11 points
0 comments2 min readLW link

Les­sons After a Cou­ple Months of Try­ing to Do ML Research

KevinRoWang22 Mar 2022 23:45 UTC
68 points
8 comments6 min readLW link

Job Offer­ing: Help Com­mu­ni­cate Infrabayesianism

23 Mar 2022 18:35 UTC
135 points
21 comments1 min readLW link

A sur­vey of tool use and work­flows in al­ign­ment research

23 Mar 2022 23:44 UTC
43 points
5 comments1 min readLW link

Why Agent Foun­da­tions? An Overly Ab­stract Explanation

johnswentworth25 Mar 2022 23:17 UTC
247 points
54 comments8 min readLW link

[ASoT] Ob­ser­va­tions about ELK

leogao26 Mar 2022 0:42 UTC
30 points
0 comments3 min readLW link

[Question] When peo­ple ask for your P(doom), do you give them your in­side view or your bet­ting odds?

Vivek Hebbar26 Mar 2022 23:08 UTC
11 points
12 comments1 min readLW link

Com­pute Gover­nance: The Role of Com­mod­ity Hardware

Jan26 Mar 2022 10:08 UTC
14 points
7 comments7 min readLW link
(universalprior.substack.com)

Agency and Coherence

David Udell26 Mar 2022 19:25 UTC
23 points
2 comments3 min readLW link

[ASoT] Some ways ELK could still be solv­able in practice

leogao27 Mar 2022 1:15 UTC
26 points
1 comment2 min readLW link

[Question] Your spe­cific at­ti­tudes to­wards AI safety

Esben Kran27 Mar 2022 22:33 UTC
8 points
22 comments1 min readLW link

[ASoT] Search­ing for con­se­quen­tial­ist structure

leogao27 Mar 2022 19:09 UTC
25 points
2 comments4 min readLW link

Vaniver’s ELK Submission

Vaniver28 Mar 2022 21:14 UTC
10 points
0 comments7 min readLW link

Towards a bet­ter cir­cuit prior: Im­prov­ing on ELK state-of-the-art

evhub29 Mar 2022 1:56 UTC
19 points
0 comments16 min readLW link

Strate­gies for differ­en­tial di­vul­ga­tion of key ideas in AI capability

azsantosk29 Mar 2022 3:22 UTC
8 points
0 comments6 min readLW link

[ASoT] Some thoughts about de­cep­tive mesaoptimization

leogao28 Mar 2022 21:14 UTC
24 points
5 comments7 min readLW link

[Question] What would make you con­fi­dent that AGI has been achieved?

Yitz29 Mar 2022 23:02 UTC
17 points
6 comments1 min readLW link

Progress Re­port 2

Nathan Helm-Burger30 Mar 2022 2:29 UTC
4 points
1 comment1 min readLW link

[ASoT] Some thoughts about LM monologue limi­ta­tions and ELK

leogao30 Mar 2022 14:26 UTC
10 points
0 comments2 min readLW link

Pro­ce­du­rally eval­u­at­ing fac­tual ac­cu­racy: a re­quest for research

Jacob_Hilton30 Mar 2022 16:37 UTC
24 points
2 comments6 min readLW link

No, EDT Did Not Get It Right All Along: Why the Coin Flip Creation Prob­lem Is Irrelevant

Heighn30 Mar 2022 18:41 UTC
6 points
6 comments3 min readLW link

ELK Com­pu­ta­tional Com­plex­ity: Three Levels of Difficulty

abramdemski30 Mar 2022 20:56 UTC
46 points
9 comments7 min readLW link

[Link] Train­ing Com­pute-Op­ti­mal Large Lan­guage Models

nostalgebraist31 Mar 2022 18:01 UTC
50 points
23 comments1 min readLW link
(arxiv.org)

New­comb’s prob­lem is just a stan­dard time con­sis­tency problem

basil.halperin31 Mar 2022 17:32 UTC
12 points
6 comments12 min readLW link

The Calcu­lus of New­comb’s Problem

Heighn1 Apr 2022 14:41 UTC
3 points
6 comments2 min readLW link

New Scal­ing Laws for Large Lan­guage Models

1a3orn1 Apr 2022 20:41 UTC
223 points
21 comments5 min readLW link

In­ter­act­ing with a Boxed AI

aphyer1 Apr 2022 22:42 UTC
11 points
19 comments4 min readLW link

Op­ti­mal­ity is the tiger, and agents are its teeth

Veedrac2 Apr 2022 0:46 UTC
197 points
31 comments16 min readLW link

[Question] How can a lay­man con­tribute to AI Align­ment efforts, given shorter timeline/​doomier sce­nar­ios?

AprilSR2 Apr 2022 4:34 UTC
13 points
5 comments1 min readLW link

AI Gover­nance across Slow/​Fast Take­off and Easy/​Hard Align­ment spectra

Davidmanheim3 Apr 2022 7:45 UTC
27 points
6 comments3 min readLW link

[Question] What are some ways in which we can die with more dig­nity?

Chris_Leong3 Apr 2022 5:32 UTC
14 points
19 comments1 min readLW link

[Question] Should we push for ban­ning mak­ing hiring de­ci­sions based on AI?

ChristianKl3 Apr 2022 19:46 UTC
10 points
6 comments1 min readLW link

Bayeswatch 9.5: Rest & Relaxation

lsusr4 Apr 2022 1:13 UTC
24 points
1 comment2 min readLW link

Bayeswatch 6.5: Therapy

lsusr4 Apr 2022 1:20 UTC
15 points
0 comments1 min readLW link

The­o­ries of Mo­du­lar­ity in the Biolog­i­cal Literature

4 Apr 2022 12:48 UTC
47 points
13 comments7 min readLW link

Google’s new 540 billion pa­ram­e­ter lan­guage model

Matthew Barnett4 Apr 2022 17:49 UTC
108 points
83 comments1 min readLW link
(storage.googleapis.com)

Call For Distillers

johnswentworth4 Apr 2022 18:25 UTC
192 points
42 comments3 min readLW link

Is the scal­ing race fi­nally on?

p.b.4 Apr 2022 19:53 UTC
24 points
0 comments2 min readLW link

Yud­kowsky Con­tra Chris­ti­ano on AI Take­off Speeds [Linkpost]

aogara5 Apr 2022 2:09 UTC
18 points
0 comments11 min readLW link

[Cross-post] Half baked ideas: defin­ing and mea­sur­ing Ar­tifi­cial In­tel­li­gence sys­tem effectiveness

David Johnston5 Apr 2022 0:29 UTC
2 points
0 comments7 min readLW link

[Question] Why is Toby Ord’s like­li­hood of hu­man ex­tinc­tion due to AI so low?

ChristianKl5 Apr 2022 12:16 UTC
8 points
9 comments1 min readLW link

Non-pro­gram­mers in­tro to AI for programmers

Dustin5 Apr 2022 18:12 UTC
6 points
0 comments2 min readLW link

What Would A Fight Between Hu­man­ity And AGI Look Like?

johnswentworth5 Apr 2022 20:03 UTC
79 points
22 comments3 min readLW link

Su­per­vise Pro­cess, not Outcomes

5 Apr 2022 22:18 UTC
119 points
8 comments10 min readLW link

AXRP Epi­sode 14 - In­fra-Bayesian Phys­i­cal­ism with Vanessa Kosoy

DanielFilan5 Apr 2022 23:10 UTC
23 points
9 comments52 min readLW link

[Question] What’s the prob­lem with hav­ing an AI al­ign it­self?

FinalFormal26 Apr 2022 0:59 UTC
0 points
3 comments1 min readLW link

What if we stopped mak­ing GPUs for a bit?

MrPointy5 Apr 2022 23:02 UTC
−3 points
2 comments1 min readLW link

Don’t die with dig­nity; in­stead play to your outs

Jeffrey Ladish6 Apr 2022 7:53 UTC
243 points
58 comments5 min readLW link

What I Was Think­ing About Be­fore Alignment

johnswentworth6 Apr 2022 16:08 UTC
77 points
8 comments5 min readLW link

[Link] A min­i­mal vi­able product for alignment

janleike6 Apr 2022 15:38 UTC
51 points
38 comments1 min readLW link

[Link] Why I’m ex­cited about AI-as­sisted hu­man feedback

janleike6 Apr 2022 15:37 UTC
29 points
0 comments1 min readLW link

Test­ing PaLM prompts on GPT3

Yitz6 Apr 2022 5:21 UTC
103 points
15 comments8 min readLW link

[ASoT] Some thoughts about im­perfect world modeling

leogao7 Apr 2022 15:42 UTC
7 points
0 comments4 min readLW link

Truth­ful­ness, stan­dards and credibility

Joe Collman7 Apr 2022 10:31 UTC
12 points
2 comments32 min readLW link

What if “friendly/​un­friendly” GAI isn’t a thing?

homunq7 Apr 2022 16:54 UTC
−1 points
4 comments2 min readLW link

Pro­duc­tive Mis­takes, Not Perfect Answers

adamShimi7 Apr 2022 16:41 UTC
95 points
11 comments6 min readLW link

Believ­able near-term AI disaster

Dagon7 Apr 2022 18:20 UTC
8 points
2 comments2 min readLW link

How BoMAI Might fail

Donald Hobson7 Apr 2022 15:32 UTC
11 points
3 comments2 min readLW link

Deep­Mind: The Pod­cast—Ex­cerpts on AGI

WilliamKiely7 Apr 2022 22:09 UTC
75 points
10 comments5 min readLW link

AI Align­ment and Recognition

Chris_Leong8 Apr 2022 5:39 UTC
7 points
2 comments1 min readLW link

Re­v­erse (in­tent) al­ign­ment may al­low for safer Oracles

azsantosk8 Apr 2022 2:48 UTC
4 points
0 comments4 min readLW link

AIs should learn hu­man prefer­ences, not biases

Stuart_Armstrong8 Apr 2022 13:45 UTC
10 points
1 comment1 min readLW link

[Question] Is there a pos­si­bil­ity that the up­com­ing scal­ing of data in lan­guage mod­els causes A.G.I.?

ArtMi8 Apr 2022 6:56 UTC
2 points
0 comments1 min readLW link

Differ­ent per­spec­tives on con­cept extrapolation

Stuart_Armstrong8 Apr 2022 10:42 UTC
42 points
7 comments5 min readLW link

[RETRACTED] It’s time for EA lead­er­ship to pull the short-timelines fire alarm.

Not Relevant8 Apr 2022 16:07 UTC
112 points
165 comments4 min readLW link

Con­vinc­ing All Ca­pa­bil­ity Researchers

Logan Riggs8 Apr 2022 17:40 UTC
120 points
70 comments3 min readLW link

Lan­guage Model Tools for Align­ment Research

Logan Riggs8 Apr 2022 17:32 UTC
27 points
0 comments2 min readLW link

[Question] What would the cre­ation of al­igned AGI look like for us?

Perhaps8 Apr 2022 18:05 UTC
3 points
4 comments1 min readLW link

Take­aways From 3 Years Work­ing In Ma­chine Learning

George3d68 Apr 2022 17:14 UTC
34 points
10 comments11 min readLW link
(www.epistem.ink)

[Question] Can AI sys­tems have ex­tremely im­pres­sive out­puts and also not need to be al­igned be­cause they aren’t gen­eral enough or some­thing?

WilliamKiely9 Apr 2022 6:03 UTC
6 points
3 comments1 min readLW link

Why In­stru­men­tal Goals are not a big AI Safety Problem

Jonathan Paulson9 Apr 2022 0:10 UTC
0 points
9 comments3 min readLW link

Emer­gent Ven­tures/​Sch­midt (new grantor for in­di­vi­d­ual re­searchers)

gwern9 Apr 2022 14:41 UTC
21 points
6 comments1 min readLW link
(marginalrevolution.com)

Strate­gies for keep­ing AIs nar­row in the short term

Rossin9 Apr 2022 16:42 UTC
9 points
3 comments3 min readLW link

A con­crete bet offer to those with short AI timelines

9 Apr 2022 21:41 UTC
195 points
104 comments4 min readLW link

Fi­nally En­ter­ing Alignment

Ulisse Mini10 Apr 2022 17:01 UTC
75 points
8 comments2 min readLW link

[Question] Does non-ac­cess to out­puts pre­vent re­cur­sive self-im­prove­ment?

Gunnar_Zarncke10 Apr 2022 18:37 UTC
14 points
0 comments1 min readLW link

[Question] Con­vince me that hu­man­ity is as doomed by AGI as Yud­kowsky et al., seems to believe

Yitz10 Apr 2022 21:02 UTC
91 points
142 comments2 min readLW link

[Question] Could we set a re­s­olu­tion/​stop­per for the up­per bound of the util­ity func­tion of an AI?

FinalFormal211 Apr 2022 3:10 UTC
−5 points
2 comments1 min readLW link

What can peo­ple not smart/​tech­ni­cal enough for AI re­search/​AI risk work do to re­duce AI-risk/​max­i­mize AI safety? (which is most peo­ple?)

Alex K. Chen (parrot)11 Apr 2022 14:05 UTC
7 points
3 comments3 min readLW link

We should stop be­ing so con­fi­dent that AI co­or­di­na­tion is unlikely

trevor11 Apr 2022 22:27 UTC
14 points
7 comments1 min readLW link

The Reg­u­la­tory Op­tion: A re­sponse to near 0% sur­vival odds

Matthew Lowenstein11 Apr 2022 22:00 UTC
45 points
21 comments6 min readLW link

[Question] How can I de­ter­mine that Elicit is not some weak AGI’s at­tempt at tak­ing over the world ?

Lucie Philippon12 Apr 2022 0:54 UTC
5 points
3 comments1 min readLW link

[Question] Three ques­tions about mesa-optimizers

Eric Neyman12 Apr 2022 2:58 UTC
23 points
5 comments3 min readLW link

A Small Nega­tive Re­sult on Debate

Sam Bowman12 Apr 2022 18:19 UTC
42 points
11 comments1 min readLW link

The Peerless

Tamsin Leake13 Apr 2022 1:07 UTC
18 points
2 comments1 min readLW link
(carado.moe)

Con­vinc­ing Peo­ple of Align­ment with Street Epistemology

Logan Riggs12 Apr 2022 23:43 UTC
54 points
4 comments3 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not Relevant13 Apr 2022 2:02 UTC
32 points
32 comments1 min readLW link

How dath ilan co­or­di­nates around solv­ing alignment

Thomas Kwa13 Apr 2022 4:22 UTC
46 points
37 comments5 min readLW link

[Question] What’s a good prob­a­bil­ity dis­tri­bu­tion fam­ily (e.g. “log-nor­mal”) to use for AGI timelines?

David Scott Krueger (formerly: capybaralet)13 Apr 2022 4:45 UTC
9 points
12 comments1 min readLW link

Take­off speeds have a huge effect on what it means to work on AI x-risk

Buck13 Apr 2022 17:38 UTC
117 points
25 comments2 min readLW link

De­sign, Im­ple­ment and Verify

rwallace13 Apr 2022 18:14 UTC
32 points
13 comments4 min readLW link

[Question] What to in­clude in a guest lec­ture on ex­is­ten­tial risks from AI?

Aryeh Englander13 Apr 2022 17:03 UTC
20 points
9 comments1 min readLW link

A Quick Guide to Con­fronting Doom

Ruby13 Apr 2022 19:30 UTC
224 points
36 comments2 min readLW link

Ex­plor­ing toy neu­ral nets un­der node re­moval. Sec­tion 1.

Donald Hobson13 Apr 2022 23:30 UTC
12 points
7 comments8 min readLW link

[Question] Un­change­able Code pos­si­ble ?

AntonTimmer14 Apr 2022 11:17 UTC
7 points
9 comments1 min readLW link

How to be­come an AI safety researcher

peterbarnett15 Apr 2022 11:41 UTC
19 points
0 comments14 min readLW link

Early 2022 Paper Round-up

jsteinhardt14 Apr 2022 20:50 UTC
80 points
4 comments3 min readLW link
(bounded-regret.ghost.io)

[Question] Can some­one ex­plain to me why MIRI is so pes­simistic of our chances of sur­vival?

iamthouthouarti14 Apr 2022 20:28 UTC
10 points
7 comments1 min readLW link

Pivotal acts from Math AIs

azsantosk15 Apr 2022 0:25 UTC
10 points
4 comments5 min readLW link

Refine: An In­cu­ba­tor for Con­cep­tual Align­ment Re­search Bets

adamShimi15 Apr 2022 8:57 UTC
123 points
13 comments4 min readLW link

My least fa­vorite thing

sudo14 Apr 2022 22:33 UTC
41 points
30 comments3 min readLW link

[Question] Con­strain­ing nar­row AI in a cor­po­rate setting

MaximumLiberty15 Apr 2022 22:36 UTC
28 points
4 comments1 min readLW link

Pop Cul­ture Align­ment Re­search and Taxes

Jan16 Apr 2022 15:45 UTC
16 points
14 comments11 min readLW link
(universalprior.substack.com)

Org an­nounce­ment: [AC]RC

Vivek Hebbar17 Apr 2022 17:24 UTC
79 points
12 comments1 min readLW link

Code Gen­er­a­tion as an AI risk setting

Not Relevant17 Apr 2022 22:27 UTC
91 points
16 comments2 min readLW link

Men­tal Health and the Align­ment Prob­lem: A Com­pila­tion of Resources

Chris Scammell18 Apr 2022 18:36 UTC
139 points
7 comments17 min readLW link

Is “Con­trol” of a Su­per­in­tel­li­gence Pos­si­ble?

Mahdi Complex18 Apr 2022 16:03 UTC
9 points
14 comments1 min readLW link

[Closed] Hiring a math­e­mat­i­cian to work on the learn­ing-the­o­retic AI al­ign­ment agenda

Vanessa Kosoy19 Apr 2022 6:44 UTC
84 points
21 comments2 min readLW link

[Question] The two miss­ing core rea­sons why al­ign­ing at-least-par­tially su­per­hu­man AGI is hard

Joel Burget19 Apr 2022 17:15 UTC
7 points
2 comments1 min readLW link

[Question] How does the world look like 10 years af­ter we have de­ployed an al­igned AGI?

mukashi19 Apr 2022 11:34 UTC
4 points
3 comments1 min readLW link

[Question] Clar­ifi­ca­tion on Defi­ni­tion of AGI

stanislaw19 Apr 2022 12:41 UTC
0 points
1 comment1 min readLW link

[Question] What’s the Re­la­tion­ship Between “Hu­man Values” and the Brain’s Re­ward Sys­tem?

interstice19 Apr 2022 5:15 UTC
36 points
16 comments1 min readLW link

De­cep­tive Agents are a Good Way to Do Things

David Udell19 Apr 2022 18:04 UTC
15 points
0 comments1 min readLW link

The Scale Prob­lem in AI

tailcalled19 Apr 2022 17:46 UTC
22 points
17 comments3 min readLW link

Con­cept ex­trap­o­la­tion: key posts

Stuart_Armstrong19 Apr 2022 10:01 UTC
12 points
2 comments1 min readLW link

“Pivotal Act” In­ten­tions: Nega­tive Con­se­quences and Fal­la­cious Arguments

Andrew_Critch19 Apr 2022 20:25 UTC
96 points
56 comments7 min readLW link

GPT-3 and con­cept extrapolation

Stuart_Armstrong20 Apr 2022 10:39 UTC
19 points
28 comments1 min readLW link

[In­tro to brain-like-AGI safety] 12. Two paths for­ward: “Con­trol­led AGI” and “So­cial-in­stinct AGI”

Steven Byrnes20 Apr 2022 12:58 UTC
33 points
10 comments16 min readLW link

Pr­ereg­is­tra­tion: Air Con­di­tioner Test

johnswentworth21 Apr 2022 19:48 UTC
109 points
64 comments9 min readLW link

[Question] Choice := An­throp­ics un­cer­tainty? And po­ten­tial im­pli­ca­tions for agency

Antoine de Scorraille21 Apr 2022 16:38 UTC
5 points
1 comment1 min readLW link

Un­der­stand­ing the Merg­ing of Opinions with In­creas­ing In­for­ma­tion theorem

ViktoriaMalyasova21 Apr 2022 14:13 UTC
13 points
1 comment5 min readLW link

Early 2022 Paper Round-up (Part 2)

jsteinhardt21 Apr 2022 23:40 UTC
10 points
0 comments5 min readLW link
(bounded-regret.ghost.io)

[Question] What are the num­bers in mind for the su­per-short AGI timelines so many long-ter­mists are alarmed about?

Evan_Gaensbauer21 Apr 2022 23:32 UTC
22 points
14 comments1 min readLW link

AI Will Multiply

harsimony22 Apr 2022 4:33 UTC
13 points
4 comments1 min readLW link
(harsimony.wordpress.com)

Hu­man­ity as an en­tity: An al­ter­na­tive to Co­her­ent Ex­trap­o­lated Volition

Victor Novikov22 Apr 2022 12:48 UTC
0 points
2 comments4 min readLW link

[ASoT] Con­se­quen­tial­ist mod­els as a su­per­set of mesaoptimizers

leogao23 Apr 2022 17:57 UTC
36 points
2 comments4 min readLW link

Skil­ling-up in ML Eng­ineer­ing for Align­ment: re­quest for comments

23 Apr 2022 15:11 UTC
19 points
0 comments1 min readLW link

[Question] Want­ing to change what you want

Mithrandir23 Apr 2022 4:23 UTC
−1 points
1 comment1 min readLW link

Progress Re­port 5: ty­ing it together

Nathan Helm-Burger23 Apr 2022 21:07 UTC
10 points
0 comments2 min readLW link

Cal­ling for Stu­dent Sub­mis­sions: AI Safety Distil­la­tion Contest

Aris24 Apr 2022 1:53 UTC
48 points
15 comments4 min readLW link

Ex­am­in­ing Evolu­tion as an Up­per Bound for AGI Timelines

meanderingmoose24 Apr 2022 19:08 UTC
5 points
1 comment9 min readLW link

AI safety rais­ing aware­ness re­sources bleg

iivonen24 Apr 2022 17:13 UTC
6 points
1 comment1 min readLW link

In­tu­itions about solv­ing hard problems

Richard_Ngo25 Apr 2022 15:29 UTC
92 points
23 comments6 min readLW link

[Re­quest for Distil­la­tion] Co­her­ence of Distributed De­ci­sions With Differ­ent In­puts Im­plies Conditioning

johnswentworth25 Apr 2022 17:01 UTC
22 points
14 comments2 min readLW link

dalle2 comments

nostalgebraist26 Apr 2022 5:30 UTC
183 points
13 comments13 min readLW link
(nostalgebraist.tumblr.com)

Make a neu­ral net­work in ~10 minutes

Arjun Yadav26 Apr 2022 5:24 UTC
8 points
0 comments4 min readLW link
(arjunyadav.net)

Law-Fol­low­ing AI 1: Se­quence In­tro­duc­tion and Structure

Cullen27 Apr 2022 17:26 UTC
16 points
10 comments9 min readLW link

Law-Fol­low­ing AI 2: In­tent Align­ment + Su­per­in­tel­li­gence → Lawless AI (By De­fault)

Cullen27 Apr 2022 17:27 UTC
5 points
2 comments6 min readLW link

Law-Fol­low­ing AI 3: Lawless AI Agents Un­der­mine Sta­bi­liz­ing Agreements

Cullen27 Apr 2022 17:30 UTC
2 points
2 comments3 min readLW link

If you’re very op­ti­mistic about ELK then you should be op­ti­mistic about outer alignment

Sam Marks27 Apr 2022 19:30 UTC
17 points
8 comments3 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

The Speed + Sim­plic­ity Prior is prob­a­bly anti-deceptive

Yonadav Shavit27 Apr 2022 19:30 UTC
30 points
29 comments12 min readLW link

Slides: Po­ten­tial Risks From Ad­vanced AI

Aryeh Englander28 Apr 2022 2:15 UTC
7 points
0 comments1 min readLW link

How Might an Align­ment At­trac­tor Look like?

Shmi28 Apr 2022 6:46 UTC
47 points
15 comments2 min readLW link

Naive com­ments on AGIlignment

Ericf28 Apr 2022 1:08 UTC
2 points
4 comments1 min readLW link

[Question] Is al­ign­ment pos­si­ble?

Shay28 Apr 2022 21:18 UTC
0 points
5 comments1 min readLW link

Learn­ing the smooth prior

29 Apr 2022 21:10 UTC
31 points
0 comments12 min readLW link

[Linkpost] New multi-modal Deep­mind model fus­ing Chin­chilla with images and videos

p.b.30 Apr 2022 3:47 UTC
53 points
18 comments1 min readLW link

Note-Tak­ing with­out Hid­den Messages

Hoagy30 Apr 2022 11:15 UTC
7 points
1 comment4 min readLW link

[Question] Why hasn’t deep learn­ing gen­er­ated sig­nifi­cant eco­nomic value yet?

Alex_Altair30 Apr 2022 20:27 UTC
112 points
95 comments2 min readLW link

What is the solu­tion to the Align­ment prob­lem?

Algon30 Apr 2022 23:19 UTC
24 points
2 comments1 min readLW link

[Linkpost] Value ex­trac­tion via lan­guage model abduction

Paul Bricman1 May 2022 19:11 UTC
4 points
3 comments1 min readLW link
(paulbricman.com)

ELK shaving

Miss Aligned AI1 May 2022 21:05 UTC
6 points
1 comment1 min readLW link

So has AI con­quered Bridge ?

Ponder Stibbons2 May 2022 15:01 UTC
16 points
2 comments14 min readLW link

In­for­ma­tion se­cu­rity con­sid­er­a­tions for AI and the long term future

2 May 2022 20:54 UTC
74 points
6 comments10 min readLW link

Is evolu­tion­ary in­fluence the mesa ob­jec­tive that we’re in­ter­ested in?

David Johnston3 May 2022 1:18 UTC
3 points
2 comments5 min readLW link

Var­i­ous Align­ment Strate­gies (and how likely they are to work)

Logan Zoellner3 May 2022 16:54 UTC
73 points
34 comments11 min readLW link

In­tro­duc­ing the ML Safety Schol­ars Program

4 May 2022 16:01 UTC
73 points
2 comments3 min readLW link

Franken­stein: A Modern AGI

Sable5 May 2022 16:16 UTC
9 points
10 comments9 min readLW link

[Question] What is bias in al­ign­ment terms?

Jonas Kgomo4 May 2022 21:35 UTC
0 points
2 comments1 min readLW link

Ethan Ca­ballero on Pri­vate Scal­ing Progress

Michaël Trazzi5 May 2022 18:32 UTC
62 points
1 comment2 min readLW link
(theinsideview.github.io)

Ap­ply to the sec­ond iter­a­tion of the ML for Align­ment Boot­camp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

Buck6 May 2022 4:23 UTC
68 points
0 comments6 min readLW link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

Buck6 May 2022 14:35 UTC
118 points
19 comments3 min readLW link

Get­ting GPT-3 to pre­dict Me­tac­u­lus questions

MathiasKB6 May 2022 6:01 UTC
68 points
8 comments2 min readLW link

But What’s Your *New Align­ment In­sight,* out of a Fu­ture-Text­book Para­graph?

David Udell7 May 2022 3:10 UTC
24 points
18 comments5 min readLW link

Video and Tran­script of Pre­sen­ta­tion on Ex­is­ten­tial Risk from Power-Seek­ing AI

Joe Carlsmith8 May 2022 3:50 UTC
20 points
1 comment29 min readLW link

A Bird’s Eye View of the ML Field [Prag­matic AI Safety #2]

9 May 2022 17:18 UTC
126 points
5 comments35 min readLW link

In­tro­duc­tion to Prag­matic AI Safety [Prag­matic AI Safety #1]

9 May 2022 17:06 UTC
70 points
1 comment6 min readLW link

Jobs: Help scale up LM al­ign­ment re­search at NYU

Sam Bowman9 May 2022 14:12 UTC
60 points
1 comment1 min readLW link

When is AI safety re­search harm­ful?

NathanBarnard9 May 2022 18:19 UTC
2 points
0 comments8 min readLW link

AI Align­ment YouTube Playlists

9 May 2022 21:33 UTC
29 points
4 comments1 min readLW link

Ex­am­in­ing Arm­strong’s cat­e­gory of gen­er­al­ized models

Morgan_Rogers10 May 2022 9:07 UTC
14 points
0 comments7 min readLW link

An In­side View of AI Alignment

Ansh Radhakrishnan11 May 2022 2:16 UTC
31 points
2 comments2 min readLW link

[Question] What are your recom­men­da­tions for tech­ni­cal AI al­ign­ment pod­casts?

Evan_Gaensbauer11 May 2022 21:52 UTC
5 points
4 comments1 min readLW link

Deep­mind’s Gato: Gen­er­al­ist Agent

Daniel Kokotajlo12 May 2022 16:01 UTC
164 points
61 comments1 min readLW link

“A Gen­er­al­ist Agent”: New Deep­Mind Publication

1a3orn12 May 2022 15:30 UTC
79 points
43 comments1 min readLW link

A ten­ta­tive di­alogue with a Friendly-boxed-su­per-AGI on brain uploads

Ramiro P.12 May 2022 19:40 UTC
1 point
12 comments4 min readLW link

Pos­i­tive out­comes un­der an un­al­igned AGI takeover

Yitz12 May 2022 7:45 UTC
19 points
12 comments3 min readLW link

The Last Paperclip

Logan Zoellner12 May 2022 19:25 UTC
57 points
15 comments17 min readLW link

RLHF

Ansh Radhakrishnan12 May 2022 21:18 UTC
16 points
5 comments5 min readLW link

[Question] What to do when start­ing a busi­ness in an im­mi­nent-AGI world?

ryan_b12 May 2022 21:07 UTC
25 points
7 comments1 min readLW link

Deep­Mind is hiring for the Scal­able Align­ment and Align­ment Teams

13 May 2022 12:17 UTC
145 points
35 comments9 min readLW link

“Tech com­pany sin­gu­lar­i­ties”, and steer­ing them to re­duce x-risk

Andrew_Critch13 May 2022 17:24 UTC
73 points
12 comments4 min readLW link

Against Time in Agent Models

johnswentworth13 May 2022 19:55 UTC
50 points
12 comments3 min readLW link

Frame for Take-Off Speeds to in­form com­pute gov­er­nance & scal­ing alignment

Logan Riggs13 May 2022 22:23 UTC
15 points
2 comments2 min readLW link

Align­ment as Constraints

Logan Riggs13 May 2022 22:07 UTC
10 points
0 comments2 min readLW link

Fermi es­ti­ma­tion of the im­pact you might have work­ing on AI safety

Fabien Roger13 May 2022 17:49 UTC
6 points
0 comments1 min readLW link

An ob­ser­va­tion about Hub­inger et al.’s frame­work for learned optimization

carboniferous_umbraculum 13 May 2022 16:20 UTC
33 points
9 comments8 min readLW link

Thoughts on AI Safety Camp

Charlie Steiner13 May 2022 7:16 UTC
24 points
7 comments7 min readLW link

Clar­ify­ing the con­fu­sion around in­ner alignment

Rauno Arike13 May 2022 23:05 UTC
27 points
0 comments11 min readLW link

[Link post] Promis­ing Paths to Align­ment—Con­nor Leahy | Talk

frances_lorenz14 May 2022 16:01 UTC
34 points
0 comments1 min readLW link

The AI Count­down Clock

River Lewis15 May 2022 18:37 UTC
40 points
27 comments2 min readLW link
(heytraveler.substack.com)

Sur­viv­ing Au­toma­tion In The 21st Cen­tury—Part 1

George3d615 May 2022 19:16 UTC
27 points
17 comments8 min readLW link
(www.epistem.ink)

Why I’m Op­ti­mistic About Near-Term AI Risk

harsimony15 May 2022 23:05 UTC
57 points
28 comments1 min readLW link

Op­ti­miza­tion at a Distance

johnswentworth16 May 2022 17:58 UTC
78 points
13 comments4 min readLW link

[Question] To what ex­tent is your AGI timeline bi­modal or oth­er­wise “bumpy”?

jchan16 May 2022 17:42 UTC
13 points
2 comments1 min readLW link

Proxy mis­speci­fi­ca­tion and the ca­pa­bil­ities vs. value learn­ing race

Sam Marks16 May 2022 18:58 UTC
19 points
1 comment4 min readLW link

How to in­vest in ex­pec­ta­tion of AGI?

Jakobovski17 May 2022 11:03 UTC
3 points
4 comments1 min readLW link

[In­tro to brain-like-AGI safety] 15. Con­clu­sion: Open prob­lems, how to help, AMA

Steven Byrnes17 May 2022 15:11 UTC
81 points
11 comments14 min readLW link

Ac­tion­able-guidance and roadmap recom­men­da­tions for the NIST AI Risk Man­age­ment Framework

17 May 2022 15:26 UTC
25 points
0 comments3 min readLW link

What are the pos­si­ble tra­jec­to­ries of an AGI/​ASI world?

Jakobovski17 May 2022 13:28 UTC
0 points
2 comments1 min readLW link

Max­ent and Ab­strac­tions: Cur­rent Best Arguments

johnswentworth18 May 2022 19:54 UTC
34 points
2 comments3 min readLW link

How to get into AI safety research

Stuart_Armstrong18 May 2022 18:05 UTC
44 points
7 comments1 min readLW link

A bridge to Dath Ilan? Im­proved gov­er­nance on the crit­i­cal path to AI al­ign­ment.

Jackson Wagner18 May 2022 15:51 UTC
23 points
0 comments11 min readLW link

We have achieved Noob Gains in AI

phdead18 May 2022 20:56 UTC
114 points
21 comments7 min readLW link

[Question] Why does gra­di­ent de­scent always work on neu­ral net­works?

MichaelDickens20 May 2022 21:13 UTC
15 points
11 comments1 min readLW link

How RL Agents Be­have When Their Ac­tions Are Mod­ified? [Distil­la­tion post]

PabloAMC20 May 2022 18:47 UTC
21 points
0 comments8 min readLW link

Over-digi­tal­iza­tion: A Pre­lude to Analo­gia (Chap­ter 6)

Justin Bullock20 May 2022 16:39 UTC
3 points
0 comments13 min readLW link

Clar­ify­ing what ELK is try­ing to achieve

Towards_Keeperhood21 May 2022 7:34 UTC
7 points
0 comments5 min readLW link

[Short ver­sion] In­for­ma­tion Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:59 UTC
11 points
0 comments1 min readLW link

In­for­ma­tion Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:58 UTC
47 points
31 comments7 min readLW link

What kinds of al­gorithms do multi-hu­man imi­ta­tors learn?

22 May 2022 14:27 UTC
20 points
0 comments3 min readLW link

Are hu­man imi­ta­tors su­per­hu­man mod­els with ex­plicit con­straints on ca­pa­bil­ities?

Chris van Merwijk22 May 2022 12:46 UTC
41 points
3 comments1 min readLW link

Ad­ver­sar­ial at­tacks and op­ti­mal control

Jan22 May 2022 18:22 UTC
16 points
7 comments8 min readLW link
(universalprior.substack.com)

CNN fea­ture vi­su­al­iza­tion in 50 lines of code

StefanHex26 May 2022 11:02 UTC
17 points
4 comments5 min readLW link

[Question] [Align­ment] Is there a cen­sus on who’s work­ing on what?

Cedar23 May 2022 15:33 UTC
23 points
6 comments1 min readLW link

AXRP Epi­sode 15 - Nat­u­ral Ab­strac­tions with John Wentworth

DanielFilan23 May 2022 5:40 UTC
32 points
1 comment57 min readLW link

Why I’m Wor­ried About AI

peterbarnett23 May 2022 21:13 UTC
21 points
2 comments12 min readLW link

Com­plex Sys­tems for AI Safety [Prag­matic AI Safety #3]

24 May 2022 0:00 UTC
49 points
2 comments21 min readLW link

The No Free Lunch the­o­rems and their Razor

Adrià Garriga-alonso24 May 2022 6:40 UTC
47 points
3 comments9 min readLW link

Google’s Ima­gen uses larger text encoder

Ben Livengood24 May 2022 21:55 UTC
27 points
2 comments1 min readLW link

au­ton­omy: the miss­ing AGI in­gre­di­ent?

nostalgebraist25 May 2022 0:33 UTC
61 points
13 comments6 min readLW link

Paper: Teach­ing GPT3 to ex­press un­cer­tainty in words

Owain_Evans31 May 2022 13:27 UTC
96 points
7 comments4 min readLW link

Croe­sus, Cer­berus, and the mag­pies: a gen­tle in­tro­duc­tion to Elic­it­ing La­tent Knowledge

Alexandre Variengien27 May 2022 17:58 UTC
14 points
0 comments16 min readLW link

[Question] How much white col­lar work could be au­to­mated us­ing ex­ist­ing ML mod­els?

AM26 May 2022 8:09 UTC
25 points
4 comments1 min readLW link

The Poin­t­ers Prob­lem—Distilled

Nina Panickssery26 May 2022 22:44 UTC
9 points
0 comments2 min readLW link

Iter­ated Distil­la­tion-Am­plifi­ca­tion, Gato, and Proto-AGI [Re-Ex­plained]

Gabe M27 May 2022 5:42 UTC
21 points
4 comments6 min readLW link

Boot­strap­ping Lan­guage Models

harsimony27 May 2022 19:43 UTC
7 points
5 comments2 min readLW link

Un­der­stand­ing Selec­tion Theorems

adamk28 May 2022 1:49 UTC
35 points
3 comments7 min readLW link

[Question] What have been the ma­jor “triumphs” in the field of AI over the last ten years?

lc28 May 2022 19:49 UTC
35 points
10 comments1 min readLW link

[Question] Bayesian Per­sua­sion?

Karthik Tadepalli28 May 2022 17:52 UTC
8 points
2 comments1 min readLW link

Distributed Decisions

johnswentworth29 May 2022 2:43 UTC
65 points
4 comments6 min readLW link

The Prob­lem With The Cur­rent State of AGI Definitions

Yitz29 May 2022 13:58 UTC
40 points
22 comments8 min readLW link

Func­tional Anal­y­sis Read­ing Group

Ulisse Mini28 May 2022 2:40 UTC
4 points
0 comments1 min readLW link

[Question] Im­pact of ” ‘Let’s think step by step’ is all you need”?

yrimon24 Jul 2022 20:59 UTC
20 points
2 comments1 min readLW link

Perform Tractable Re­search While Avoid­ing Ca­pa­bil­ities Ex­ter­nal­ities [Prag­matic AI Safety #4]

30 May 2022 20:25 UTC
43 points
3 comments25 min readLW link

[Question] What is the state of Chi­nese AI re­search?

Ratios31 May 2022 10:05 UTC
34 points
17 comments1 min readLW link

The Brain That Builds Itself

Jan31 May 2022 9:42 UTC
55 points
6 comments8 min readLW link
(universalprior.substack.com)

Machines vs. Memes 2: Memet­i­cally-Mo­ti­vated Model Extensions

naterush31 May 2022 22:03 UTC
4 points
0 comments4 min readLW link

Machines vs Memes Part 3: Imi­ta­tion and Memes

ceru231 Jun 2022 13:36 UTC
5 points
0 comments7 min readLW link

Paradigms of AI al­ign­ment: com­po­nents and enablers

Vika2 Jun 2022 6:19 UTC
48 points
4 comments8 min readLW link

The Bio An­chors Forecast

Ansh Radhakrishnan2 Jun 2022 1:32 UTC
12 points
0 comments3 min readLW link

[MLSN #4]: Many New In­ter­pretabil­ity Papers, Vir­tual Logit Match­ing, Ra­tion­al­iza­tion Helps Robustness

Dan H3 Jun 2022 1:20 UTC
18 points
0 comments4 min readLW link

The pro­to­typ­i­cal catas­trophic AI ac­tion is get­ting root ac­cess to its datacenter

Buck2 Jun 2022 23:46 UTC
142 points
10 comments2 min readLW link

Ad­ver­sar­ial train­ing, im­por­tance sam­pling, and anti-ad­ver­sar­ial train­ing for AI whistleblowing

Buck2 Jun 2022 23:48 UTC
33 points
0 comments3 min readLW link

Deep Learn­ing Sys­tems Are Not Less In­ter­pretable Than Logic/​Prob­a­bil­ity/​Etc

johnswentworth4 Jun 2022 5:41 UTC
118 points
52 comments2 min readLW link

How to pur­sue a ca­reer in tech­ni­cal AI alignment

Charlie Rogers-Smith4 Jun 2022 21:11 UTC
63 points
0 comments39 min readLW link

Noisy en­vi­ron­ment reg­u­late util­ity maximizers

Niclas Kupper5 Jun 2022 18:48 UTC
4 points
0 comments7 min readLW link

Why agents are powerful

Daniel Kokotajlo6 Jun 2022 1:37 UTC
35 points
7 comments7 min readLW link

Why do some peo­ple try to make AGI?

TekhneMakre6 Jun 2022 9:14 UTC
14 points
7 comments3 min readLW link

Some ideas for fol­low-up pro­jects to Red­wood Re­search’s re­cent paper

JanB6 Jun 2022 13:29 UTC
10 points
0 comments7 min readLW link

Read­ing the ethi­cists 2: Hunt­ing for AI al­ign­ment papers

Charlie Steiner6 Jun 2022 15:49 UTC
21 points
1 comment7 min readLW link

DALL-E 2 - Unoffi­cial Nat­u­ral Lan­guage Image Edit­ing, Art Cri­tique Survey

bakztfuture6 Jun 2022 18:27 UTC
0 points
0 comments1 min readLW link
(bakztfuture.substack.com)

Think­ing about Broad Classes of Utility-like Functions

J Bostock7 Jun 2022 14:05 UTC
7 points
0 comments4 min readLW link

Thoughts on For­mal­iz­ing Composition

Tom Lieberum7 Jun 2022 7:51 UTC
13 points
0 comments7 min readLW link

“Pivotal Acts” means some­thing specific

Raemon7 Jun 2022 21:56 UTC
114 points
23 comments2 min readLW link

Why I don’t be­lieve in doom

mukashi7 Jun 2022 23:49 UTC
6 points
30 comments4 min readLW link

[Question] Has any­one ac­tu­ally tried to con­vince Terry Tao or other top math­e­mat­i­ci­ans to work on al­ign­ment?

P.8 Jun 2022 22:26 UTC
52 points
49 comments4 min readLW link

To­day in AI Risk His­tory: The Ter­mi­na­tor (1984 film) was re­leased.

Impassionata9 Jun 2022 1:32 UTC
−3 points
6 comments1 min readLW link

There’s prob­a­bly a trade­off be­tween AI ca­pa­bil­ity and safety, and we should act like it

David Johnston9 Jun 2022 0:17 UTC
3 points
3 comments1 min readLW link

AI Could Defeat All Of Us Combined

HoldenKarnofsky9 Jun 2022 15:50 UTC
168 points
29 comments17 min readLW link
(www.cold-takes.com)

[Question] If there was a mil­len­nium equiv­a­lent prize for AI al­ign­ment, what would the prob­lems be?

Yair Halberstadt9 Jun 2022 16:56 UTC
17 points
4 comments1 min readLW link

[Linkpost & Dis­cus­sion] AI Trained on 4Chan Be­comes ‘Hate Speech Ma­chine’ [and out­performs GPT-3 on Truth­fulQA Bench­mark?!]

Yitz9 Jun 2022 10:59 UTC
16 points
5 comments2 min readLW link
(www.vice.com)

If no near-term al­ign­ment strat­egy, re­search should aim for the long-term

harsimony9 Jun 2022 19:10 UTC
7 points
1 comment1 min readLW link

How Do Selec­tion The­o­rems Re­late To In­ter­pretabil­ity?

johnswentworth9 Jun 2022 19:39 UTC
57 points
14 comments3 min readLW link

Bureau­cracy of AIs

Logan Zoellner9 Jun 2022 23:03 UTC
11 points
6 comments14 min readLW link

Tao, Kont­se­vich & oth­ers on HLAI in Math

interstice10 Jun 2022 2:25 UTC
41 points
5 comments2 min readLW link
(www.youtube.com)

Open Prob­lems in AI X-Risk [PAIS #5]

10 Jun 2022 2:08 UTC
50 points
3 comments36 min readLW link

[Question] why as­sume AGIs will op­ti­mize for fixed goals?

nostalgebraist10 Jun 2022 1:28 UTC
119 points
52 comments4 min readLW link

Progress Re­port 6: get the tool working

Nathan Helm-Burger10 Jun 2022 11:18 UTC
4 points
0 comments2 min readLW link

Another plau­si­ble sce­nario of AI risk: AI builds mil­i­tary in­fras­truc­ture while col­lab­o­rat­ing with hu­mans, defects later.

avturchin10 Jun 2022 17:24 UTC
10 points
2 comments1 min readLW link

[Question] Is AI Align­ment Im­pos­si­ble?

Heighn10 Jun 2022 10:08 UTC
3 points
3 comments1 min readLW link

How dan­ger­ous is hu­man-level AI?

Alex_Altair10 Jun 2022 17:38 UTC
21 points
4 comments8 min readLW link

[linkpost] The fi­nal AI bench­mark: BIG-bench

RomanS10 Jun 2022 8:53 UTC
30 points
19 comments1 min readLW link

[Question] Could Pa­tent-Trol­ling de­lay AI timelines?

Pablo Repetto10 Jun 2022 2:53 UTC
1 point
3 comments1 min readLW link

How fast can we perform a for­ward pass?

jsteinhardt10 Jun 2022 23:30 UTC
53 points
9 comments15 min readLW link
(bounded-regret.ghost.io)

Steganog­ra­phy and the Cy­cleGAN—al­ign­ment failure case study

Jan Czechowski11 Jun 2022 9:41 UTC
28 points
0 comments4 min readLW link

AGI Safety Com­mu­ni­ca­tions Initiative

ines11 Jun 2022 17:34 UTC
7 points
0 comments1 min readLW link

[Question] How much stupi­der than hu­mans can AI be and still kill us all through sheer num­bers and re­source ac­cess?

Shmi12 Jun 2022 1:01 UTC
11 points
12 comments1 min readLW link

A claim that Google’s LaMDA is sentient

Ben Livengood12 Jun 2022 4:18 UTC
31 points
134 comments1 min readLW link

Let’s not name spe­cific AI labs in an ad­ver­sar­ial context

acylhalide12 Jun 2022 17:38 UTC
8 points
17 comments1 min readLW link

[Question] How much does cy­ber­se­cu­rity re­duce AI risk?

Darmani12 Jun 2022 22:13 UTC
34 points
23 comments1 min readLW link

[Question] How are com­pute as­sets dis­tributed in the world?

Chris van Merwijk12 Jun 2022 22:13 UTC
29 points
7 comments1 min readLW link

The beau­tiful mag­i­cal en­chanted golden Dall-e Mini is underrated

p.b.13 Jun 2022 7:58 UTC
14 points
0 comments1 min readLW link

Why so lit­tle AI risk on ra­tio­nal­ist-ad­ja­cent blogs?

Grant Demaree13 Jun 2022 6:31 UTC
46 points
23 comments8 min readLW link

[Question] What’s the “This AI is of moral con­cern.” fire alarm?

Quintin Pope13 Jun 2022 8:05 UTC
37 points
56 comments2 min readLW link

On A List of Lethalities

Zvi13 Jun 2022 12:30 UTC
154 points
48 comments54 min readLW link
(thezvi.wordpress.com)

[Question] Can you MRI a deep learn­ing model?

Yair Halberstadt13 Jun 2022 13:43 UTC
3 points
3 comments1 min readLW link

What are some smaller-but-con­crete challenges re­lated to AI safety that are im­pact­ing peo­ple to­day?

nonzerosum13 Jun 2022 17:36 UTC
3 points
2 comments1 min readLW link

Con­ti­nu­ity Assumptions

Jan_Kulveit13 Jun 2022 21:31 UTC
26 points
13 comments4 min readLW link

Crypto-fed Computation

aaguirre13 Jun 2022 21:20 UTC
22 points
7 comments7 min readLW link

Blake Richards on Why he is Skep­ti­cal of Ex­is­ten­tial Risk from AI

Michaël Trazzi14 Jun 2022 19:09 UTC
41 points
12 comments4 min readLW link
(theinsideview.ai)

I ap­plied for a MIRI job in 2020. Here’s what hap­pened next.

ViktoriaMalyasova15 Jun 2022 19:37 UTC
78 points
17 comments7 min readLW link

[Question] What are all the AI Align­ment and AI Safety Com­mu­ni­ca­tion Hubs?

Gunnar_Zarncke15 Jun 2022 16:16 UTC
25 points
5 comments1 min readLW link

[Question] Has there been any work on at­tempt­ing to use Pas­cal’s Mug­ging to make an AGI be­have?

Chris_Leong15 Jun 2022 8:33 UTC
7 points
17 comments1 min readLW link

Will vague “AI sen­tience” con­cerns do more for AI safety than any­thing else we might do?

Aryeh Englander14 Jun 2022 23:53 UTC
12 points
1 comment1 min readLW link

“Brain en­thu­si­asts” in AI Safety

18 Jun 2022 9:59 UTC
57 points
5 comments10 min readLW link
(universalprior.substack.com)

FYI: I’m work­ing on a book about the threat of AGI/​ASI for a gen­eral au­di­ence. I hope it will be of value to the cause and the community

Darren McKee15 Jun 2022 18:08 UTC
40 points
17 comments2 min readLW link

A cen­tral AI al­ign­ment prob­lem: ca­pa­bil­ities gen­er­al­iza­tion, and the sharp left turn

So8res15 Jun 2022 13:10 UTC
253 points
48 comments10 min readLW link

AI Risk, as Seen on Snapchat

dkirmani16 Jun 2022 19:31 UTC
23 points
8 comments1 min readLW link

Hu­mans are very re­li­able agents

alyssavance16 Jun 2022 22:02 UTC
248 points
35 comments3 min readLW link

A pos­si­ble AI-in­oc­u­la­tion due to early “robot up­ris­ing”

Shmi16 Jun 2022 21:21 UTC
16 points
2 comments1 min readLW link

A trans­parency and in­ter­pretabil­ity tech tree

evhub16 Jun 2022 23:44 UTC
136 points
10 comments19 min readLW link

Value ex­trap­o­la­tion vs Wireheading

Stuart_Armstrong17 Jun 2022 15:02 UTC
16 points
1 comment1 min readLW link

#SAT with Ten­sor Networks

Adam Jermyn17 Jun 2022 13:20 UTC
4 points
0 comments2 min readLW link

wrap­per-minds are the enemy

nostalgebraist17 Jun 2022 1:58 UTC
92 points
36 comments8 min readLW link

[Question] Is there an unified way to make sense of ai failure modes?

walking_mushroom17 Jun 2022 18:00 UTC
3 points
1 comment1 min readLW link

Quan­tify­ing Gen­eral Intelligence

JasonBrown17 Jun 2022 21:57 UTC
9 points
6 comments13 min readLW link

Pivotal out­comes and pivotal processes

Andrew_Critch17 Jun 2022 23:43 UTC
79 points
32 comments4 min readLW link

Scott Aaron­son is join­ing OpenAI to work on AI safety

peterbarnett18 Jun 2022 4:06 UTC
117 points
31 comments1 min readLW link
(scottaaronson.blog)

Can DALL-E un­der­stand sim­ple ge­om­e­try?

Isaac King18 Jun 2022 4:37 UTC
25 points
2 comments1 min readLW link

Spe­cific prob­lems with spe­cific an­i­mal com­par­i­sons for AI policy

trevor19 Jun 2022 1:27 UTC
3 points
1 comment2 min readLW link

Agent level parallelism

Johannes C. Mayer18 Jun 2022 20:56 UTC
6 points
5 comments1 min readLW link

[Link-post] On Defer­ence and Yud­kowsky’s AI Risk Estimates

bmg19 Jun 2022 17:25 UTC
27 points
7 comments1 min readLW link

Where I agree and dis­agree with Eliezer

paulfchristiano19 Jun 2022 19:15 UTC
777 points
205 comments20 min readLW link

Let’s See You Write That Cor­rigi­bil­ity Tag

Eliezer Yudkowsky19 Jun 2022 21:11 UTC
109 points
67 comments1 min readLW link

Are we there yet?

theflowerpot20 Jun 2022 11:19 UTC
2 points
2 comments1 min readLW link

On cor­rigi­bil­ity and its basin

Donald Hobson20 Jun 2022 16:33 UTC
16 points
3 comments2 min readLW link

Parable: The Bomb that doesn’t Explode

Lone Pine20 Jun 2022 16:41 UTC
14 points
5 comments2 min readLW link

Key Papers in Lan­guage Model Safety

aogara20 Jun 2022 15:00 UTC
37 points
1 comment22 min readLW link

Sur­vey re AIS/​LTism office in NYC

RyanCarey20 Jun 2022 19:21 UTC
7 points
0 comments1 min readLW link

An AI defense-offense sym­me­try thesis

Chris van Merwijk20 Jun 2022 10:01 UTC
10 points
9 comments3 min readLW link

[Question] How easy/​fast is it for a AGI to hack com­put­ers/​a hu­man brain?

Noosphere8921 Jun 2022 0:34 UTC
0 points
1 comment1 min readLW link

A Toy Model of Gra­di­ent Hacking

Oam Patel20 Jun 2022 22:01 UTC
25 points
7 comments4 min readLW link

De­bat­ing Whether AI is Con­scious Is A Dis­trac­tion from Real Problems

sidhe_they21 Jun 2022 16:56 UTC
4 points
10 comments1 min readLW link
(techpolicy.press)

The in­or­di­nately slow spread of good AGI con­ver­sa­tions in ML

Rob Bensinger21 Jun 2022 16:09 UTC
160 points
66 comments8 min readLW link

[Question] What is the differ­ence be­tween AI mis­al­ign­ment and bad pro­gram­ming?

puzzleGuzzle21 Jun 2022 21:52 UTC
6 points
2 comments1 min readLW link

Se­cu­rity Mind­set: Les­sons from 20+ years of Soft­ware Se­cu­rity Failures Rele­vant to AGI Alignment

elspood21 Jun 2022 23:55 UTC
331 points
40 comments7 min readLW link

A Quick List of Some Prob­lems in AI Align­ment As A Field

Nicholas / Heather Kross21 Jun 2022 23:23 UTC
74 points
12 comments6 min readLW link
(www.thinkingmuchbetter.com)

Con­fu­sion about neu­ro­science/​cog­ni­tive sci­ence as a dan­ger for AI Alignment

Samuel Nellessen22 Jun 2022 17:59 UTC
2 points
1 comment3 min readLW link
(snellessen.com)

Air Con­di­tioner Test Re­sults & Discussion

johnswentworth22 Jun 2022 22:26 UTC
80 points
38 comments6 min readLW link

Loose thoughts on AGI risk

Yitz23 Jun 2022 1:02 UTC
7 points
3 comments1 min readLW link

[Question] What’s the con­tin­gency plan if we get AGI to­mor­row?

Yitz23 Jun 2022 3:10 UTC
61 points
24 comments1 min readLW link

[Question] What are the best “policy” ap­proaches in wor­lds where al­ign­ment is difficult?

LHA23 Jun 2022 1:53 UTC
1 point
0 comments1 min readLW link

[Question] Is CIRL a promis­ing agenda?

Chris_Leong23 Jun 2022 17:12 UTC
25 points
12 comments1 min readLW link

Half-baked AI Safety ideas thread

Aryeh Englander23 Jun 2022 16:11 UTC
58 points
60 comments1 min readLW link

20 Cri­tiques of AI Safety That I Found on Twitter

dkirmani23 Jun 2022 19:23 UTC
21 points
16 comments1 min readLW link

Linkpost: Robin Han­son—Why Not Wait On AI Risk?

Yair Halberstadt24 Jun 2022 14:23 UTC
41 points
14 comments1 min readLW link
(www.overcomingbias.com)

Raphaël Millière on Gen­er­al­iza­tion and Scal­ing Maximalism

Michaël Trazzi24 Jun 2022 18:18 UTC
21 points
2 comments4 min readLW link
(theinsideview.ai)

[Question] Do al­ign­ment con­cerns ex­tend to pow­er­ful non-AI agents?

Ozyrus24 Jun 2022 18:26 UTC
21 points
13 comments1 min readLW link

Depen­den­cies for AGI pessimism

Yitz24 Jun 2022 22:25 UTC
6 points
4 comments1 min readLW link

What if the best path for a per­son who wants to work on AGI al­ign­ment is to join Face­book or Google?

dbasch24 Jun 2022 21:23 UTC
2 points
3 comments1 min readLW link

[Link] Ad­ver­sar­i­ally trained neu­ral rep­re­sen­ta­tions may already be as ro­bust as cor­re­spond­ing biolog­i­cal neu­ral representations

Gunnar_Zarncke24 Jun 2022 20:51 UTC
35 points
9 comments1 min readLW link

AI-Writ­ten Cri­tiques Help Hu­mans No­tice Flaws

paulfchristiano25 Jun 2022 17:22 UTC
133 points
5 comments3 min readLW link
(openai.com)

[LQ] Some Thoughts on Mes­sag­ing Around AI Risk

DragonGod25 Jun 2022 13:53 UTC
5 points
3 comments6 min readLW link

[Question] Should any hu­man en­slave an AGI sys­tem?

AlignmentMirror25 Jun 2022 19:35 UTC
−13 points
44 comments1 min readLW link

The Ba­sics of AGI Policy (Flowchart)

trevor26 Jun 2022 2:01 UTC
18 points
8 comments2 min readLW link

Slow mo­tion videos as AI risk in­tu­ition pumps

Andrew_Critch14 Jun 2022 19:31 UTC
209 points
36 comments2 min readLW link

Robin Han­son asks “Why Not Wait On AI Risk?”

Gunnar_Zarncke26 Jun 2022 23:32 UTC
22 points
4 comments1 min readLW link
(www.overcomingbias.com)

Epistemic mod­esty and how I think about AI risk

Aryeh Englander27 Jun 2022 18:47 UTC
22 points
4 comments4 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
166 points
14 comments7 min readLW link

Scott Aaron­son and Steven Pinker De­bate AI Scaling

Liron28 Jun 2022 16:04 UTC
37 points
10 comments1 min readLW link
(scottaaronson.blog)

Four rea­sons I find AI safety emo­tion­ally compelling

28 Jun 2022 14:10 UTC
38 points
3 comments4 min readLW link

Some al­ter­na­tive AI safety re­search projects

Michele Campolo28 Jun 2022 14:09 UTC
9 points
0 comments3 min readLW link

Assess­ing AlephAlphas Mul­ti­modal Model

p.b.28 Jun 2022 9:28 UTC
30 points
5 comments3 min readLW link

Kurzge­sagt – The Last Hu­man (Youtube)

habryka29 Jun 2022 3:28 UTC
54 points
7 comments1 min readLW link
(www.youtube.com)

Can We Align AI by Hav­ing It Learn Hu­man Prefer­ences? I’m Scared (sum­mary of last third of Hu­man Com­pat­i­ble)

apollonianblues29 Jun 2022 4:09 UTC
19 points
3 comments6 min readLW link

Look­ing back on my al­ign­ment PhD

TurnTrout1 Jul 2022 3:19 UTC
287 points
60 comments11 min readLW link

Will Ca­pa­bil­ities Gen­er­al­ise More?

Ramana Kumar29 Jun 2022 17:12 UTC
109 points
38 comments4 min readLW link

Gra­di­ent hack­ing: defi­ni­tions and examples

Richard_Ngo29 Jun 2022 21:35 UTC
24 points
1 comment5 min readLW link

[Question] Cor­rect­ing hu­man er­ror vs do­ing ex­actly what you’re told—is there liter­a­ture on this in con­text of gen­eral sys­tem de­sign?

Jan Czechowski29 Jun 2022 21:30 UTC
6 points
0 comments1 min readLW link

Most Func­tions Have Un­de­sir­able Global Extrema

En Kepeig30 Jun 2022 17:10 UTC
8 points
5 comments3 min readLW link

$500 bounty for al­ign­ment con­test ideas

Akash30 Jun 2022 1:56 UTC
29 points
5 comments2 min readLW link

Quick sur­vey on AI al­ign­ment resources

frances_lorenz30 Jun 2022 19:09 UTC
14 points
0 comments1 min readLW link

[Linkpost] Solv­ing Quan­ti­ta­tive Rea­son­ing Prob­lems with Lan­guage Models

Yitz30 Jun 2022 18:58 UTC
76 points
15 comments2 min readLW link
(storage.googleapis.com)

GPT-3 Catch­ing Fish in Morse Code

Megan Kinniment30 Jun 2022 21:22 UTC
110 points
27 comments8 min readLW link

Selec­tion pro­cesses for subagents

Ryan Kidd30 Jun 2022 23:57 UTC
33 points
2 comments9 min readLW link

AI safety uni­ver­sity groups: a promis­ing op­por­tu­nity to re­duce ex­is­ten­tial risk

mic1 Jul 2022 3:59 UTC
13 points
0 comments11 min readLW link

Safetywashing

Adam Scholl1 Jul 2022 11:56 UTC
212 points
17 comments1 min readLW link

[Question] AGI al­ign­ment with what?

AlignmentMirror1 Jul 2022 10:22 UTC
6 points
10 comments1 min readLW link

What Is The True Name of Mo­du­lar­ity?

1 Jul 2022 14:55 UTC
21 points
10 comments12 min readLW link

AXRP Epi­sode 16 - Prepar­ing for De­bate AI with Ge­offrey Irving

DanielFilan1 Jul 2022 22:20 UTC
14 points
0 comments37 min readLW link

Agenty AGI – How Tempt­ing?

PeterMcCluskey1 Jul 2022 23:40 UTC
21 points
3 comments5 min readLW link
(www.bayesianinvestor.com)

[Linkpost] Ex­is­ten­tial Risk Anal­y­sis in Em­piri­cal Re­search Papers

Dan H2 Jul 2022 0:09 UTC
40 points
0 comments1 min readLW link
(arxiv.org)

Minerva

Algon1 Jul 2022 20:06 UTC
35 points
6 comments2 min readLW link
(ai.googleblog.com)

Could an AI Align­ment Sand­box be use­ful?

Michael Soareverix2 Jul 2022 5:06 UTC
2 points
1 comment1 min readLW link

Goal-di­rect­ed­ness: tack­ling complexity

Morgan_Rogers2 Jul 2022 13:51 UTC
8 points
0 comments38 min readLW link

[Question] Which one of these two aca­demic routes should I take to end up in AI Safety?

Martín Soto3 Jul 2022 1:05 UTC
5 points
2 comments1 min readLW link

Won­der and The Golden AI Rule

JeffreyK3 Jul 2022 18:21 UTC
0 points
4 comments6 min readLW link

De­ci­sion the­ory and dy­namic inconsistency

paulfchristiano3 Jul 2022 22:20 UTC
66 points
33 comments10 min readLW link
(sideways-view.com)

AI Fore­cast­ing: One Year In

jsteinhardt4 Jul 2022 5:10 UTC
131 points
12 comments6 min readLW link
(bounded-regret.ghost.io)

Re­mak­ing Effi­cien­tZero (as best I can)

Hoagy4 Jul 2022 11:03 UTC
34 points
9 comments22 min readLW link

Please help us com­mu­ni­cate AI xrisk. It could save the world.

otto.barten4 Jul 2022 21:47 UTC
4 points
7 comments2 min readLW link

Bench­mark for suc­cess­ful con­cept ex­trap­o­la­tion/​avoid­ing goal misgeneralization

Stuart_Armstrong4 Jul 2022 20:48 UTC
80 points
12 comments4 min readLW link

An­thropic’s SoLU (Soft­max Lin­ear Unit)

Joel Burget4 Jul 2022 18:38 UTC
15 points
1 comment4 min readLW link
(transformer-circuits.pub)

[AN #172] Sorry for the long hi­a­tus!

Rohin Shah5 Jul 2022 6:20 UTC
54 points
0 comments3 min readLW link
(mailchi.mp)

Prin­ci­ples for Align­ment/​Agency Projects

johnswentworth7 Jul 2022 2:07 UTC
115 points
20 comments4 min readLW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
49 points
15 comments8 min readLW link

Con­fu­sions in My Model of AI Risk

peterbarnett7 Jul 2022 1:05 UTC
21 points
9 comments5 min readLW link

Safety con­sid­er­a­tions for on­line gen­er­a­tive modeling

Sam Marks7 Jul 2022 18:31 UTC
41 points
9 comments14 min readLW link

Re­in­force­ment Learner Wireheading

Nate Showell8 Jul 2022 5:32 UTC
8 points
2 comments4 min readLW link

MATS Models

johnswentworth9 Jul 2022 0:14 UTC
84 points
5 comments16 min readLW link

Train first VS prune first in neu­ral net­works.

Donald Hobson9 Jul 2022 15:53 UTC
20 points
5 comments2 min readLW link

Re­search Notes: What are we al­ign­ing for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC
19 points
8 comments2 min readLW link

Re­port from a civ­i­liza­tional ob­server on Earth

owencb9 Jul 2022 17:26 UTC
49 points
12 comments6 min readLW link

Vi­su­al­iz­ing Neu­ral net­works, how to blame the bias

Donald Hobson9 Jul 2022 15:52 UTC
7 points
1 comment6 min readLW link

Com­ment on “Propo­si­tions Con­cern­ing Digi­tal Minds and So­ciety”

Zack_M_Davis10 Jul 2022 5:48 UTC
95 points
12 comments8 min readLW link

Hes­sian and Basin volume

Vivek Hebbar10 Jul 2022 6:59 UTC
33 points
9 comments4 min readLW link

Check­sum Sen­sor Alignment

lsusr11 Jul 2022 3:31 UTC
12 points
2 comments1 min readLW link

The Align­ment Problem

lsusr11 Jul 2022 3:03 UTC
45 points
20 comments3 min readLW link

[Question] How do AI timelines af­fect how you live your life?

Quadratic Reciprocity11 Jul 2022 13:54 UTC
77 points
47 comments1 min readLW link

Three Min­i­mum Pivotal Acts Pos­si­ble by Nar­row AI

Michael Soareverix12 Jul 2022 9:51 UTC
0 points
4 comments2 min readLW link

On how var­i­ous plans miss the hard bits of the al­ign­ment challenge

So8res12 Jul 2022 2:49 UTC
258 points
81 comments29 min readLW link

[Question] What is wrong with this ap­proach to cor­rigi­bil­ity?

Rafael Cosman12 Jul 2022 22:55 UTC
7 points
8 comments1 min readLW link

MIRI Con­ver­sa­tions: Tech­nol­ogy Fore­cast­ing & Grad­u­al­ism (Distil­la­tion)

CallumMcDougall13 Jul 2022 15:55 UTC
31 points
1 comment20 min readLW link

[Question] Which AI Safety re­search agen­das are the most promis­ing?

Chris_Leong13 Jul 2022 7:54 UTC
27 points
6 comments1 min readLW link

Deep learn­ing cur­ricu­lum for large lan­guage model alignment

Jacob_Hilton13 Jul 2022 21:58 UTC
53 points
3 comments1 min readLW link
(github.com)

Ar­tifi­cial Sand­wich­ing: When can we test scal­able al­ign­ment pro­to­cols with­out hu­mans?

Sam Bowman13 Jul 2022 21:14 UTC
40 points
6 comments5 min readLW link

[Question] How to im­press stu­dents with re­cent ad­vances in ML?

Charbel-Raphaël14 Jul 2022 0:03 UTC
12 points
2 comments1 min readLW link

Cir­cum­vent­ing in­ter­pretabil­ity: How to defeat mind-readers

Lee Sharkey14 Jul 2022 16:59 UTC
94 points
8 comments36 min readLW link

Mus­ings on the Hu­man Ob­jec­tive Function

Michael Soareverix15 Jul 2022 7:13 UTC
3 points
0 comments3 min readLW link

Peter Singer’s first pub­lished piece on AI

Fai15 Jul 2022 6:18 UTC
20 points
5 comments1 min readLW link
(link.springer.com)

Notes on Learn­ing the Prior

carboniferous_umbraculum 15 Jul 2022 17:28 UTC
21 points
2 comments25 min readLW link

Pro­posed Orthog­o­nal­ity Th­e­ses #2-5

rjbg14 Jul 2022 22:59 UTC
6 points
0 comments2 min readLW link

A story about a du­plic­i­tous API

LiLiLi15 Jul 2022 18:26 UTC
2 points
0 comments1 min readLW link

Safety Im­pli­ca­tions of LeCun’s path to ma­chine intelligence

Ivan Vendrov15 Jul 2022 21:47 UTC
89 points
16 comments6 min readLW link

QNR Prospects

PeterMcCluskey16 Jul 2022 2:03 UTC
38 points
3 comments8 min readLW link
(www.bayesianinvestor.com)

All AGI safety ques­tions wel­come (es­pe­cially ba­sic ones) [July 2022]

16 Jul 2022 12:57 UTC
84 points
130 comments3 min readLW link

Align­ment as Game Design

Shoshannah Tekofsky16 Jul 2022 22:36 UTC
11 points
7 comments2 min readLW link

Why I Think Abrupt AI Takeoff

lincolnquirk17 Jul 2022 17:04 UTC
14 points
6 comments1 min readLW link

Why you might ex­pect ho­mo­ge­neous take-off: ev­i­dence from ML research

Andrei Alexandru17 Jul 2022 20:31 UTC
24 points
0 comments10 min readLW link

What should you change in re­sponse to an “emer­gency”? And AI risk

AnnaSalamon18 Jul 2022 1:11 UTC
303 points
60 comments6 min readLW link

Quan­tiliz­ers and Gen­er­a­tive Models

Adam Jermyn18 Jul 2022 16:32 UTC
24 points
5 comments4 min readLW link

Train­ing goals for large lan­guage models

Johannes Treutlein18 Jul 2022 7:09 UTC
26 points
5 comments19 min readLW link

Ma­chine Learn­ing Model Sizes and the Pa­ram­e­ter Gap [abridged]

Pablo Villalobos18 Jul 2022 16:51 UTC
20 points
0 comments1 min readLW link
(epochai.org)

Without spe­cific coun­ter­mea­sures, the eas­iest path to trans­for­ma­tive AI likely leads to AI takeover

Ajeya Cotra18 Jul 2022 19:06 UTC
310 points
89 comments84 min readLW link

At what point will we know if Eliezer’s pre­dic­tions are right or wrong?

anonymous12345618 Jul 2022 22:06 UTC
5 points
6 comments1 min readLW link

A daily rou­tine I do for my AI safety re­search work

scasper19 Jul 2022 21:58 UTC
15 points
7 comments1 min readLW link

Pit­falls with Proofs

scasper19 Jul 2022 22:21 UTC
19 points
21 comments8 min readLW link

Which sin­gu­lar­ity schools plus the no sin­gu­lar­ity school was right?

Noosphere8923 Jul 2022 15:16 UTC
9 points
27 comments9 min readLW link

Defin­ing Op­ti­miza­tion in a Deeper Way Part 3

J Bostock20 Jul 2022 22:06 UTC
8 points
0 comments2 min readLW link

[AN #173] Re­cent lan­guage model re­sults from DeepMind

Rohin Shah21 Jul 2022 2:30 UTC
37 points
9 comments8 min readLW link
(mailchi.mp)

[Question] How much to op­ti­mize for the short-timelines sce­nario?

SoerenMind21 Jul 2022 10:47 UTC
19 points
3 comments1 min readLW link

Mak­ing DALL-E Count

DirectedEvolution22 Jul 2022 9:11 UTC
23 points
12 comments4 min readLW link

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam Jermyn21 Jul 2022 20:33 UTC
16 points
4 comments8 min readLW link

Gen­eral al­ign­ment properties

TurnTrout8 Aug 2022 23:40 UTC
46 points
2 comments1 min readLW link

Which val­ues are sta­ble un­der on­tol­ogy shifts?

Richard_Ngo23 Jul 2022 2:40 UTC
68 points
47 comments3 min readLW link
(thinkingcomplete.blogspot.com)

Try­ing out Prompt Eng­ineer­ing on TruthfulQA

Megan Kinniment23 Jul 2022 2:04 UTC
10 points
0 comments8 min readLW link

Sym­bolic dis­til­la­tion, Diffu­sion, En­tropy, Repli­ca­tors, Agents, oh my (a mid-low qual­ity think­ing out loud post)

the gears to ascension23 Jul 2022 21:13 UTC
2 points
2 comments6 min readLW link

Eaves­drop­ping on Aliens: A Data De­cod­ing Challenge

anonymousaisafety24 Jul 2022 4:35 UTC
44 points
9 comments4 min readLW link

How much should we worry about mesa-op­ti­miza­tion challenges?

sudo25 Jul 2022 3:56 UTC
4 points
13 comments2 min readLW link

[Question] Does agent foun­da­tions cover all fu­ture ML sys­tems?

Jonas Hallgren25 Jul 2022 1:17 UTC
2 points
0 comments1 min readLW link

[Question] How op­ti­mistic should we be about AI figur­ing out how to in­ter­pret it­self?

oh5432125 Jul 2022 22:09 UTC
3 points
1 comment1 min readLW link

Ac­tive In­fer­ence as a for­mal­i­sa­tion of in­stru­men­tal convergence

Roman Leventov26 Jul 2022 17:55 UTC
6 points
2 comments3 min readLW link
(direct.mit.edu)

«Boundaries» Se­quence (In­dex Post)

Andrew_Critch26 Jul 2022 19:12 UTC
23 points
1 comment1 min readLW link

Mo­ral strate­gies at differ­ent ca­pa­bil­ity levels

Richard_Ngo27 Jul 2022 18:50 UTC
95 points
14 comments5 min readLW link
(thinkingcomplete.blogspot.com)

Prin­ci­ples of Pri­vacy for Align­ment Research

johnswentworth27 Jul 2022 19:53 UTC
68 points
30 comments7 min readLW link

Seek­ing beta read­ers who are ig­no­rant of biol­ogy but knowl­edge­able about AI safety

Holly_Elmore27 Jul 2022 23:02 UTC
10 points
6 comments1 min readLW link

Defin­ing Op­ti­miza­tion in a Deeper Way Part 4

J Bostock28 Jul 2022 17:02 UTC
7 points
0 comments5 min readLW link

An­nounc­ing the AI Safety Field Build­ing Hub, a new effort to provide AISFB pro­jects, men­tor­ship, and funding

Vael Gates28 Jul 2022 21:29 UTC
49 points
3 comments6 min readLW link

Distil­la­tion Con­test—Re­sults and Recap

Aris29 Jul 2022 17:40 UTC
33 points
0 comments7 min readLW link

Ab­stract­ing The Hard­ness of Align­ment: Un­bounded Atomic Optimization

adamShimi29 Jul 2022 18:59 UTC
62 points
3 comments16 min readLW link

How trans­parency changed over time

ViktoriaMalyasova30 Jul 2022 4:36 UTC
21 points
0 comments6 min readLW link

Trans­lat­ing be­tween La­tent Spaces

30 Jul 2022 3:25 UTC
20 points
1 comment8 min readLW link

AGI-level rea­soner will ap­pear sooner than an agent; what the hu­man­ity will do with this rea­soner is critical

Roman Leventov30 Jul 2022 20:56 UTC
24 points
10 comments1 min readLW link

chin­chilla’s wild implications

nostalgebraist31 Jul 2022 1:18 UTC
366 points
114 comments11 min readLW link

Tech­ni­cal AI Align­ment Study Group

Eric K1 Aug 2022 18:33 UTC
5 points
0 comments1 min readLW link

[Question] Which in­tro-to-AI-risk text would you recom­mend to...

Sherrinford1 Aug 2022 9:36 UTC
12 points
1 comment1 min readLW link

Two-year up­date on my per­sonal AI timelines

Ajeya Cotra2 Aug 2022 23:07 UTC
287 points
60 comments16 min readLW link

What are the Red Flags for Neu­ral Net­work Suffer­ing? - Seeds of Science call for reviewers

rogersbacon2 Aug 2022 22:37 UTC
24 points
5 comments1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhub3 Aug 2022 22:56 UTC
18 points
0 comments14 min readLW link

Sur­vey: What (de)mo­ti­vates you about AI risk?

Daniel_Friedrich3 Aug 2022 19:17 UTC
1 point
0 comments1 min readLW link
(forms.gle)

High Reli­a­bil­ity Orgs, and AI Companies

Raemon4 Aug 2022 5:45 UTC
73 points
6 comments12 min readLW link

In­ter­pretabil­ity isn’t Free

Joel Burget4 Aug 2022 15:02 UTC
10 points
1 comment2 min readLW link

[Question] AI al­ign­ment: Would a lazy self-preser­va­tion in­stinct be suffi­cient?

BrainFrog4 Aug 2022 17:53 UTC
−1 points
4 comments1 min readLW link

[Question] What drives progress, the­ory or ap­pli­ca­tion?

lberglund5 Aug 2022 1:14 UTC
5 points
1 comment1 min readLW link

The Prag­mas­cope Idea

johnswentworth4 Aug 2022 21:52 UTC
55 points
19 comments3 min readLW link

$20K In Boun­ties for AI Safety Public Materials

5 Aug 2022 2:52 UTC
68 points
7 comments6 min readLW link

Rant on Prob­lem Fac­tor­iza­tion for Alignment

johnswentworth5 Aug 2022 19:23 UTC
73 points
48 comments6 min readLW link

An­nounc­ing the In­tro­duc­tion to ML Safety course

6 Aug 2022 2:46 UTC
69 points
6 comments7 min readLW link

Why I Am Skep­ti­cal of AI Reg­u­la­tion as an X-Risk Miti­ga­tion Strategy

A Ray6 Aug 2022 5:46 UTC
31 points
14 comments2 min readLW link

My ad­vice on find­ing your own path

A Ray6 Aug 2022 4:57 UTC
34 points
3 comments3 min readLW link

A De­cep­tively Sim­ple Ar­gu­ment in fa­vor of Prob­lem Factorization

Logan Zoellner6 Aug 2022 17:32 UTC
3 points
4 comments1 min readLW link

[Question] Can we get full au­dio for Eliezer’s con­ver­sa­tion with Sam Har­ris?

JakubK7 Aug 2022 20:35 UTC
30 points
8 comments1 min readLW link

How Deadly Will Roughly-Hu­man-Level AGI Be?

David Udell8 Aug 2022 1:59 UTC
12 points
6 comments1 min readLW link

Broad Bas­ins and Data Compression

8 Aug 2022 20:33 UTC
29 points
6 comments7 min readLW link

En­cul­tured AI Pre-plan­ning, Part 1: En­abling New Benchmarks

8 Aug 2022 22:44 UTC
62 points
2 comments6 min readLW link

En­cul­tured AI, Part 1 Ap­pendix: Rele­vant Re­search Examples

8 Aug 2022 22:44 UTC
11 points
1 comment7 min readLW link

Disagree­ments about Align­ment: Why, and how, we should try to solve them

ojorgensen9 Aug 2022 18:49 UTC
8 points
1 comment16 min readLW link

[Question] Many Gods re­fu­ta­tion and In­stru­men­tal Goals. (Proper one)

aditya malik9 Aug 2022 11:59 UTC
0 points
15 comments1 min readLW link

[Question] Is it pos­si­ble to find ven­ture cap­i­tal for AI re­search org with strong safety fo­cus?

AnonResearch9 Aug 2022 16:12 UTC
6 points
1 comment1 min readLW link

Us­ing GPT-3 to aug­ment hu­man intelligence

Henrik Karlsson10 Aug 2022 15:54 UTC
48 points
7 comments18 min readLW link
(escapingflatland.substack.com)

Emer­gent Abil­ities of Large Lan­guage Models [Linkpost]

aogara10 Aug 2022 18:02 UTC
25 points
2 comments1 min readLW link
(arxiv.org)

How Do We Align an AGI Without Get­ting So­cially Eng­ineered? (Hint: Box It)

10 Aug 2022 18:14 UTC
26 points
30 comments11 min readLW link

The al­ign­ment prob­lem from a deep learn­ing perspective

Richard_Ngo10 Aug 2022 22:46 UTC
93 points
13 comments27 min readLW link

How much al­ign­ment data will we need in the long run?

Jacob_Hilton10 Aug 2022 21:39 UTC
34 points
15 comments4 min readLW link

Thoughts on the good reg­u­la­tor theorem

JonasMoss11 Aug 2022 12:08 UTC
8 points
0 comments4 min readLW link

Lan­guage mod­els seem to be much bet­ter than hu­mans at next-to­ken prediction

11 Aug 2022 17:45 UTC
164 points
56 comments13 min readLW link

[Question] Se­ri­ously, what goes wrong with “re­ward the agent when it makes you smile”?

TurnTrout11 Aug 2022 22:22 UTC
76 points
41 comments2 min readLW link

Dis­sected boxed AI

Nathan112312 Aug 2022 2:37 UTC
−8 points
2 comments1 min readLW link

Steelmin­ing via Analogy

Paul Bricman13 Aug 2022 9:59 UTC
24 points
0 comments2 min readLW link
(paulbricman.com)

Refin­ing the Sharp Left Turn threat model, part 1: claims and mechanisms

12 Aug 2022 15:17 UTC
71 points
3 comments3 min readLW link
(vkrakovna.wordpress.com)

Over­sight Misses 100% of Thoughts The AI Does Not Think

johnswentworth12 Aug 2022 16:30 UTC
85 points
49 comments1 min readLW link

Timelines ex­pla­na­tion post part 1 of ?

Nathan Helm-Burger12 Aug 2022 16:13 UTC
10 points
1 comment2 min readLW link

A lit­tle play­ing around with Blen­der­bot3

Nathan Helm-Burger12 Aug 2022 16:06 UTC
9 points
0 comments1 min readLW link

Deep­Mind al­ign­ment team opinions on AGI ruin arguments

Vika12 Aug 2022 21:06 UTC
364 points
34 comments14 min readLW link

the In­su­lated Goal-Pro­gram idea

Tamsin Leake13 Aug 2022 9:57 UTC
39 points
3 comments2 min readLW link
(carado.moe)

goal-pro­gram bricks

Tamsin Leake13 Aug 2022 10:08 UTC
27 points
2 comments2 min readLW link
(carado.moe)

How I think about alignment

Linda Linsefors13 Aug 2022 10:01 UTC
30 points
11 comments5 min readLW link

Refine’s First Blog Post Day

adamShimi13 Aug 2022 10:23 UTC
55 points
3 comments1 min readLW link

Shapes of Mind and Plu­ral­ism in Alignment

adamShimi13 Aug 2022 10:01 UTC
30 points
1 comment2 min readLW link

An ex­tended rocket al­ign­ment analogy

remember13 Aug 2022 18:22 UTC
25 points
3 comments4 min readLW link

Cul­ti­vat­ing Valiance

Shoshannah Tekofsky13 Aug 2022 18:47 UTC
35 points
4 comments4 min readLW link

Evolu­tion is a bad anal­ogy for AGI: in­ner alignment

Quintin Pope13 Aug 2022 22:15 UTC
52 points
6 comments8 min readLW link

A brief note on Sim­plic­ity Bias

carboniferous_umbraculum 14 Aug 2022 2:05 UTC
16 points
0 comments4 min readLW link

Seek­ing In­terns/​RAs for Mechanis­tic In­ter­pretabil­ity Projects

Neel Nanda15 Aug 2022 7:11 UTC
61 points
0 comments2 min readLW link

Ex­treme Security

lc15 Aug 2022 12:11 UTC
39 points
4 comments5 min readLW link

On Prefer­ence Ma­nipu­la­tion in Re­ward Learn­ing Processes

Felix Hofstätter15 Aug 2022 19:32 UTC
8 points
0 comments4 min readLW link

Limits of Ask­ing ELK if Models are Deceptive

Oam Patel15 Aug 2022 20:44 UTC
6 points
2 comments4 min readLW link

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

16 Aug 2022 2:09 UTC
17 points
2 comments16 min readLW link

De­cep­tion as the op­ti­mal: mesa-op­ti­miz­ers and in­ner al­ign­ment

Eleni Angelou16 Aug 2022 4:49 UTC
10 points
0 comments5 min readLW link

Un­der­stand­ing differ­ences be­tween hu­mans and in­tel­li­gence-in-gen­eral to build safe AGI

Florian_Dietz16 Aug 2022 8:27 UTC
7 points
8 comments1 min readLW link

Au­ton­omy as tak­ing re­spon­si­bil­ity for refer­ence maintenance

Ramana Kumar17 Aug 2022 12:50 UTC
52 points
3 comments5 min readLW link

Thoughts on ‘List of Lethal­ities’

Alex Lawsen 17 Aug 2022 18:33 UTC
25 points
0 comments10 min readLW link

Hu­man Mimicry Mainly Works When We’re Already Close

johnswentworth17 Aug 2022 18:41 UTC
68 points
16 comments5 min readLW link

The Core of the Align­ment Prob­lem is...

17 Aug 2022 20:07 UTC
58 points
10 comments9 min readLW link

Con­crete Ad­vice for Form­ing In­side Views on AI Safety

Neel Nanda17 Aug 2022 22:02 UTC
18 points
6 comments10 min readLW link

An­nounc­ing En­cul­tured AI: Build­ing a Video Game

18 Aug 2022 2:16 UTC
103 points
26 comments4 min readLW link

An­nounc­ing the Distil­la­tion for Align­ment Practicum (DAP)

18 Aug 2022 19:50 UTC
21 points
3 comments3 min readLW link

Align­ment’s phlo­gis­ton

Eleni Angelou18 Aug 2022 22:27 UTC
10 points
2 comments2 min readLW link

[Question] Are lan­guage mod­els close to the su­per­hu­man level in philos­o­phy?

Roman Leventov19 Aug 2022 4:43 UTC
5 points
2 comments2 min readLW link

How to do the­o­ret­i­cal re­search, a per­sonal perspective

Mark Xu19 Aug 2022 19:41 UTC
84 points
4 comments15 min readLW link

Refine’s Se­cond Blog Post Day

adamShimi20 Aug 2022 13:01 UTC
19 points
0 comments1 min readLW link

No One-Size-Fit-All Epistemic Strategy

adamShimi20 Aug 2022 12:56 UTC
23 points
1 comment2 min readLW link

Re­duc­ing Good­hart: An­nounce­ment, Ex­ec­u­tive Summary

Charlie Steiner20 Aug 2022 9:49 UTC
14 points
0 comments1 min readLW link

Pivotal acts us­ing an un­al­igned AGI?

Simon Fischer21 Aug 2022 17:13 UTC
26 points
3 comments8 min readLW link

Beyond Hyperanthropomorphism

PointlessOne21 Aug 2022 17:55 UTC
3 points
17 comments1 min readLW link
(studio.ribbonfarm.com)

AXRP Epi­sode 17 - Train­ing for Very High Reli­a­bil­ity with Daniel Ziegler

DanielFilan21 Aug 2022 23:50 UTC
16 points
0 comments34 min readLW link

[Question] What if we solve AI Safety but no one cares

14285722 Aug 2022 5:38 UTC
18 points
5 comments1 min readLW link

Find­ing Goals in the World Model

22 Aug 2022 18:06 UTC
55 points
8 comments13 min readLW link

[Question] AI Box Ex­per­i­ment: Are peo­ple still in­ter­ested?

Double31 Aug 2022 3:04 UTC
31 points
13 comments1 min readLW link

Stable Diffu­sion has been released

P.22 Aug 2022 19:42 UTC
15 points
7 comments1 min readLW link
(stability.ai)

Dis­cus­sion on uti­liz­ing AI for alignment

elifland23 Aug 2022 2:36 UTC
16 points
3 comments1 min readLW link
(www.foxy-scout.com)

It Looks Like You’re Try­ing To Take Over The Narrative

George3d624 Aug 2022 13:36 UTC
2 points
20 comments9 min readLW link
(www.epistem.ink)

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Vingean Agency

abramdemski24 Aug 2022 20:08 UTC
57 points
13 comments3 min readLW link

In­ter­species diplo­macy as a po­ten­tially pro­duc­tive lens on AGI alignment

Shariq Hashme24 Aug 2022 17:59 UTC
5 points
1 comment2 min readLW link

OpenAI’s Align­ment Plans

dkirmani24 Aug 2022 19:39 UTC
60 points
17 comments5 min readLW link
(openai.com)

What Makes A Good Mea­sure­ment De­vice?

johnswentworth24 Aug 2022 22:45 UTC
35 points
7 comments2 min readLW link

Eval­u­at­ing OpenAI’s al­ign­ment plans us­ing train­ing stories

ojorgensen25 Aug 2022 16:12 UTC
3 points
0 comments5 min readLW link

A Test for Lan­guage Model Consciousness

Ethan Perez25 Aug 2022 19:41 UTC
18 points
14 comments10 min readLW link

Seek­ing Stu­dent Sub­mis­sions: Edit Your Source Code Contest

Aris26 Aug 2022 2:08 UTC
28 points
5 comments2 min readLW link

Basin broad­ness de­pends on the size and num­ber of or­thog­o­nal features

27 Aug 2022 17:29 UTC
34 points
21 comments6 min readLW link

Suffi­ciently many Godzillas as an al­ign­ment strategy

14285728 Aug 2022 0:08 UTC
8 points
3 comments1 min readLW link

Ar­tifi­cial Mo­ral Ad­vi­sors: A New Per­spec­tive from Mo­ral Psychology

David Gross28 Aug 2022 16:37 UTC
25 points
1 comment1 min readLW link
(dl.acm.org)

First thing AI will do when it takes over is get fis­sion going

visiax28 Aug 2022 5:56 UTC
−2 points
0 comments1 min readLW link

Robert Long On Why Ar­tifi­cial Sen­tience Might Matter

Michaël Trazzi28 Aug 2022 17:30 UTC
26 points
5 comments5 min readLW link
(theinsideview.ai)

How Do AI Timelines Affect Ex­is­ten­tial Risk?

Stephen McAleese29 Aug 2022 16:57 UTC
7 points
9 comments23 min readLW link

[Question] What is the best cri­tique of AI ex­is­ten­tial risk ar­gu­ments?

joshc30 Aug 2022 2:18 UTC
5 points
10 comments1 min readLW link

Can We Align a Self-Im­prov­ing AGI?

Peter S. Park30 Aug 2022 0:14 UTC
8 points
5 comments11 min readLW link

LessWrong’s pre­dic­tion on apoc­a­lypse due to AGI (Aug 2022)

LetUsTalk29 Aug 2022 18:46 UTC
7 points
13 comments1 min readLW link

[Question] How can I reconcile the two most likely requirements for humanity's near-term survival.

Erlja Jkdf.29 Aug 2022 18:46 UTC
1 point
6 comments1 min readLW link

How likely is de­cep­tive al­ign­ment?

evhub30 Aug 2022 19:34 UTC
72 points
21 comments60 min readLW link

In­ner Align­ment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments4 min readLW link

Three sce­nar­ios of pseudo-al­ign­ment

Eleni Angelou3 Sep 2022 12:47 UTC
9 points
0 comments3 min readLW link

New 80,000 Hours prob­lem pro­file on ex­is­ten­tial risks from AI

Benjamin Hilton31 Aug 2022 17:36 UTC
28 points
7 comments7 min readLW link
(80000hours.org)

Sur­vey of NLP Re­searchers: NLP is con­tribut­ing to AGI progress; ma­jor catas­tro­phe plausible

Sam Bowman31 Aug 2022 1:39 UTC
89 points
6 comments2 min readLW link

In­fra-Ex­er­cises, Part 1

1 Sep 2022 5:06 UTC
49 points
9 comments1 min readLW link

Align­ment is hard. Com­mu­ni­cat­ing that, might be harder

Eleni Angelou1 Sep 2022 16:57 UTC
7 points
8 comments3 min readLW link

A Sur­vey of Foun­da­tional Meth­ods in In­verse Re­in­force­ment Learning

adamk1 Sep 2022 18:21 UTC
16 points
0 comments12 min readLW link

AI Safety and Neigh­bor­ing Com­mu­ni­ties: A Quick-Start Guide, as of Sum­mer 2022

Sam Bowman1 Sep 2022 19:15 UTC
74 points
2 comments7 min readLW link

A Richly In­ter­ac­tive AGI Align­ment Chart

lisperati2 Sep 2022 0:44 UTC
14 points
6 comments1 min readLW link

Re­place­ment for PONR concept

Daniel Kokotajlo2 Sep 2022 0:09 UTC
44 points
6 comments2 min readLW link

AI co­or­di­na­tion needs clear wins

evhub1 Sep 2022 23:41 UTC
134 points
15 comments2 min readLW link

Simulators

janus2 Sep 2022 12:45 UTC
472 points
103 comments44 min readLW link
(generative.ink)

Laz­i­ness in AI

Richard Henage2 Sep 2022 17:04 UTC
11 points
5 comments1 min readLW link

Agency en­g­ineer­ing: is AI-al­ign­ment “to hu­man in­tent” enough?

catubc2 Sep 2022 18:14 UTC
9 points
10 comments6 min readLW link

Sticky goals: a con­crete ex­per­i­ment for un­der­stand­ing de­cep­tive alignment

evhub2 Sep 2022 21:57 UTC
35 points
13 comments3 min readLW link

[Question] Re­quest for Align­ment Re­search Pro­ject Recommendations

Rauno Arike3 Sep 2022 15:29 UTC
10 points
2 comments1 min readLW link

Bugs or Fea­tures?

qbolec3 Sep 2022 7:04 UTC
69 points
9 comments2 min readLW link

Pri­vate al­ign­ment re­search shar­ing and coordination

porby4 Sep 2022 0:01 UTC
54 points
10 comments5 min readLW link

AXRP Epi­sode 18 - Con­cept Ex­trap­o­la­tion with Stu­art Armstrong

DanielFilan3 Sep 2022 23:12 UTC
10 points
1 comment39 min readLW link

[Question] Help me find a good Hackathon sub­ject

Charbel-Raphaël4 Sep 2022 8:40 UTC
6 points
18 comments1 min readLW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger4 Sep 2022 0:46 UTC
5 points
0 comments5 min readLW link

AI Gover­nance Needs Tech­ni­cal Work

Mau5 Sep 2022 22:28 UTC
39 points
1 comment9 min readLW link

Com­mu­nity Build­ing for Grad­u­ate Stu­dents: A Tar­geted Approach

Neil Crawford6 Sep 2022 17:17 UTC
6 points
0 comments3 min readLW link

pro­gram searches

Tamsin Leake5 Sep 2022 20:04 UTC
21 points
2 comments2 min readLW link
(carado.moe)

Alex Lawsen On Fore­cast­ing AI Progress

Michaël Trazzi6 Sep 2022 9:32 UTC
18 points
0 comments2 min readLW link
(theinsideview.ai)

It’s (not) how you use it

Eleni Angelou7 Sep 2022 17:15 UTC
8 points
1 comment2 min readLW link

AI-as­sisted list of ten con­crete al­ign­ment things to do right now

lemonhope7 Sep 2022 8:38 UTC
8 points
5 comments4 min readLW link

Progress Re­port 7: mak­ing GPT go hur­rdurr in­stead of brrrrrrr

Nathan Helm-Burger7 Sep 2022 3:28 UTC
21 points
0 comments4 min readLW link

Is there a list of pro­jects to get started with In­ter­pretabil­ity?

Franziska Fischer7 Sep 2022 4:27 UTC
8 points
2 comments1 min readLW link

Un­der­stand­ing and avoid­ing value drift

TurnTrout9 Sep 2022 4:16 UTC
40 points
9 comments6 min readLW link

Linkpost: Github Copi­lot pro­duc­tivity experiment

Daniel Kokotajlo8 Sep 2022 4:41 UTC
88 points
4 comments1 min readLW link
(github.blog)

Thoughts on AGI con­scious­ness /​ sentience

Steven Byrnes8 Sep 2022 16:40 UTC
37 points
37 comments6 min readLW link

What Should AI Owe To Us? Ac­countable and Aligned AI Sys­tems via Con­trac­tu­al­ist AI Alignment

xuan8 Sep 2022 15:04 UTC
30 points
15 comments25 min readLW link

A rough idea for solv­ing ELK: An ap­proach for train­ing gen­er­al­ist agents like GATO to make plans and de­scribe them to hu­mans clearly and hon­estly.

Michael Soareverix8 Sep 2022 15:20 UTC
2 points
2 comments2 min readLW link

Dath Ilan’s Views on Stop­gap Corrigibility

David Udell22 Sep 2022 16:16 UTC
50 points
17 comments13 min readLW link
(www.glowfic.com)

Most Peo­ple Start With The Same Few Bad Ideas

johnswentworth9 Sep 2022 0:29 UTC
161 points
30 comments3 min readLW link

Over­sight Leagues: The Train­ing Game as a Feature

Paul Bricman9 Sep 2022 10:08 UTC
20 points
6 comments10 min readLW link

AI al­ign­ment with hu­mans… but with which hu­mans?

geoffreymiller9 Sep 2022 18:21 UTC
11 points
33 comments3 min readLW link

Eval­u­a­tions pro­ject @ ARC is hiring a re­searcher and a web­dev/​engineer

Beth Barnes9 Sep 2022 22:46 UTC
94 points
7 comments10 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC
17 points
3 comments1 min readLW link

Alex­aTM − 20 Billion Pa­ram­e­ter Model With Im­pres­sive Performance

MrThink9 Sep 2022 21:46 UTC
5 points
0 comments1 min readLW link

[Fun][Link] Align­ment SMBC Comic

Gunnar_Zarncke9 Sep 2022 21:38 UTC
7 points
2 comments1 min readLW link
(www.smbc-comics.com)

Path de­pen­dence in ML in­duc­tive biases

10 Sep 2022 1:38 UTC
43 points
13 comments10 min readLW link

ethics and an­throp­ics of ho­mo­mor­phi­cally en­crypted computations

Tamsin Leake9 Sep 2022 10:49 UTC
43 points
49 comments3 min readLW link
(carado.moe)

Join ASAP! (AI Safety Ac­countabil­ity Pro­gramme) 🚀

CallumMcDougall10 Sep 2022 11:15 UTC
19 points
0 comments3 min readLW link

AI Safety field-build­ing pro­jects I’d like to see

Akash11 Sep 2022 23:43 UTC
44 points
7 comments6 min readLW link

[Question] Why do Peo­ple Think In­tel­li­gence Will be “Easy”?

DragonGod12 Sep 2022 17:32 UTC
15 points
32 comments2 min readLW link

Black Box In­ves­ti­ga­tion Re­search Hackathon

12 Sep 2022 7:20 UTC
9 points
4 comments2 min readLW link

Ar­gu­ment against 20% GDP growth from AI within 10 years [Linkpost]

aogara12 Sep 2022 4:08 UTC
58 points
21 comments5 min readLW link
(twitter.com)

Ide­olog­i­cal In­fer­ence Eng­ines: Mak­ing Deon­tol­ogy Differ­en­tiable*

Paul Bricman12 Sep 2022 12:00 UTC
6 points
0 comments14 min readLW link

Deep Q-Net­works Explained

Jay Bailey13 Sep 2022 12:01 UTC
37 points
4 comments22 min readLW link

Git Re-Basin: Merg­ing Models mod­ulo Per­mu­ta­tion Sym­me­tries [Linkpost]

aogara14 Sep 2022 8:55 UTC
21 points
0 comments2 min readLW link
(arxiv.org)

Some ideas for epis­tles to the AI ethicists

Charlie Steiner14 Sep 2022 9:07 UTC
19 points
0 comments4 min readLW link

The prob­lem with the me­dia pre­sen­ta­tion of “be­liev­ing in AI”

Roman Leventov14 Sep 2022 21:05 UTC
3 points
0 comments1 min readLW link

When is in­tent al­ign­ment suffi­cient or nec­es­sary to re­duce AGI con­flict?

14 Sep 2022 19:39 UTC
32 points
0 comments9 min readLW link

When would AGIs en­gage in con­flict?

14 Sep 2022 19:38 UTC
37 points
3 comments13 min readLW link

Re­spond­ing to ‘Beyond Hyper­an­thro­po­mor­phism’

ukc1001414 Sep 2022 20:37 UTC
8 points
0 comments16 min readLW link

How should Deep­Mind’s Chin­chilla re­vise our AI fore­casts?

Cleo Nardo15 Sep 2022 17:54 UTC
34 points
12 comments13 min readLW link

Ra­tional An­i­ma­tions’ Script Writ­ing Contest

Writer15 Sep 2022 16:56 UTC
22 points
1 comment3 min readLW link

Rep­re­sen­ta­tional Tethers: Ty­ing AI La­tents To Hu­man Ones

Paul Bricman16 Sep 2022 14:45 UTC
30 points
0 comments16 min readLW link

[Question] Why are we sure that AI will “want” some­thing?

Shmi16 Sep 2022 20:35 UTC
31 points
58 comments1 min readLW link

Refine Blog­post Day #3: The short­forms I did write

Alexander Gietelink Oldenziel16 Sep 2022 21:03 UTC
23 points
0 comments1 min readLW link

Take­aways from our ro­bust in­jury clas­sifier pro­ject [Red­wood Re­search]

dmz17 Sep 2022 3:55 UTC
135 points
9 comments6 min readLW link

Refine’s Third Blog Post Day/​Week

adamShimi17 Sep 2022 17:03 UTC
18 points
0 comments1 min readLW link

There is no royal road to alignment

Eleni Angelou18 Sep 2022 3:33 UTC
4 points
2 comments3 min readLW link

Prize and fast track to al­ign­ment re­search at ALTER

Vanessa Kosoy17 Sep 2022 16:58 UTC
65 points
4 comments3 min readLW link

[Question] Updates on FLI's Value Alignment Map?

Fer32dwt34r3dfsz17 Sep 2022 22:27 UTC
17 points
4 comments1 min readLW link

Ap­ply for men­tor­ship in AI Safety field-building

Akash17 Sep 2022 19:06 UTC
9 points
0 comments1 min readLW link
(forum.effectivealtruism.org)

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllis17 Sep 2022 19:48 UTC
19 points
13 comments3 min readLW link

Pod­casts on sur­veys, slower AI, AI ar­gu­ments, etc

KatjaGrace18 Sep 2022 7:30 UTC
13 points
0 comments1 min readLW link
(worldspiritsockpuppet.com)

In­ner al­ign­ment: what are we point­ing at?

lemonhope18 Sep 2022 11:09 UTC
7 points
2 comments1 min readLW link

The In­ter-Agent Facet of AI Alignment

Michael Oesterle18 Sep 2022 20:39 UTC
12 points
1 comment5 min readLW link

Quintin’s al­ign­ment pa­pers roundup—week 2

Quintin Pope19 Sep 2022 13:41 UTC
60 points
2 comments10 min readLW link

Safety timelines: How long will it take to solve al­ign­ment?

19 Sep 2022 12:53 UTC
35 points
7 comments6 min readLW link
(forum.effectivealtruism.org)

Prize idea: Trans­mit MIRI and Eliezer’s worldviews

elifland19 Sep 2022 21:21 UTC
45 points
18 comments2 min readLW link

A noob goes to the SERI MATS presentations

Lowell Dennings19 Sep 2022 17:35 UTC
26 points
0 comments5 min readLW link

How to make your CPU as fast as a GPU—Ad­vances in Spar­sity w/​ Nir Shavit

the gears to ascension20 Sep 2022 3:48 UTC
0 points
0 comments27 min readLW link
(www.youtube.com)

Towards de­con­fus­ing wire­head­ing and re­ward maximization

leogao21 Sep 2022 0:36 UTC
69 points
7 comments4 min readLW link

Here Be AGI Dragons

Eris Discordia21 Sep 2022 22:28 UTC
−2 points
0 comments5 min readLW link

An­nounc­ing AISIC 2022 - the AI Safety Is­rael Con­fer­ence, Oc­to­ber 19-20

Davidmanheim21 Sep 2022 19:32 UTC
13 points
0 comments1 min readLW link

AI Risk In­tro 2: Solv­ing The Problem

22 Sep 2022 13:55 UTC
13 points
0 comments27 min readLW link

[Question] AI career

ondragon22 Sep 2022 3:48 UTC
2 points
0 comments1 min readLW link

Sha­har Avin On How To Reg­u­late Ad­vanced AI Systems

Michaël Trazzi23 Sep 2022 15:46 UTC
31 points
0 comments4 min readLW link
(theinsideview.ai)

The het­ero­gene­ity of hu­man value types: Im­pli­ca­tions for AI alignment

geoffreymiller23 Sep 2022 17:03 UTC
10 points
2 comments10 min readLW link

In­tel­li­gence as a Platform

Robert Kennedy23 Sep 2022 5:51 UTC
10 points
5 comments3 min readLW link

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

23 Sep 2022 17:58 UTC
123 points
26 comments33 min readLW link

Un­der what cir­cum­stances have gov­ern­ments can­cel­led AI-type sys­tems?

David Gross23 Sep 2022 21:11 UTC
7 points
1 comment1 min readLW link
(www.carnegieuktrust.org.uk)

[Question] I’m plan­ning to start cre­at­ing more write-ups sum­ma­riz­ing my thoughts on var­i­ous is­sues, mostly re­lated to AI ex­is­ten­tial safety. What do you want to hear my nu­anced takes on?

David Scott Krueger (formerly: capybaralet)24 Sep 2022 12:38 UTC
9 points
10 comments1 min readLW link

[Question] Why Do AI re­searchers Rate the Prob­a­bil­ity of Doom So Low?

Aorou24 Sep 2022 2:33 UTC
7 points
6 comments3 min readLW link

AI coöper­a­tion is more pos­si­ble than you think

42317524 Sep 2022 21:26 UTC
6 points
0 comments2 min readLW link

An Un­ex­pected GPT-3 De­ci­sion in a Sim­ple Gam­ble

casualphysicsenjoyer25 Sep 2022 16:46 UTC
8 points
4 comments1 min readLW link

Pri­ori­tiz­ing the Arts in re­sponse to AI automation

Casey25 Sep 2022 2:25 UTC
18 points
11 comments2 min readLW link

Plan­ning ca­pac­ity and daemons

lemonhope26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

Re­call and Re­gur­gi­ta­tion in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC
33 points
1 comment26 min readLW link

[MLSN #5]: Prize Compilation

Dan H26 Sep 2022 21:55 UTC
14 points
1 comment2 min readLW link

Loss of Align­ment is not the High-Order Bit for AI Risk

yieldthought26 Sep 2022 21:16 UTC
14 points
20 comments2 min readLW link

In­verse Scal­ing Prize: Round 1 Winners

26 Sep 2022 19:57 UTC
88 points
16 comments4 min readLW link
(irmckenzie.co.uk)

[Question] Does the ex­is­tence of shared hu­man val­ues im­ply al­ign­ment is “easy”?

Morpheus26 Sep 2022 18:01 UTC
7 points
14 comments1 min readLW link

Why we’re not found­ing a hu­man-data-for-al­ign­ment org

27 Sep 2022 20:14 UTC
80 points
5 comments29 min readLW link
(forum.effectivealtruism.org)

Be Not Afraid

Alex Beyman27 Sep 2022 22:04 UTC
8 points
0 comments6 min readLW link

Strange Loops—Self-Refer­ence from Num­ber The­ory to AI

ojorgensen28 Sep 2022 14:10 UTC
9 points
5 comments18 min readLW link

AI Safety Endgame Stories

Ivan Vendrov28 Sep 2022 16:58 UTC
27 points
11 comments11 min readLW link

Es­ti­mat­ing the Cur­rent and Fu­ture Num­ber of AI Safety Researchers

Stephen McAleese28 Sep 2022 21:11 UTC
24 points
11 comments9 min readLW link
(forum.effectivealtruism.org)

Clar­ify­ing the Agent-Like Struc­ture Problem

johnswentworth29 Sep 2022 21:28 UTC
53 points
14 comments6 min readLW link

Emer­gency learning

Stuart_Armstrong28 Jan 2017 10:05 UTC
13 points
10 comments4 min readLW link

EAG DC: Meta-Bot­tle­necks in Prevent­ing AI Doom

Joseph Bloom30 Sep 2022 17:53 UTC
5 points
0 comments1 min readLW link

In­ter­est­ing pa­pers: for­mally ver­ify­ing DNNs

the gears to ascension30 Sep 2022 8:49 UTC
13 points
0 comments3 min readLW link

linkpost: loss basin visualization

Nathan Helm-Burger30 Sep 2022 3:42 UTC
14 points
1 comment1 min readLW link

Four us­ages of “loss” in AI

TurnTrout2 Oct 2022 0:52 UTC
42 points
18 comments5 min readLW link

An­nounc­ing the AI Safety Nudge Com­pe­ti­tion to Help Beat Procrastination

Marc Carauleanu1 Oct 2022 1:49 UTC
10 points
0 comments1 min readLW link

Google could build a con­scious AI in three months

derek shiller1 Oct 2022 13:24 UTC
9 points
18 comments1 min readLW link

AGI by 2050 prob­a­bil­ity less than 1%

fumin1 Oct 2022 19:45 UTC
−10 points
4 comments9 min readLW link
(docs.google.com)

[Question] Do an­thropic con­sid­er­a­tions un­der­cut the evolu­tion an­chor from the Bio An­chors re­port?

Ege Erdil1 Oct 2022 20:02 UTC
20 points
13 comments2 min readLW link

A re­view of the Bio-An­chors report

jylin043 Oct 2022 10:27 UTC
45 points
4 comments1 min readLW link
(docs.google.com)

Data for IRL: What is needed to learn hu­man val­ues?

Jan Wehner3 Oct 2022 9:23 UTC
18 points
6 comments12 min readLW link

my cur­rent out­look on AI risk mitigation

Tamsin Leake3 Oct 2022 20:06 UTC
58 points
4 comments11 min readLW link
(carado.moe)

No free lunch the­o­rem is irrelevant

Catnee4 Oct 2022 0:21 UTC
12 points
7 comments1 min readLW link

Paper+Sum­mary: OMNIGROK: GROKKING BEYOND ALGORITHMIC DATA

Marius Hobbhahn4 Oct 2022 7:22 UTC
44 points
11 comments1 min readLW link
(arxiv.org)

How are you deal­ing with on­tol­ogy iden­ti­fi­ca­tion?

Erik Jenner4 Oct 2022 23:28 UTC
33 points
10 comments3 min readLW link

Reflec­tion Mechanisms as an Align­ment tar­get: A fol­low-up survey

5 Oct 2022 14:03 UTC
13 points
2 comments7 min readLW link

Track­ing Com­pute Stocks and Flows: Case Stud­ies?

Cullen5 Oct 2022 17:57 UTC
11 points
5 comments1 min readLW link

Char­i­ta­ble Reads of Anti-AGI-X-Risk Ar­gu­ments, Part 1

sstich5 Oct 2022 5:03 UTC
3 points
4 comments3 min readLW link

Neu­ral Tan­gent Ker­nel Distillation

5 Oct 2022 18:11 UTC
68 points
20 comments8 min readLW link

More Re­cent Progress in the The­ory of Neu­ral Networks

jylin046 Oct 2022 16:57 UTC
78 points
6 comments4 min readLW link

Analysing a 2036 Takeover Scenario

ukc100146 Oct 2022 20:48 UTC
8 points
2 comments27 min readLW link

Warn­ing Shots Prob­a­bly Wouldn’t Change The Pic­ture Much

So8res6 Oct 2022 5:15 UTC
111 points
40 comments2 min readLW link

Align­ment Might Never Be Solved, By Hu­mans or AI

interstice7 Oct 2022 16:14 UTC
30 points
6 comments3 min readLW link

linkpost: neuro-sym­bolic hy­brid ai

Nathan Helm-Burger6 Oct 2022 21:52 UTC
16 points
0 comments1 min readLW link
(youtu.be)

Poly­se­man­tic­ity and Ca­pac­ity in Neu­ral Networks

7 Oct 2022 17:51 UTC
78 points
9 comments3 min readLW link

[Question] De­liber­ate prac­tice for re­search?

Alex_Altair8 Oct 2022 3:45 UTC
16 points
2 comments1 min readLW link

[Question] How many GPUs does NVIDIA make?

leogao8 Oct 2022 17:54 UTC
27 points
2 comments1 min readLW link

SERI MATS Pro­gram—Win­ter 2022 Cohort

8 Oct 2022 19:09 UTC
71 points
12 comments4 min readLW link

[Question] Toy alignment problem: Social Network KPI design

qbolec8 Oct 2022 22:14 UTC
7 points
1 comment1 min readLW link

My ten­ta­tive in­ter­pretabil­ity re­search agenda—topol­ogy match­ing.

Maxwell Clarke8 Oct 2022 22:14 UTC
10 points
2 comments4 min readLW link

[Question] AI Risk Micro­dy­nam­ics Survey

Froolow9 Oct 2022 20:04 UTC
3 points
0 comments1 min readLW link

Pos­si­ble miracles

9 Oct 2022 18:17 UTC
60 points
33 comments8 min readLW link

The Le­bowski The­o­rem — Char­i­ta­ble Reads of Anti-AGI-X-Risk Ar­gu­ments, Part 2

sstich8 Oct 2022 22:39 UTC
1 point
10 comments7 min readLW link

Embed­ding AI into AR goggles

aixar9 Oct 2022 20:08 UTC
−12 points
0 comments1 min readLW link

Cat­a­logu­ing Pri­ors in The­ory and Practice

Paul Bricman13 Oct 2022 12:36 UTC
13 points
8 comments7 min readLW link

Re­sults from the lan­guage model hackathon

Esben Kran10 Oct 2022 8:29 UTC
21 points
1 comment4 min readLW link

Don’t ex­pect AGI any­time soon

cveres10 Oct 2022 22:38 UTC
−14 points
6 comments1 min readLW link

Disen­tan­gling in­ner al­ign­ment failures

Erik Jenner10 Oct 2022 18:50 UTC
14 points
5 comments4 min readLW link

Anony­mous ad­vice: If you want to re­duce AI risk, should you take roles that ad­vance AI ca­pa­bil­ities?

Benjamin Hilton11 Oct 2022 14:16 UTC
54 points
10 comments1 min readLW link

Pret­tified AI Safety Game Cards

abramdemski11 Oct 2022 19:35 UTC
46 points
6 comments1 min readLW link

Power-Seek­ing AI and Ex­is­ten­tial Risk

Antonio Franca11 Oct 2022 22:50 UTC
5 points
0 comments9 min readLW link

Align­ment 201 curriculum

Richard_Ngo12 Oct 2022 18:03 UTC
102 points
3 comments1 min readLW link
(www.agisafetyfundamentals.com)

Ar­ti­cle Re­view: Google’s AlphaTensor

Robert_AIZI12 Oct 2022 18:04 UTC
8 points
2 comments10 min readLW link

[Question] Pre­vi­ous Work on Re­cre­at­ing Neu­ral Net­work In­put from In­ter­me­di­ate Layer Activations

bglass12 Oct 2022 19:28 UTC
1 point
3 comments1 min readLW link

You are bet­ter at math (and al­ign­ment) than you think

trevor13 Oct 2022 3:07 UTC
37 points
7 comments22 min readLW link
(www.lesswrong.com)

Coun­ter­ar­gu­ments to the ba­sic AI x-risk case

KatjaGrace14 Oct 2022 13:00 UTC
336 points
122 comments34 min readLW link
(aiimpacts.org)

Another prob­lem with AI con­fine­ment: or­di­nary CPUs can work as ra­dio transmitters

RomanS14 Oct 2022 8:28 UTC
34 points
1 comment1 min readLW link
(news.softpedia.com)

“AGI soon, but Nar­row works Bet­ter”

AnthonyRepetto14 Oct 2022 21:35 UTC
1 point
9 comments2 min readLW link

[Question] Best re­source to go from “typ­i­cal smart tech-savvy per­son” to “per­son who gets AGI risk ur­gency”?

Liron15 Oct 2022 22:26 UTC
14 points
8 comments1 min readLW link

[Question] Ques­tions about the al­ign­ment problem

GG1017 Oct 2022 1:42 UTC
−5 points
13 comments3 min readLW link

[Question] Creat­ing su­per­in­tel­li­gence with­out AGI

Antb17 Oct 2022 19:01 UTC
7 points
3 comments1 min readLW link

AI Safety Ideas: An Open AI Safety Re­search Platform

Esben Kran17 Oct 2022 17:01 UTC
24 points
0 comments1 min readLW link

Is GPT-N bounded by hu­man ca­pac­i­ties? No.

Cleo Nardo17 Oct 2022 23:26 UTC
5 points
4 comments2 min readLW link

A prag­matic met­ric for Ar­tifi­cial Gen­eral Intelligence

lorepieri17 Oct 2022 22:07 UTC
6 points
0 comments1 min readLW link
(lorenzopieri.com)

Is GitHub Copi­lot in le­gal trou­ble?

tcelferact18 Oct 2022 16:19 UTC
34 points
2 comments1 min readLW link

Me­tac­u­lus is build­ing a team ded­i­cated to AI forecasting

ChristianWilliams18 Oct 2022 16:08 UTC
3 points
0 comments1 min readLW link

[Question] Where can I find solu­tion to the ex­er­cises of AGISF?

Charbel-Raphaël18 Oct 2022 14:11 UTC
7 points
0 comments1 min readLW link

A con­ver­sa­tion about Katja’s coun­ter­ar­gu­ments to AI risk

18 Oct 2022 18:40 UTC
43 points
9 comments33 min readLW link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel Nanda18 Oct 2022 21:08 UTC
66 points
5 comments12 min readLW link
(www.neelnanda.io)

Distil­led Rep­re­sen­ta­tions Re­search Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments8 min readLW link

[Question] Should we push for re­quiring AI train­ing data to be li­censed?

ChristianKl19 Oct 2022 17:49 UTC
38 points
32 comments1 min readLW link

Hacker-AI and Digi­tal Ghosts – Pre-AGI

Erland Wittkotter19 Oct 2022 15:33 UTC
9 points
7 comments8 min readLW link

Scal­ing Laws for Re­ward Model Overoptimization

20 Oct 2022 0:20 UTC
86 points
11 comments1 min readLW link
(arxiv.org)

The her­i­ta­bil­ity of hu­man val­ues: A be­hav­ior ge­netic cri­tique of Shard Theory

geoffreymiller20 Oct 2022 15:51 UTC
63 points
58 comments21 min readLW link

aisafety.com­mu­nity—A liv­ing doc­u­ment of AI safety communities

28 Oct 2022 17:50 UTC
52 points
22 comments1 min readLW link

Tra­jec­to­ries to 2036

ukc1001420 Oct 2022 20:23 UTC
1 point
1 comment14 min readLW link

In­tel­li­gent be­havi­our across sys­tems, scales and substrates

Nora_Ammann21 Oct 2022 17:09 UTC
11 points
0 comments10 min readLW link

A frame­work and open ques­tions for game the­o­retic shard modeling

Garrett Baker21 Oct 2022 21:40 UTC
11 points
4 comments4 min readLW link

[Question] The Last Year - is there an ex­ist­ing novel about the last year be­fore AI doom?

Luca Petrolati22 Oct 2022 20:44 UTC
4 points
4 comments1 min readLW link

Em­pow­er­ment is (al­most) All We Need

jacob_cannell23 Oct 2022 21:48 UTC
36 points
43 comments17 min readLW link

The op­ti­mal timing of spend­ing on AGI safety work; why we should prob­a­bly be spend­ing more now

Tristan Cook24 Oct 2022 17:42 UTC
62 points
0 comments1 min readLW link

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel Nanda24 Oct 2022 20:45 UTC
62 points
8 comments3 min readLW link
(neelnanda.io)

Con­sider try­ing Vivek Heb­bar’s al­ign­ment exercises

Akash24 Oct 2022 19:46 UTC
36 points
1 comment4 min readLW link

POWER­play: An open-source toolchain to study AI power-seeking

Edouard Harris24 Oct 2022 20:03 UTC
22 points
0 comments1 min readLW link
(github.com)

What does it take to defend the world against out-of-con­trol AGIs?

Steven Byrnes25 Oct 2022 14:47 UTC
141 points
31 comments30 min readLW link

Mechanism De­sign for AI Safety—Read­ing Group Curriculum

Rubi J. Hudson25 Oct 2022 3:54 UTC
7 points
1 comment1 min readLW link

Maps and Blueprint; the Two Sides of the Align­ment Equation

Nora_Ammann25 Oct 2022 16:29 UTC
21 points
1 comment5 min readLW link

A Walk­through of A Math­e­mat­i­cal Frame­work for Trans­former Circuits

Neel Nanda25 Oct 2022 20:24 UTC
49 points
5 comments1 min readLW link
(www.youtube.com)

Paper: In-con­text Re­in­force­ment Learn­ing with Al­gorithm Distil­la­tion [Deep­mind]

LawrenceC26 Oct 2022 18:45 UTC
28 points
5 comments1 min readLW link
(arxiv.org)

Ap­ply to the Red­wood Re­search Mechanis­tic In­ter­pretabil­ity Ex­per­i­ment (REMIX), a re­search pro­gram in Berkeley

27 Oct 2022 1:32 UTC
134 points
14 comments12 min readLW link

You won’t solve al­ign­ment with­out agent foundations

Mikhail Samin6 Nov 2022 8:07 UTC
21 points
3 comments8 min readLW link

AI & ML Safety Up­dates W43

28 Oct 2022 13:18 UTC
9 points
3 comments3 min readLW link

Prizes for ML Safety Bench­mark Ideas

joshc28 Oct 2022 2:51 UTC
36 points
3 comments1 min readLW link

Me (Steve Byrnes) on the “Brain In­spired” podcast

Steven Byrnes30 Oct 2022 19:15 UTC
26 points
1 comment1 min readLW link
(braininspired.co)

Join the in­ter­pretabil­ity re­search hackathon

Esben Kran28 Oct 2022 16:26 UTC
15 points
0 comments1 min readLW link

In­stru­men­tal ig­nor­ing AI, Dumb but not use­less.

Donald Hobson30 Oct 2022 16:55 UTC
7 points
6 comments2 min readLW link

«Boundaries», Part 3a: Defin­ing bound­aries as di­rected Markov blankets

Andrew_Critch30 Oct 2022 6:31 UTC
58 points
13 comments15 min readLW link

[Book] In­ter­pretable Ma­chine Learn­ing: A Guide for Mak­ing Black Box Models Explainable

Esben Kran31 Oct 2022 11:38 UTC
19 points
1 comment1 min readLW link
(christophm.github.io)

“Cars and Elephants”: a hand­wavy ar­gu­ment/​anal­ogy against mechanis­tic interpretability

David Scott Krueger (formerly: capybaralet)31 Oct 2022 21:26 UTC
47 points
25 comments2 min readLW link

ML Safety Schol­ars Sum­mer 2022 Retrospective

TW123 1 Nov 2022 3:09 UTC
29 points
0 comments1 min readLW link

What sorts of sys­tems can be de­cep­tive?

Andrei Alexandru31 Oct 2022 22:00 UTC
14 points
0 comments7 min readLW link

All AGI Safety ques­tions wel­come (es­pe­cially ba­sic ones) [~monthly thread]

Robert Miles1 Nov 2022 23:23 UTC
67 points
100 comments2 min readLW link

Real-Time Re­search Record­ing: Can a Trans­former Re-Derive Po­si­tional Info?

Neel Nanda1 Nov 2022 23:56 UTC
68 points
14 comments1 min readLW link
(youtu.be)

On the cor­re­spon­dence be­tween AI-mis­al­ign­ment and cog­ni­tive dis­so­nance us­ing a be­hav­ioral eco­nomics model

Stijn Bruers1 Nov 2022 17:39 UTC
4 points
0 comments6 min readLW link

WFW?: Op­por­tu­nity and The­ory of Impact

DavidCorfield2 Nov 2022 1:24 UTC
1 point
0 comments1 min readLW link

AI Safety Needs Great Product Builders

goodgravy2 Nov 2022 11:33 UTC
14 points
2 comments1 min readLW link

A Mys­tery About High Di­men­sional Con­cept Encoding

Fabien Roger3 Nov 2022 17:05 UTC
46 points
13 comments7 min readLW link

Ethan Ca­ballero on Bro­ken Neu­ral Scal­ing Laws, De­cep­tion, and Re­cur­sive Self Improvement

4 Nov 2022 18:09 UTC
14 points
11 comments5 min readLW link
(theinsideview.ai)

Can we pre­dict the abil­ities of fu­ture AI? MLAISU W44

4 Nov 2022 15:19 UTC
10 points
0 comments3 min readLW link
(newsletter.apartresearch.com)

My sum­mary of “Prag­matic AI Safety”

Eleni Angelou5 Nov 2022 12:54 UTC
2 points
0 comments5 min readLW link

Re­view of the Challenge

SD Marlow5 Nov 2022 6:38 UTC
−14 points
5 comments2 min readLW link

How to store hu­man val­ues on a computer

Oliver Siegel5 Nov 2022 19:17 UTC
−12 points
17 comments1 min readLW link

Should AI fo­cus on prob­lem-solv­ing or strate­gic plan­ning? Why not both?

Oliver Siegel5 Nov 2022 19:17 UTC
−12 points
3 comments1 min readLW link

In­stead of tech­ni­cal re­search, more peo­ple should fo­cus on buy­ing time

5 Nov 2022 20:43 UTC
80 points
51 comments14 min readLW link

[Question] Is there some kind of back­log or de­lay for data cen­ter AI?

trevor7 Nov 2022 8:18 UTC
5 points
2 comments1 min readLW link

A Walk­through of In­ter­pretabil­ity in the Wild (w/​ au­thors Kevin Wang, Arthur Conmy & Alexan­dre Variengien)

Neel Nanda7 Nov 2022 22:39 UTC
29 points
15 comments3 min readLW link
(youtu.be)

How could we know that an AGI sys­tem will have good con­se­quences?

So8res7 Nov 2022 22:42 UTC
86 points
24 comments5 min readLW link

Peo­ple care about each other even though they have im­perfect mo­ti­va­tional poin­t­ers?

TurnTrout8 Nov 2022 18:15 UTC
32 points
25 comments7 min readLW link

[ASoT] Thoughts on GPT-N

Ulisse Mini8 Nov 2022 7:14 UTC
8 points
0 comments1 min readLW link

In­verse scal­ing can be­come U-shaped

Edouard Harris8 Nov 2022 19:04 UTC
27 points
15 comments1 min readLW link
(arxiv.org)

Counterfactability

Scott Garrabrant7 Nov 2022 5:39 UTC
36 points
4 comments11 min readLW link

Take­aways from a sur­vey on AI al­ign­ment resources

DanielFilan5 Nov 2022 23:40 UTC
73 points
9 comments6 min readLW link
(danielfilan.com)

[ASoT] In­stru­men­tal con­ver­gence is useful

Ulisse Mini9 Nov 2022 20:20 UTC
5 points
9 comments1 min readLW link

Me­satrans­la­tion and Metatranslation

jdp9 Nov 2022 18:46 UTC
23 points
4 comments11 min readLW link

The In­ter­pretabil­ity Playground

Esben Kran10 Nov 2022 17:15 UTC
8 points
0 comments1 min readLW link
(alignmentjam.com)

Align­ment al­lows “non­ro­bust” de­ci­sion-in­fluences and doesn’t re­quire ro­bust grading

TurnTrout29 Nov 2022 6:23 UTC
55 points
27 comments15 min readLW link

[Question] What are some low-cost out­side-the-box ways to do/​fund al­ign­ment re­search?

trevor11 Nov 2022 5:25 UTC
10 points
0 comments1 min readLW link

In­stru­men­tal con­ver­gence is what makes gen­eral in­tel­li­gence possible

tailcalled11 Nov 2022 16:38 UTC
72 points
11 comments4 min readLW link

A short cri­tique of Vanessa Kosoy’s PreDCA

Martín Soto13 Nov 2022 16:00 UTC
25 points
8 comments4 min readLW link

[Question] Why don’t we have self driv­ing cars yet?

Linda Linsefors14 Nov 2022 12:19 UTC
21 points
16 comments1 min readLW link

Win­ners of the AI Safety Nudge Competition

Marc Carauleanu15 Nov 2022 1:06 UTC
4 points
0 comments1 min readLW link

[Question] Will nan­otech/​biotech be what leads to AI doom?

tailcalled15 Nov 2022 17:38 UTC
4 points
8 comments2 min readLW link

[Question] What is our cur­rent best in­fo­haz­ard policy for AGI (safety) re­search?

Roman Leventov15 Nov 2022 22:33 UTC
12 points
2 comments1 min readLW link

Disagree­ment with bio an­chors that lead to shorter timelines

Marius Hobbhahn16 Nov 2022 14:40 UTC
72 points
16 comments7 min readLW link

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

16 Nov 2022 14:14 UTC
82 points
3 comments12 min readLW link

[Question] Is there some rea­son LLMs haven’t seen broader use?

tailcalled16 Nov 2022 20:04 UTC
25 points
27 comments1 min readLW link

AI Fore­cast­ing Re­search Ideas

Jsevillamol17 Nov 2022 17:37 UTC
21 points
2 comments1 min readLW link

Re­sults from the in­ter­pretabil­ity hackathon

17 Nov 2022 14:51 UTC
80 points
0 comments6 min readLW link

Don’t de­sign agents which ex­ploit ad­ver­sar­ial inputs

18 Nov 2022 1:48 UTC
60 points
61 comments12 min readLW link

AI Ethics != AI Safety

Dentin18 Nov 2022 3:02 UTC
2 points
0 comments1 min readLW link

Up­date to Mys­ter­ies of mode col­lapse: text-davinci-002 not RLHF

janus19 Nov 2022 23:51 UTC
69 points
8 comments2 min readLW link

Limits to the Con­trol­la­bil­ity of AGI

20 Nov 2022 19:18 UTC
10 points
2 comments9 min readLW link

[ASoT] Reflec­tivity in Nar­row AI

Ulisse Mini21 Nov 2022 0:51 UTC
6 points
1 comment1 min readLW link

Here’s the exit.

Valentine21 Nov 2022 18:07 UTC
85 points
138 comments10 min readLW link

Clar­ify­ing wire­head­ing terminology

leogao24 Nov 2022 4:53 UTC
53 points
6 comments1 min readLW link

A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

Neel Nanda22 Nov 2022 17:12 UTC
20 points
0 comments1 min readLW link
(www.youtube.com)

An­nounc­ing AI safety Men­tors and Mentees

Marius Hobbhahn23 Nov 2022 15:21 UTC
54 points
7 comments10 min readLW link

My take on Ja­cob Can­nell’s take on AGI safety

Steven Byrnes28 Nov 2022 14:01 UTC
61 points
13 comments30 min readLW link

Don’t al­ign agents to eval­u­a­tions of plans

TurnTrout26 Nov 2022 21:16 UTC
37 points
46 comments18 min readLW link

[Question] Dumb and ill-posed ques­tion: Is con­cep­tual re­search like this MIRI pa­per on the shut­down prob­lem/​Cor­rigi­bil­ity “real”

joraine24 Nov 2022 5:08 UTC
25 points
11 comments1 min readLW link

Refin­ing the Sharp Left Turn threat model, part 2: ap­ply­ing al­ign­ment techniques

25 Nov 2022 14:36 UTC
36 points
4 comments6 min readLW link
(vkrakovna.wordpress.com)

Pod­cast: Shoshan­nah Tekofsky on skil­ling up in AI safety, vis­it­ing Berkeley, and de­vel­op­ing novel re­search ideas

Akash25 Nov 2022 20:47 UTC
37 points
2 comments9 min readLW link

Mechanis­tic anomaly de­tec­tion and ELK

paulfchristiano25 Nov 2022 18:50 UTC
121 points
17 comments21 min readLW link
(ai-alignment.com)

The First Filter

26 Nov 2022 19:37 UTC
55 points
5 comments1 min readLW link

Dis­cussing how to al­ign Trans­for­ma­tive AI if it’s de­vel­oped very soon

28 Nov 2022 16:17 UTC
36 points
2 comments30 min readLW link

On the Di­plo­macy AI

Zvi28 Nov 2022 13:20 UTC
119 points
29 comments11 min readLW link
(thezvi.wordpress.com)

Why Would AI “Aim” To Defeat Hu­man­ity?

HoldenKarnofsky29 Nov 2022 19:30 UTC
68 points
9 comments33 min readLW link
(www.cold-takes.com)

Dist­in­guish­ing test from training

So8res29 Nov 2022 21:41 UTC
65 points
10 comments6 min readLW link

[Question] Do any of the AI Risk eval­u­a­tions fo­cus on hu­mans as the risk?

jmh30 Nov 2022 3:09 UTC
10 points
8 comments1 min readLW link

Ap­ply to at­tend win­ter AI al­ign­ment work­shops (Dec 28-30 & Jan 3-5) near Berkeley

1 Dec 2022 20:46 UTC
25 points
1 comment1 min readLW link

The­o­ries of im­pact for Science of Deep Learning

Marius Hobbhahn1 Dec 2022 14:39 UTC
16 points
0 comments11 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
96 points
18 comments53 min readLW link

The Plan − 2022 Update

johnswentworth1 Dec 2022 20:43 UTC
211 points
33 comments8 min readLW link

Find­ing gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC
91 points
7 comments16 min readLW link
(ai-alignment.com)

Take 1: We’re not go­ing to re­verse-en­g­ineer the AI.

Charlie Steiner1 Dec 2022 22:41 UTC
38 points
4 comments4 min readLW link

Un­der­stand­ing goals in com­plex systems

Johannes C. Mayer1 Dec 2022 23:49 UTC
9 points
0 comments1 min readLW link
(www.youtube.com)

Mas­ter­ing Strat­ego (Deep­mind)

svemirski2 Dec 2022 2:21 UTC
6 points
0 comments1 min readLW link
(www.deepmind.com)

Jailbreak­ing ChatGPT on Re­lease Day

Zvi2 Dec 2022 13:10 UTC
237 points
74 comments6 min readLW link
(thezvi.wordpress.com)

[Question] Did I just catch GPTchat do­ing some­thing un­ex­pect­edly in­sight­ful?

trevor2 Dec 2022 7:48 UTC
9 points
0 comments1 min readLW link

Take 2: Build­ing tools to help build FAI is a le­gi­t­i­mate strat­egy, but it’s dual-use.

Charlie Steiner3 Dec 2022 0:54 UTC
16 points
1 comment2 min readLW link

Causal scrub­bing: re­sults on in­duc­tion heads

3 Dec 2022 0:59 UTC
32 points
0 comments17 min readLW link

Log­i­cal in­duc­tion for soft­ware engineers

Alex Flint3 Dec 2022 19:55 UTC
124 points
2 comments27 min readLW link

ChatGPT is surprisingly and uncannily good at pretending to be sentient

Victor Novikov3 Dec 2022 14:47 UTC
17 points
11 comments18 min readLW link

Monthly Shorts 11/​22

Celer5 Dec 2022 7:30 UTC
8 points
0 comments3 min readLW link
(keller.substack.com)

Take 4: One prob­lem with nat­u­ral ab­strac­tions is there’s too many of them.

Charlie Steiner5 Dec 2022 10:39 UTC
34 points
4 comments1 min readLW link

The No Free Lunch the­o­rem for dummies

Steven Byrnes5 Dec 2022 21:46 UTC
28 points
16 comments3 min readLW link

[Link] Why I’m op­ti­mistic about OpenAI’s al­ign­ment approach

janleike5 Dec 2022 22:51 UTC
93 points
13 comments1 min readLW link
(aligned.substack.com)

Up­dat­ing my AI timelines

Matthew Barnett5 Dec 2022 20:46 UTC
134 points
40 comments2 min readLW link

ChatGPT and Ide­olog­i­cal Tur­ing Test

Viliam5 Dec 2022 21:45 UTC
41 points
1 comment1 min readLW link

Ver­ifi­ca­tion Is Not Easier Than Gen­er­a­tion In General

johnswentworth6 Dec 2022 5:20 UTC
56 points
23 comments1 min readLW link

[Question] What are the ma­jor un­der­ly­ing di­vi­sions in AI safety?

Chris Leong6 Dec 2022 3:28 UTC
5 points
2 comments1 min readLW link

Take 5: Another prob­lem for nat­u­ral ab­strac­tions is laz­i­ness.

Charlie Steiner6 Dec 2022 7:00 UTC
30 points
4 comments3 min readLW link

Mesa-Op­ti­miz­ers via Grokking

orthonormal6 Dec 2022 20:05 UTC
35 points
4 comments6 min readLW link

[Question] How do finite fac­tored sets com­pare with phase space?

Alex_Altair6 Dec 2022 20:05 UTC
14 points
1 comment1 min readLW link

Us­ing GPT-Eliezer against ChatGPT Jailbreaking

6 Dec 2022 19:54 UTC
159 points
77 comments9 min readLW link

Take 6: CAIS is ac­tu­ally Or­wellian.

Charlie Steiner7 Dec 2022 13:50 UTC
14 points
5 comments2 min readLW link

[Question] Look­ing for ideas of pub­lic as­sets (stocks, funds, ETFs) that I can in­vest in to have a chance at prof­it­ing from the mass adop­tion and com­mer­cial­iza­tion of AI technology

Annapurna7 Dec 2022 22:35 UTC
15 points
9 comments1 min readLW link

You should con­sider launch­ing an AI startup

joshc8 Dec 2022 0:28 UTC
5 points
16 comments4 min readLW link

Ma­chine Learn­ing Consent

jefftk8 Dec 2022 3:50 UTC
38 points
14 comments3 min readLW link
(www.jefftk.com)

Rele­vant to nat­u­ral ab­strac­tions: Eu­clidean Sym­me­try Equiv­ar­i­ant Ma­chine Learn­ing—Overview, Ap­pli­ca­tions, and Open Questions

the gears to ascension8 Dec 2022 18:01 UTC
7 points
0 comments1 min readLW link
(youtu.be)

AI Safety Seems Hard to Measure

HoldenKarnofsky8 Dec 2022 19:50 UTC
68 points
5 comments14 min readLW link
(www.cold-takes.com)

[Question] How is the “sharp left turn” defined?

Chris_Leong9 Dec 2022 0:04 UTC
13 points
3 comments1 min readLW link

Linkpost for a gen­er­al­ist al­gorith­mic learner: ca­pa­ble of car­ry­ing out sort­ing, short­est paths, string match­ing, con­vex hull find­ing in one network

lovetheusers9 Dec 2022 0:02 UTC
7 points
1 comment1 min readLW link
(twitter.com)

Timelines ARE rele­vant to al­ign­ment re­search (timelines 2 of ?)

Nathan Helm-Burger24 Aug 2022 0:19 UTC
11 points
5 comments6 min readLW link

Pro­saic mis­al­ign­ment from the Solomonoff Predictor

Cleo Nardo9 Dec 2022 17:53 UTC
11 points
0 comments5 min readLW link

[Question] Does a LLM have a util­ity func­tion?

Dagon9 Dec 2022 17:19 UTC
16 points
6 comments1 min readLW link

ML Safety at NeurIPS & Paradig­matic AI Safety? MLAISU W49

9 Dec 2022 10:38 UTC
14 points
0 comments4 min readLW link
(newsletter.apartresearch.com)

Take 8: Queer the in­ner/​outer al­ign­ment di­chotomy.

Charlie Steiner9 Dec 2022 17:46 UTC
26 points
2 comments2 min readLW link

My thoughts on OpenAI’s Align­ment plan

Donald Hobson10 Dec 2022 10:35 UTC
20 points
0 comments6 min readLW link

[ASoT] Nat­u­ral ab­strac­tions and AlphaZero

Ulisse Mini10 Dec 2022 17:53 UTC
31 points
1 comment1 min readLW link
(arxiv.org)

[Question] How promis­ing are le­gal av­enues to re­strict AI train­ing data?

thehalliard10 Dec 2022 16:31 UTC
9 points
2 comments1 min readLW link

Con­sider us­ing re­versible au­tomata for al­ign­ment research

Alex_Altair11 Dec 2022 1:00 UTC
81 points
29 comments2 min readLW link

[fic­tion] Our Fi­nal Hour

Mati_Roy11 Dec 2022 5:49 UTC
16 points
5 comments3 min readLW link

A crisis for on­line com­mu­ni­ca­tion: bots and bot users will over­run the In­ter­net?

Mitchell_Porter11 Dec 2022 21:11 UTC
23 points
11 comments1 min readLW link

Refram­ing in­ner alignment

davidad11 Dec 2022 13:53 UTC
47 points
13 comments4 min readLW link

Side-chan­nels: in­put ver­sus output

davidad12 Dec 2022 12:32 UTC
35 points
9 comments2 min readLW link

Psy­cholog­i­cal Di­sor­ders and Problems

12 Dec 2022 18:15 UTC
35 points
5 comments1 min readLW link

Prod­ding ChatGPT to solve a ba­sic alge­bra problem

Shmi12 Dec 2022 4:09 UTC
14 points
6 comments1 min readLW link
(twitter.com)

A brain­teaser for lan­guage models

Adam Scherlis12 Dec 2022 2:43 UTC
46 points
3 comments2 min readLW link

Take 9: No, RLHF/​IDA/​de­bate doesn’t solve outer al­ign­ment.

Charlie Steiner12 Dec 2022 11:51 UTC
36 points
14 comments2 min readLW link

12 ca­reer-re­lated ques­tions that may (or may not) be helpful for peo­ple in­ter­ested in al­ign­ment research

Akash12 Dec 2022 22:36 UTC
18 points
0 comments2 min readLW link

Finite Fac­tored Sets in Pictures

Magdalena Wache11 Dec 2022 18:49 UTC
149 points
31 comments12 min readLW link

Con­cept ex­trap­o­la­tion for hy­poth­e­sis generation

12 Dec 2022 22:09 UTC
20 points
2 comments3 min readLW link

Take 10: Fine-tun­ing with RLHF is aes­thet­i­cally un­satis­fy­ing.

Charlie Steiner13 Dec 2022 7:04 UTC
30 points
3 comments2 min readLW link

AI al­ign­ment is dis­tinct from its near-term applications

paulfchristiano13 Dec 2022 7:10 UTC
233 points
5 comments2 min readLW link
(ai-alignment.com)

Okay, I feel it now

g1 13 Dec 2022 11:01 UTC
84 points
14 comments1 min readLW link

What Does It Mean to Align AI With Hu­man Values?

Algon13 Dec 2022 16:56 UTC
8 points
3 comments1 min readLW link
(www.quantamagazine.org)

[Question] Is the ChatGPT-simu­lated Linux vir­tual ma­chine real?

Kenoubi13 Dec 2022 15:41 UTC
18 points
7 comments1 min readLW link

[In­terim re­search re­port] Tak­ing fea­tures out of su­per­po­si­tion with sparse autoencoders

13 Dec 2022 15:41 UTC
80 points
10 comments22 min readLW link

Ex­is­ten­tial AI Safety is NOT sep­a­rate from near-term applications

scasper13 Dec 2022 14:47 UTC
37 points
16 comments3 min readLW link

My AGI safety re­search—2022 re­view, ’23 plans

Steven Byrnes14 Dec 2022 15:15 UTC
34 points
6 comments6 min readLW link

Try­ing to dis­am­biguate differ­ent ques­tions about whether RLHF is “good”

Buck14 Dec 2022 4:03 UTC
92 points
40 comments7 min readLW link

Pre­dict­ing GPU performance

14 Dec 2022 16:27 UTC
59 points
24 comments1 min readLW link
(epochai.org)

[Question] Is the AI timeline too short to have chil­dren?

Yoreth14 Dec 2022 18:32 UTC
33 points
20 comments1 min readLW link

«Boundaries», Part 3b: Align­ment prob­lems in terms of bound­aries

Andrew_Critch14 Dec 2022 22:34 UTC
49 points
2 comments13 min readLW link

[Question] Is Paul Chris­ti­ano still as op­ti­mistic about Ap­proval-Directed Agents as he was in 2018?

Chris_Leong14 Dec 2022 23:28 UTC
8 points
0 comments1 min readLW link

Align­ing al­ign­ment with performance

Marv K14 Dec 2022 22:19 UTC
2 points
0 comments2 min readLW link

AI Ne­o­re­al­ism: a threat model & suc­cess crite­rion for ex­is­ten­tial safety

davidad15 Dec 2022 13:42 UTC
39 points
0 comments3 min readLW link

The next decades might be wild

Marius Hobbhahn15 Dec 2022 16:10 UTC
157 points
27 comments41 min readLW link

High-level hopes for AI alignment

HoldenKarnofsky15 Dec 2022 18:00 UTC
42 points
3 comments19 min readLW link
(www.cold-takes.com)

[Question] How is ARC plan­ning to use ELK?

jacquesthibs15 Dec 2022 20:11 UTC
23 points
5 comments1 min readLW link

AI over­hangs de­pend on whether al­gorithms, com­pute and data are sub­sti­tutes or complements

NathanBarnard16 Dec 2022 2:23 UTC
2 points
0 comments3 min readLW link

Paper: Trans­form­ers learn in-con­text by gra­di­ent descent

LawrenceC16 Dec 2022 11:10 UTC
26 points
11 comments2 min readLW link
(arxiv.org)

How im­por­tant are ac­cu­rate AI timelines for the op­ti­mal spend­ing sched­ule on AI risk in­ter­ven­tions?

Tristan Cook16 Dec 2022 16:05 UTC
27 points
2 comments1 min readLW link

Will Machines Ever Rule the World? MLAISU W50

Esben Kran16 Dec 2022 11:03 UTC
12 points
7 comments4 min readLW link
(newsletter.apartresearch.com)

Can we effi­ciently ex­plain model be­hav­iors?

paulfchristiano16 Dec 2022 19:40 UTC
63 points
0 comments9 min readLW link
(ai-alignment.com)

[Question] Col­lege Selec­tion Ad­vice for Tech­ni­cal Alignment

TempCollegeAsk16 Dec 2022 17:11 UTC
11 points
8 comments1 min readLW link

Paper: Con­sti­tu­tional AI: Harm­less­ness from AI Feed­back (An­thropic)

LawrenceC16 Dec 2022 22:12 UTC
60 points
10 comments1 min readLW link
(www.anthropic.com)

Pos­i­tive val­ues seem more ro­bust and last­ing than prohibitions

TurnTrout17 Dec 2022 21:43 UTC
42 points
12 comments2 min readLW link

Take 11: “Align­ing lan­guage mod­els” should be weirder.

Charlie Steiner18 Dec 2022 14:14 UTC
29 points
0 comments2 min readLW link

Why I think that teach­ing philos­o­phy is high impact

Eleni Angelou19 Dec 2022 3:11 UTC
5 points
0 comments2 min readLW link

Event [Berkeley]: Align­ment Col­lab­o­ra­tor Speed-Meeting

19 Dec 2022 2:24 UTC
18 points
2 comments1 min readLW link

The ‘Old AI’: Les­sons for AI gov­er­nance from early elec­tric­ity regulation

19 Dec 2022 2:42 UTC
7 points
0 comments13 min readLW link

Note on al­gorithms with mul­ti­ple trained components

Steven Byrnes20 Dec 2022 17:08 UTC
19 points
4 comments2 min readLW link

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

Remmelt19 Dec 2022 12:02 UTC
8 points
6 comments31 min readLW link

Next Level Seinfeld

Zvi19 Dec 2022 13:30 UTC
45 points
6 comments1 min readLW link
(thezvi.wordpress.com)

Solu­tion to The Align­ment Problem

Algon19 Dec 2022 20:12 UTC
10 points
0 comments2 min readLW link

Shard The­ory in Nine Th­e­ses: a Distil­la­tion and Crit­i­cal Appraisal

LawrenceC19 Dec 2022 22:52 UTC
80 points
14 comments17 min readLW link

The “Min­i­mal La­tents” Ap­proach to Nat­u­ral Abstractions

johnswentworth20 Dec 2022 1:22 UTC
41 points
14 comments12 min readLW link

Take 12: RLHF’s use is ev­i­dence that orgs will jam RL at real-world prob­lems.

Charlie Steiner20 Dec 2022 5:01 UTC
23 points
0 comments3 min readLW link

[link, 2019] AI paradigm: in­ter­ac­tive learn­ing from un­la­beled instructions

the gears to ascension20 Dec 2022 6:45 UTC
2 points
0 comments2 min readLW link
(jgrizou.github.io)

Dis­cov­er­ing Lan­guage Model Be­hav­iors with Model-Writ­ten Evaluations

20 Dec 2022 20:08 UTC
45 points
6 comments1 min readLW link
(www.anthropic.com)

Pod­cast: Tam­era Lan­ham on AI risk, threat mod­els, al­ign­ment pro­pos­als, ex­ter­nal­ized rea­son­ing over­sight, and work­ing at Anthropic

Akash20 Dec 2022 21:39 UTC
14 points
2 comments11 min readLW link

Google Search loses to ChatGPT fair and square

Shmi21 Dec 2022 8:11 UTC
12 points
6 comments1 min readLW link
(www.surgehq.ai)

A Com­pre­hen­sive Mechanis­tic In­ter­pretabil­ity Ex­plainer & Glossary

Neel Nanda21 Dec 2022 12:35 UTC
40 points
0 comments2 min readLW link
(neelnanda.io)

Price’s equa­tion for neu­ral networks

tailcalled21 Dec 2022 13:09 UTC
22 points
3 comments2 min readLW link

[Question] [DISC] Are Values Ro­bust?

DragonGod21 Dec 2022 1:00 UTC
12 points
5 comments2 min readLW link

Me­taphor.systems

the gears to ascension21 Dec 2022 21:31 UTC
9 points
2 comments1 min readLW link
(metaphor.systems)

The Hu­man’s Hid­den Utility Func­tion (Maybe)

lukeprog23 Jan 2012 19:39 UTC
64 points
90 comments3 min readLW link

Us­ing vec­tor fields to vi­su­al­ise prefer­ences and make them consistent

28 Jan 2020 19:44 UTC
41 points
32 comments11 min readLW link

[Ar­ti­cle re­view] Ar­tifi­cial In­tel­li­gence, Values, and Alignment

MichaelA9 Mar 2020 12:42 UTC
13 points
5 comments10 min readLW link

Clar­ify­ing some key hy­pothe­ses in AI alignment

15 Aug 2019 21:29 UTC
78 points
12 comments9 min readLW link

Failures in tech­nol­ogy fore­cast­ing? A re­ply to Ord and Yudkowsky

MichaelA8 May 2020 12:41 UTC
44 points
19 comments11 min readLW link

[Link and com­men­tary] The Offense-Defense Balance of Scien­tific Knowl­edge: Does Pub­lish­ing AI Re­search Re­duce Mi­suse?

MichaelA16 Feb 2020 19:56 UTC
24 points
4 comments3 min readLW link

How can In­ter­pretabil­ity help Align­ment?

23 May 2020 16:16 UTC
37 points
3 comments9 min readLW link

A Prob­lem With Patternism

B Jacobs19 May 2020 20:16 UTC
5 points
52 comments1 min readLW link

Goal-di­rect­ed­ness is be­hav­ioral, not structural

adamShimi8 Jun 2020 23:05 UTC
6 points
12 comments3 min readLW link

Learn­ing Deep Learn­ing: Join­ing data sci­ence re­search as a mathematician

magfrump19 Oct 2017 19:14 UTC
10 points
4 comments3 min readLW link

Will AI un­dergo dis­con­tin­u­ous progress?

Sammy Martin21 Feb 2020 22:16 UTC
26 points
21 comments20 min readLW link

The Value Defi­ni­tion Problem

Sammy Martin18 Nov 2019 19:56 UTC
14 points
6 comments11 min readLW link

Life at Three Tails of the Bell Curve

lsusr27 Jun 2020 8:49 UTC
63 points
10 comments4 min readLW link

How do take­off speeds af­fect the prob­a­bil­ity of bad out­comes from AGI?

KR29 Jun 2020 22:06 UTC
15 points
2 comments8 min readLW link

AI Benefits Post 2: How AI Benefits Differs from AI Align­ment & AI for Good

Cullen29 Jun 2020 17:00 UTC
8 points
7 comments2 min readLW link

Null-box­ing New­comb’s Problem

Yitz13 Jul 2020 16:32 UTC
33 points
10 comments4 min readLW link

No non­sense ver­sion of the “racial al­gorithm bias”

Yuxi_Liu13 Jul 2019 15:39 UTC
115 points
20 comments2 min readLW link

Ed­u­ca­tion 2.0 — A brand new ed­u­ca­tion system

aryan15 Jul 2020 10:09 UTC
−8 points
3 comments6 min readLW link

What it means to optimise

Neel Nanda25 Jul 2020 9:40 UTC
5 points
0 comments8 min readLW link
(www.neelnanda.io)

[Question] Where are peo­ple think­ing and talk­ing about global co­or­di­na­tion for AI safety?

Wei Dai22 May 2019 6:24 UTC
103 points
22 comments1 min readLW link

The strat­egy-steal­ing assumption

paulfchristiano16 Sep 2019 15:23 UTC
72 points
46 comments12 min readLW link3 reviews

Con­ver­sa­tion with Paul Christiano

abergal11 Sep 2019 23:20 UTC
44 points
6 comments30 min readLW link
(aiimpacts.org)

Tran­scrip­tion of Eliezer’s Jan­uary 2010 video Q&A

curiousepic14 Nov 2011 17:02 UTC
112 points
9 comments56 min readLW link

Re­sources for AI Align­ment Cartography

Gyrodiot4 Apr 2020 14:20 UTC
45 points
8 comments9 min readLW link

Thoughts on Ben Garfinkel’s “How sure are we about this AI stuff?”

David Scott Krueger (formerly: capybaralet)6 Feb 2019 19:09 UTC
25 points
17 comments1 min readLW link

An­nounce­ment: AI al­ign­ment prize round 2 win­ners and next round

cousin_it16 Apr 2018 3:08 UTC
64 points
29 comments2 min readLW link

An­nounce­ment: AI al­ign­ment prize round 3 win­ners and next round

cousin_it15 Jul 2018 7:40 UTC
93 points
7 comments1 min readLW link

Se­cu­rity Mind­set and the Lo­gis­tic Suc­cess Curve

Eliezer Yudkowsky26 Nov 2017 15:58 UTC
76 points
48 comments20 min readLW link

Ar­bital scrape

emmab6 Jun 2019 23:11 UTC
89 points
23 comments1 min readLW link

The Strangest Thing An AI Could Tell You

Eliezer Yudkowsky15 Jul 2009 2:27 UTC
116 points
605 comments2 min readLW link

Self-fulfilling correlations

PhilGoetz26 Aug 2010 21:07 UTC
144 points
50 comments3 min readLW link

Zoom In: An In­tro­duc­tion to Circuits

evhub10 Mar 2020 19:36 UTC
84 points
11 comments2 min readLW link
(distill.pub)

Should ethi­cists be in­side or out­side a pro­fes­sion?

Eliezer Yudkowsky12 Dec 2018 1:40 UTC
87 points
6 comments9 min readLW link

Im­plicit extortion

paulfchristiano13 Apr 2018 16:33 UTC
29 points
16 comments6 min readLW link
(ai-alignment.com)

Bayesian Judo

Eliezer Yudkowsky31 Jul 2007 5:53 UTC
87 points
108 comments1 min readLW link

An­nounc­ing Align­men­tFo­rum.org Beta

Raemon10 Jul 2018 20:19 UTC
67 points
35 comments2 min readLW link

An­nounc­ing the Align­ment Newsletter

Rohin Shah9 Apr 2018 21:16 UTC
29 points
3 comments1 min readLW link

He­len Toner on China, CSET, and AI

Rob Bensinger21 Apr 2019 4:10 UTC
68 points
3 comments7 min readLW link
(rationallyspeakingpodcast.org)

A sim­ple en­vi­ron­ment for show­ing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
70 points
9 comments2 min readLW link

The E-Coli Test for AI Alignment

johnswentworth16 Dec 2018 8:10 UTC
69 points
24 comments1 min readLW link

Re­cent Progress in the The­ory of Neu­ral Networks

interstice4 Dec 2019 23:11 UTC
76 points
9 comments9 min readLW link

The Art of the Ar­tifi­cial: In­sights from ‘Ar­tifi­cial In­tel­li­gence: A Modern Ap­proach’

TurnTrout25 Mar 2018 6:55 UTC
31 points
8 comments15 min readLW link

Head­ing off a near-term AGI arms race

lincolnquirk22 Aug 2012 14:23 UTC
10 points
70 comments1 min readLW link

Out­perform­ing the hu­man Atari benchmark

Vaniver31 Mar 2020 19:33 UTC
58 points
5 comments1 min readLW link
(deepmind.com)

Con­ver­sa­tional Pre­sen­ta­tion of Why Au­toma­tion is Differ­ent This Time

ryan_b17 Jan 2018 22:11 UTC
33 points
26 comments1 min readLW link

A rant against robots

Lê Nguyên Hoang14 Jan 2020 22:03 UTC
64 points
7 comments5 min readLW link

Clar­ify­ing “AI Align­ment”

paulfchristiano15 Nov 2018 14:41 UTC
64 points
82 comments3 min readLW link2 reviews

Tiling Agents for Self-Mod­ify­ing AI (OPFAI #2)

Eliezer Yudkowsky6 Jun 2013 20:24 UTC
84 points
259 comments3 min readLW link

EDT solves 5 and 10 with con­di­tional oracles

jessicata30 Sep 2018 7:57 UTC
59 points
8 comments13 min readLW link

AGI and Friendly AI in the dom­i­nant AI textbook

lukeprog11 Mar 2011 4:12 UTC
73 points
27 comments3 min readLW link

Ta­boo­ing ‘Agent’ for Pro­saic Alignment

Hjalmar_Wijk23 Aug 2019 2:55 UTC
54 points
10 comments6 min readLW link

Is this what FAI out­reach suc­cess looks like?

Charlie Steiner9 Mar 2018 13:12 UTC
17 points
3 comments1 min readLW link
(www.youtube.com)

Align­ing a toy model of optimization

paulfchristiano28 Jun 2019 20:23 UTC
52 points
26 comments3 min readLW link

Deep­Mind ar­ti­cle: AI Safety Gridworlds

scarcegreengrass30 Nov 2017 16:13 UTC
24 points
5 comments1 min readLW link
(deepmind.com)

Bot­world: a cel­lu­lar au­toma­ton for study­ing self-mod­ify­ing agents em­bed­ded in their environment

So8res12 Apr 2014 0:56 UTC
78 points
55 comments7 min readLW link

“UDT2” and “against UD+ASSA”

Wei Dai12 May 2019 4:18 UTC
50 points
7 comments7 min readLW link

Us­ing ly­ing to de­tect hu­man values

Stuart_Armstrong15 Mar 2018 11:37 UTC
19 points
6 comments1 min readLW link

Another AI Win­ter?

PeterMcCluskey25 Dec 2019 0:58 UTC
47 points
14 comments4 min readLW link
(www.bayesianinvestor.com)

Model­ing AGI Safety Frame­works with Causal In­fluence Diagrams

Ramana Kumar21 Jun 2019 12:50 UTC
43 points
6 comments1 min readLW link
(arxiv.org)

The Ur­gent Meta-Ethics of Friendly Ar­tifi­cial Intelligence

lukeprog1 Feb 2011 14:15 UTC
76 points
252 comments1 min readLW link

Henry Kiss­inger: AI Could Mean the End of Hu­man History

ESRogs15 May 2018 20:11 UTC
17 points
12 comments1 min readLW link
(www.theatlantic.com)

Self-con­firm­ing pre­dic­tions can be ar­bi­trar­ily bad

Stuart_Armstrong3 May 2019 11:34 UTC
46 points
11 comments5 min readLW link

A Vi­su­al­iza­tion of Nick Bostrom’s Superintelligence

[deleted]23 Jul 2014 0:24 UTC
62 points
28 comments3 min readLW link

[Question] What are the most plau­si­ble “AI Safety warn­ing shot” sce­nar­ios?

Daniel Kokotajlo26 Mar 2020 20:59 UTC
35 points
16 comments1 min readLW link

AGI in a vuln­er­a­ble world

26 Mar 2020 0:10 UTC
42 points
21 comments1 min readLW link
(aiimpacts.org)

Three Kinds of Competitiveness

Daniel Kokotajlo31 Mar 2020 1:00 UTC
36 points
18 comments5 min readLW link

Biolog­i­cal hu­mans and the ris­ing tide of AI

cousin_it29 Jan 2018 16:04 UTC
22 points
23 comments1 min readLW link

HLAI 2018 Field Report

Gordon Seidoh Worley29 Aug 2018 0:11 UTC
48 points
12 comments5 min readLW link

Mag­i­cal Categories

Eliezer Yudkowsky24 Aug 2008 19:51 UTC
65 points
133 comments9 min readLW link

Align­ment as Translation

johnswentworth19 Mar 2020 21:40 UTC
62 points
39 comments4 min readLW link

Re­solv­ing hu­man val­ues, com­pletely and adequately

Stuart_Armstrong30 Mar 2018 3:35 UTC
32 points
30 comments12 min readLW link

Will trans­parency help catch de­cep­tion? Per­haps not

Matthew Barnett4 Nov 2019 20:52 UTC
43 points
5 comments7 min readLW link

A dilemma for pro­saic AI alignment

Daniel Kokotajlo17 Dec 2019 22:11 UTC
40 points
30 comments3 min readLW link

[1911.08265] Mas­ter­ing Atari, Go, Chess and Shogi by Plan­ning with a Learned Model | Arxiv

DragonGod21 Nov 2019 1:18 UTC
52 points
4 comments1 min readLW link
(arxiv.org)

Glenn Beck dis­cusses the Sin­gu­lar­ity, cites SI researchers

Brihaspati12 Jun 2012 16:45 UTC
73 points
183 comments10 min readLW link

Siren wor­lds and the per­ils of over-op­ti­mised search

Stuart_Armstrong7 Apr 2014 11:00 UTC
73 points
417 comments7 min readLW link

Hu­man-Aligned AI Sum­mer School: A Summary

Michaël Trazzi11 Aug 2018 8:11 UTC
39 points
5 comments4 min readLW link

Top 9+2 myths about AI risk

Stuart_Armstrong29 Jun 2015 20:41 UTC
68 points
45 comments2 min readLW link

Learn­ing bi­ases and re­wards simultaneously

Rohin Shah6 Jul 2019 1:45 UTC
41 points
3 comments4 min readLW link

Look­ing for AI Safety Ex­perts to Provide High Level Guidance for RAISE

Ofer6 May 2018 2:06 UTC
17 points
5 comments1 min readLW link

[Question] How much fund­ing and re­searchers were in AI, and AI Safety, in 2018?

Raemon3 Mar 2019 21:46 UTC
41 points
11 comments1 min readLW link

Deep learn­ing—deeper flaws?

Richard_Ngo24 Sep 2018 18:40 UTC
39 points
17 comments4 min readLW link
(thinkingcomplete.blogspot.com)

A model of UDT with a con­crete prior over log­i­cal statements

Benya28 Aug 2012 21:45 UTC
62 points
24 comments4 min readLW link

Mal­ign gen­er­al­iza­tion with­out in­ter­nal search

Matthew Barnett12 Jan 2020 18:03 UTC
43 points
12 comments4 min readLW link

An­nounc­ing the sec­ond AI Safety Camp

Lachouette11 Jun 2018 18:59 UTC
34 points
0 comments1 min readLW link

Vaniver’s View on Fac­tored Cognition

Vaniver23 Aug 2019 2:54 UTC
48 points
4 comments8 min readLW link

De­tached Lever Fallacy

Eliezer Yudkowsky31 Jul 2008 18:57 UTC
70 points
41 comments7 min readLW link

When to use quantilization

RyanCarey5 Feb 2019 17:17 UTC
65 points
5 comments4 min readLW link

The first AI Safety Camp & onwards

Remmelt7 Jun 2018 20:13 UTC
45 points
0 comments8 min readLW link

Learn­ing prefer­ences by look­ing at the world

Rohin Shah12 Feb 2019 22:25 UTC
43 points
10 comments7 min readLW link
(bair.berkeley.edu)

Sel­ling Nonapples

Eliezer Yudkowsky13 Nov 2008 20:10 UTC
71 points
78 comments7 min readLW link

The AI Align­ment Prob­lem Has Already Been Solved(?) Once

SquirrelInHell22 Apr 2017 13:24 UTC
50 points
45 comments4 min readLW link
(squirrelinhell.blogspot.com)

Trace README

johnswentworth11 Mar 2020 21:08 UTC
35 points
1 comment8 min readLW link

[Link] Com­puter im­proves its Civ­i­liza­tion II game­play by read­ing the manual

Kaj_Sotala13 Jul 2011 12:00 UTC
49 points
5 comments4 min readLW link

Idea: Open Ac­cess AI Safety Journal

Gordon Seidoh Worley23 Mar 2018 18:27 UTC
28 points
11 comments1 min readLW link

Another take on agent foun­da­tions: for­mal­iz­ing zero-shot reasoning

zhukeepa1 Jul 2018 6:12 UTC
59 points
20 comments12 min readLW link

Log­i­cal Up­date­less­ness as a Ro­bust Del­e­ga­tion Problem

Scott Garrabrant27 Oct 2017 21:16 UTC
30 points
2 comments2 min readLW link

Some thoughts af­ter read­ing Ar­tifi­cial In­tel­li­gence: A Modern Approach

swift_spiral19 Mar 2019 23:39 UTC
38 points
4 comments2 min readLW link

AI safety with­out goal-di­rected behavior

Rohin Shah7 Jan 2019 7:48 UTC
65 points
15 comments4 min readLW link

No Univer­sally Com­pel­ling Arguments

Eliezer Yudkowsky26 Jun 2008 8:29 UTC
62 points
57 comments5 min readLW link

What AI Safety Re­searchers Have Writ­ten About the Na­ture of Hu­man Values

avturchin16 Jan 2019 13:59 UTC
50 points
3 comments15 min readLW link

Disam­biguat­ing “al­ign­ment” and re­lated no­tions

David Scott Krueger (formerly: capybaralet)5 Jun 2018 15:35 UTC
22 points
21 comments2 min readLW link

In­duc­tive bi­ases stick around

evhub18 Dec 2019 19:52 UTC
63 points
14 comments3 min readLW link

Bill Gates: prob­lem of strong AI with con­flict­ing goals “very wor­thy of study and time”

Paul Crowley22 Jan 2015 20:21 UTC
73 points
18 comments1 min readLW link

So You Want to Save the World

lukeprog1 Jan 2012 7:39 UTC
54 points
149 comments12 min readLW link

Me­taphilo­soph­i­cal com­pe­tence can’t be dis­en­tan­gled from alignment

zhukeepa1 Apr 2018 0:38 UTC
32 points
39 comments3 min readLW link

Some Thoughts on Metaphilosophy

Wei Dai10 Feb 2019 0:28 UTC
62 points
27 comments4 min readLW link

Rea­sons com­pute may not drive AI ca­pa­bil­ities growth

Tristan H19 Dec 2018 22:13 UTC
42 points
10 comments8 min readLW link

Dis­tance Func­tions are Hard

Grue_Slinky13 Aug 2019 17:33 UTC
31 points
19 comments6 min readLW link

Take­aways from safety by de­fault interviews

3 Apr 2020 17:20 UTC
28 points
2 comments13 min readLW link
(aiimpacts.org)

Bridge Col­lapse: Re­duc­tion­ism as Eng­ineer­ing Problem

Rob Bensinger18 Feb 2014 22:03 UTC
78 points
62 comments15 min readLW link

Prob­a­bil­ity as Min­i­mal Map

johnswentworth1 Sep 2019 19:19 UTC
49 points
10 comments5 min readLW link

Policy Alignment

abramdemski30 Jun 2018 0:24 UTC
50 points
25 comments8 min readLW link

Stable Poin­t­ers to Value: An Agent Embed­ded in Its Own Utility Function

abramdemski17 Aug 2017 0:22 UTC
15 points
9 comments5 min readLW link

Stable Poin­t­ers to Value II: En­vi­ron­men­tal Goals

abramdemski9 Feb 2018 6:03 UTC
18 points
2 comments4 min readLW link

The Ar­gu­ment from Philo­soph­i­cal Difficulty

Wei Dai10 Feb 2019 0:28 UTC
54 points
31 comments1 min readLW link

hu­man psy­chol­in­guists: a crit­i­cal appraisal

nostalgebraist31 Dec 2019 0:20 UTC
174 points
59 comments16 min readLW link2 reviews
(nostalgebraist.tumblr.com)

My take on agent foun­da­tions: for­mal­iz­ing metaphilo­soph­i­cal competence

zhukeepa1 Apr 2018 6:33 UTC
20 points
6 comments1 min readLW link

Cri­tique my Model: The EV of AGI to Selfish Individuals

ozziegooen8 Apr 2018 20:04 UTC
19 points
9 comments4 min readLW link

AI Safety De­bate and Its Applications

VojtaKovarik23 Jul 2019 22:31 UTC
36 points
5 comments12 min readLW link

TAISU 2019 Field Report

Gordon Seidoh Worley15 Oct 2019 1:09 UTC
36 points
5 comments5 min readLW link

Hu­man-AI Collaboration

Rohin Shah22 Oct 2019 6:32 UTC
42 points
7 comments2 min readLW link
(bair.berkeley.edu)

An­a­lyz­ing the Prob­lem GPT-3 is Try­ing to Solve

adamShimi6 Aug 2020 21:58 UTC
16 points
2 comments4 min readLW link

[LINK] Speed su­per­in­tel­li­gence?

Stuart_Armstrong14 Aug 2014 15:57 UTC
53 points
20 comments1 min readLW link

A big Sin­gu­lar­ity-themed Hol­ly­wood movie out in April offers many op­por­tu­ni­ties to talk about AI risk

chaosmage7 Jan 2014 17:48 UTC
49 points
85 comments1 min readLW link

New pa­per: (When) is Truth-tel­ling Fa­vored in AI de­bate?

VojtaKovarik26 Dec 2019 19:59 UTC
32 points
7 comments5 min readLW link
(medium.com)

Ar­tifi­cial Addition

Eliezer Yudkowsky20 Nov 2007 7:58 UTC
68 points
129 comments6 min readLW link

Ex­plor­ing safe exploration

evhub6 Jan 2020 21:07 UTC
37 points
8 comments3 min readLW link

‘Dumb’ AI ob­serves and ma­nipu­lates controllers

Stuart_Armstrong13 Jan 2015 13:35 UTC
52 points
19 comments2 min readLW link

AI Read­ing Group Thoughts (1/​?): The Man­date of Heaven

Alicorn10 Aug 2018 0:24 UTC
45 points
18 comments4 min readLW link

AI Read­ing Group Thoughts (2/​?): Re­con­struc­tive Psychosurgery

Alicorn25 Sep 2018 4:25 UTC
27 points
6 comments3 min readLW link

(notes on) Policy Desider­ata for Su­per­in­tel­li­gent AI: A Vec­tor Field Approach

Ben Pace4 Feb 2019 22:08 UTC
43 points
5 comments7 min readLW link

AI Gover­nance: A Re­search Agenda

habryka5 Sep 2018 18:00 UTC
25 points
3 comments1 min readLW link
(www.fhi.ox.ac.uk)

Global on­line de­bate on the gov­er­nance of AI

CarolineJ5 Jan 2018 15:31 UTC
8 points
5 comments1 min readLW link

[AN #61] AI policy and gov­er­nance, from two peo­ple in the field

Rohin Shah5 Aug 2019 17:00 UTC
12 points
2 comments9 min readLW link
(mailchi.mp)

2019 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks19 Dec 2019 3:00 UTC
130 points
18 comments62 min readLW link

[Question] What’s wrong with these analo­gies for un­der­stand­ing In­formed Over­sight and IDA?

Wei Dai20 Mar 2019 9:11 UTC
35 points
3 comments1 min readLW link

The Align­ment Newslet­ter #1: 04/​09/​18

Rohin Shah9 Apr 2018 16:00 UTC
12 points
3 comments4 min readLW link

The Align­ment Newslet­ter #2: 04/​16/​18

Rohin Shah16 Apr 2018 16:00 UTC
8 points
0 comments5 min readLW link

The Align­ment Newslet­ter #3: 04/​23/​18

Rohin Shah23 Apr 2018 16:00 UTC
9 points
0 comments6 min readLW link

The Align­ment Newslet­ter #4: 04/​30/​18

Rohin Shah30 Apr 2018 16:00 UTC
8 points
0 comments3 min readLW link

The Align­ment Newslet­ter #5: 05/​07/​18

Rohin Shah7 May 2018 16:00 UTC
8 points
0 comments7 min readLW link

The Align­ment Newslet­ter #6: 05/​14/​18

Rohin Shah14 May 2018 16:00 UTC
8 points
0 comments2 min readLW link

The Align­ment Newslet­ter #7: 05/​21/​18

Rohin Shah21 May 2018 16:00 UTC
8 points
0 comments5 min readLW link

The Align­ment Newslet­ter #8: 05/​28/​18

Rohin Shah28 May 2018 16:00 UTC
8 points
0 comments6 min readLW link

The Align­ment Newslet­ter #9: 06/​04/​18

Rohin Shah4 Jun 2018 16:00 UTC
8 points
0 comments2 min readLW link

The Align­ment Newslet­ter #10: 06/​11/​18

Rohin Shah11 Jun 2018 16:00 UTC
16 points
0 comments9 min readLW link

The Align­ment Newslet­ter #11: 06/​18/​18

Rohin Shah18 Jun 2018 16:00 UTC
8 points
0 comments10 min readLW link

The Align­ment Newslet­ter #12: 06/​25/​18

Rohin Shah25 Jun 2018 16:00 UTC
15 points
0 comments3 min readLW link

Align­ment Newslet­ter #13: 07/​02/​18

Rohin Shah2 Jul 2018 16:10 UTC
70 points
12 comments8 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #14

Rohin Shah9 Jul 2018 16:20 UTC
14 points
0 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #15: 07/​16/​18

Rohin Shah16 Jul 2018 16:10 UTC
42 points
0 comments15 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #17

Rohin Shah30 Jul 2018 16:10 UTC
32 points
0 comments13 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #18

Rohin Shah6 Aug 2018 16:00 UTC
17 points
0 comments10 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #19

Rohin Shah14 Aug 2018 2:10 UTC
18 points
0 comments13 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #20

Rohin Shah20 Aug 2018 16:00 UTC
12 points
2 comments6 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #21

Rohin Shah27 Aug 2018 16:20 UTC
25 points
0 comments7 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #22

Rohin Shah3 Sep 2018 16:10 UTC
18 points
0 comments6 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #23

Rohin Shah10 Sep 2018 17:10 UTC
16 points
0 comments7 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #24

Rohin Shah17 Sep 2018 16:20 UTC
10 points
6 comments12 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #25

Rohin Shah24 Sep 2018 16:10 UTC
18 points
3 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #26

Rohin Shah2 Oct 2018 16:10 UTC
13 points
0 comments7 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #27

Rohin Shah9 Oct 2018 1:10 UTC
16 points
0 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #28

Rohin Shah15 Oct 2018 21:20 UTC
11 points
0 comments8 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #29

Rohin Shah22 Oct 2018 16:20 UTC
15 points
0 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #30

Rohin Shah29 Oct 2018 16:10 UTC
29 points
2 comments6 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #31

Rohin Shah5 Nov 2018 23:50 UTC
17 points
0 comments12 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #32

Rohin Shah12 Nov 2018 17:20 UTC
18 points
0 comments12 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #33

Rohin Shah19 Nov 2018 17:20 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #34

Rohin Shah26 Nov 2018 23:10 UTC
24 points
0 comments10 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #35

Rohin Shah4 Dec 2018 1:10 UTC
15 points
0 comments6 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #37

Rohin Shah17 Dec 2018 19:10 UTC
25 points
4 comments10 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #38

Rohin Shah25 Dec 2018 16:10 UTC
9 points
0 comments8 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #39

Rohin Shah1 Jan 2019 8:10 UTC
32 points
2 comments5 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #40

Rohin Shah8 Jan 2019 20:10 UTC
21 points
2 comments5 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #41

Rohin Shah17 Jan 2019 8:10 UTC
22 points
6 comments10 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #42

Rohin Shah22 Jan 2019 2:00 UTC
20 points
1 comment10 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #43

Rohin Shah29 Jan 2019 21:10 UTC
14 points
2 comments13 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #44

Rohin Shah6 Feb 2019 8:30 UTC
18 points
0 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #45

Rohin Shah14 Feb 2019 2:10 UTC
25 points
2 comments8 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #46

Rohin Shah22 Feb 2019 0:10 UTC
12 points
0 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #48

Rohin Shah11 Mar 2019 21:10 UTC
29 points
14 comments9 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #49

Rohin Shah20 Mar 2019 4:20 UTC
23 points
1 comment11 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #50

Rohin Shah28 Mar 2019 18:10 UTC
15 points
2 comments10 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #51

Rohin Shah3 Apr 2019 4:10 UTC
25 points
2 comments15 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #52

Rohin Shah6 Apr 2019 1:20 UTC
19 points
1 comment8 min readLW link
(mailchi.mp)

Align­ment Newslet­ter One Year Retrospective

Rohin Shah10 Apr 2019 6:58 UTC
93 points
31 comments21 min readLW link

Align­ment Newslet­ter #53

Rohin Shah18 Apr 2019 17:20 UTC
20 points
0 comments8 min readLW link
(mailchi.mp)

[AN #54] Box­ing a finite-hori­zon AI sys­tem to keep it unambitious

Rohin Shah28 Apr 2019 5:20 UTC
20 points
0 comments8 min readLW link
(mailchi.mp)

[AN #55] Reg­u­la­tory mar­kets and in­ter­na­tional stan­dards as a means of en­sur­ing benefi­cial AI

Rohin Shah5 May 2019 2:20 UTC
17 points
2 comments8 min readLW link
(mailchi.mp)

[AN #56] Should ML re­searchers stop run­ning ex­per­i­ments be­fore mak­ing hy­pothe­ses?

Rohin Shah21 May 2019 2:20 UTC
21 points
8 comments9 min readLW link
(mailchi.mp)

[AN #57] Why we should fo­cus on ro­bust­ness in AI safety, and the analo­gous prob­lems in programming

Rohin Shah5 Jun 2019 23:20 UTC
26 points
15 comments7 min readLW link
(mailchi.mp)

[AN #58] Mesa op­ti­miza­tion: what it is, and why we should care

Rohin Shah24 Jun 2019 16:10 UTC
54 points
9 comments8 min readLW link
(mailchi.mp)

[AN #59] How ar­gu­ments for AI risk have changed over time

Rohin Shah8 Jul 2019 17:20 UTC
43 points
4 comments7 min readLW link
(mailchi.mp)

[AN #60] A new AI challenge: Minecraft agents that as­sist hu­man play­ers in cre­ative mode

Rohin Shah22 Jul 2019 17:00 UTC
23 points
6 comments9 min readLW link
(mailchi.mp)

[AN #62] Are ad­ver­sar­ial ex­am­ples caused by real but im­per­cep­ti­ble fea­tures?

Rohin Shah22 Aug 2019 17:10 UTC
27 points
10 comments9 min readLW link
(mailchi.mp)

[AN #63] How ar­chi­tec­ture search, meta learn­ing, and en­vi­ron­ment de­sign could lead to gen­eral intelligence

Rohin Shah10 Sep 2019 19:10 UTC
21 points
12 comments8 min readLW link
(mailchi.mp)

[AN #64]: Us­ing Deep RL and Re­ward Uncer­tainty to In­cen­tivize Prefer­ence Learning

Rohin Shah16 Sep 2019 17:10 UTC
11 points
8 comments7 min readLW link
(mailchi.mp)

[AN #65]: Learn­ing use­ful skills by watch­ing hu­mans “play”

Rohin Shah23 Sep 2019 17:30 UTC
11 points
0 comments9 min readLW link
(mailchi.mp)

[AN #66]: De­com­pos­ing ro­bust­ness into ca­pa­bil­ity ro­bust­ness and al­ign­ment robustness

Rohin Shah30 Sep 2019 18:00 UTC
12 points
1 comment7 min readLW link
(mailchi.mp)

[AN #67]: Creat­ing en­vi­ron­ments in which to study in­ner al­ign­ment failures

Rohin Shah7 Oct 2019 17:10 UTC
17 points
0 comments8 min readLW link
(mailchi.mp)

[AN #68]: The at­tain­able util­ity the­ory of impact

Rohin Shah14 Oct 2019 17:00 UTC
17 points
0 comments8 min readLW link
(mailchi.mp)

[AN #69] Stu­art Rus­sell’s new book on why we need to re­place the stan­dard model of AI

Rohin Shah19 Oct 2019 0:30 UTC
60 points
12 comments15 min readLW link
(mailchi.mp)

[AN #70]: Agents that help hu­mans who are still learn­ing about their own preferences

Rohin Shah23 Oct 2019 17:10 UTC
16 points
0 comments9 min readLW link
(mailchi.mp)

[AN #71]: Avoid­ing re­ward tam­per­ing through cur­rent-RF optimization

Rohin Shah30 Oct 2019 17:10 UTC
12 points
0 comments7 min readLW link
(mailchi.mp)

[AN #72]: Align­ment, ro­bust­ness, method­ol­ogy, and sys­tem build­ing as re­search pri­ori­ties for AI safety

Rohin Shah6 Nov 2019 18:10 UTC
26 points
4 comments10 min readLW link
(mailchi.mp)

[AN #73]: De­tect­ing catas­trophic failures by learn­ing how agents tend to break

Rohin Shah13 Nov 2019 18:10 UTC
11 points
0 comments7 min readLW link
(mailchi.mp)

[AN #74]: Separat­ing benefi­cial AI into com­pe­tence, al­ign­ment, and cop­ing with impacts

Rohin Shah20 Nov 2019 18:20 UTC
19 points
0 comments7 min readLW link
(mailchi.mp)

[AN #75]: Solv­ing Atari and Go with learned game mod­els, and thoughts from a MIRI employee

Rohin Shah27 Nov 2019 18:10 UTC
38 points
1 comment10 min readLW link
(mailchi.mp)

[AN #76]: How dataset size af­fects ro­bust­ness, and bench­mark­ing safe ex­plo­ra­tion by mea­sur­ing con­straint violations

Rohin Shah4 Dec 2019 18:10 UTC
14 points
6 comments9 min readLW link
(mailchi.mp)

[AN #77]: Dou­ble de­scent: a unifi­ca­tion of statis­ti­cal the­ory and mod­ern ML practice

Rohin Shah18 Dec 2019 18:30 UTC
21 points
4 comments14 min readLW link
(mailchi.mp)

[AN #78] For­mal­iz­ing power and in­stru­men­tal con­ver­gence, and the end-of-year AI safety char­ity comparison

Rohin Shah26 Dec 2019 1:10 UTC
26 points
10 comments9 min readLW link
(mailchi.mp)

[AN #79]: Re­cur­sive re­ward mod­el­ing as an al­ign­ment tech­nique in­te­grated with deep RL

Rohin Shah1 Jan 2020 18:00 UTC
13 points
0 comments12 min readLW link
(mailchi.mp)

[AN #81]: Univer­sal­ity as a po­ten­tial solu­tion to con­cep­tual difficul­ties in in­tent alignment

Rohin Shah8 Jan 2020 18:00 UTC
31 points
4 comments11 min readLW link
(mailchi.mp)

[AN #82]: How OpenAI Five dis­tributed their train­ing computation

Rohin Shah15 Jan 2020 18:20 UTC
19 points
0 comments8 min readLW link
(mailchi.mp)

[AN #83]: Sam­ple-effi­cient deep learn­ing with ReMixMatch

Rohin Shah22 Jan 2020 18:10 UTC
15 points
4 comments11 min readLW link
(mailchi.mp)

[AN #84] Re­view­ing AI al­ign­ment work in 2018-19

Rohin Shah29 Jan 2020 18:30 UTC
23 points
0 comments6 min readLW link
(mailchi.mp)

[AN #85]: The nor­ma­tive ques­tions we should be ask­ing for AI al­ign­ment, and a sur­pris­ingly good chatbot

Rohin Shah5 Feb 2020 18:20 UTC
14 points
2 comments7 min readLW link
(mailchi.mp)

[AN #86]: Im­prov­ing de­bate and fac­tored cog­ni­tion through hu­man experiments

Rohin Shah12 Feb 2020 18:10 UTC
14 points
0 comments9 min readLW link
(mailchi.mp)

[AN #87]: What might hap­pen as deep learn­ing scales even fur­ther?

Rohin Shah19 Feb 2020 18:20 UTC
28 points
0 comments4 min readLW link
(mailchi.mp)

[AN #88]: How the prin­ci­pal-agent liter­a­ture re­lates to AI risk

Rohin Shah27 Feb 2020 9:10 UTC
18 points
0 comments9 min readLW link
(mailchi.mp)

[AN #89]: A unify­ing for­mal­ism for prefer­ence learn­ing algorithms

Rohin Shah4 Mar 2020 18:20 UTC
16 points
0 comments9 min readLW link
(mailchi.mp)

[AN #90]: How search land­scapes can con­tain self-re­in­forc­ing feed­back loops

Rohin Shah11 Mar 2020 17:30 UTC
11 points
6 comments8 min readLW link
(mailchi.mp)

[AN #91]: Con­cepts, im­ple­men­ta­tions, prob­lems, and a bench­mark for im­pact measurement

Rohin Shah18 Mar 2020 17:10 UTC
15 points
10 comments13 min readLW link
(mailchi.mp)

[AN #92]: Learn­ing good rep­re­sen­ta­tions with con­trastive pre­dic­tive coding

Rohin Shah25 Mar 2020 17:20 UTC
18 points
1 comment10 min readLW link
(mailchi.mp)

[AN #93]: The Precipice we’re stand­ing at, and how we can back away from it

Rohin Shah1 Apr 2020 17:10 UTC
24 points
0 comments7 min readLW link
(mailchi.mp)

Fore­cast­ing AI Progress: A Re­search Agenda

10 Aug 2020 1:04 UTC
39 points
4 comments1 min readLW link

The Steer­ing Problem

paulfchristiano13 Nov 2018 17:14 UTC
43 points
12 comments7 min readLW link

Will hu­mans build goal-di­rected agents?

Rohin Shah5 Jan 2019 1:33 UTC
51 points
43 comments5 min readLW link

Pro­saic AI alignment

paulfchristiano20 Nov 2018 13:56 UTC
40 points
10 comments8 min readLW link

David Chalmers’ “The Sin­gu­lar­ity: A Philo­soph­i­cal Anal­y­sis”

lukeprog29 Jan 2011 2:52 UTC
55 points
203 comments4 min readLW link

[Talk] Paul Chris­ti­ano on his al­ign­ment taxonomy

jp27 Sep 2019 18:37 UTC
31 points
1 comment1 min readLW link
(www.youtube.com)

Dreams of AI Design

Eliezer Yudkowsky27 Aug 2008 4:04 UTC
26 points
61 comments5 min readLW link

Qual­i­ta­tive Strate­gies of Friendliness

Eliezer Yudkowsky30 Aug 2008 2:12 UTC
30 points
56 comments12 min readLW link

Or­a­cles, se­quence pre­dic­tors, and self-con­firm­ing predictions

Stuart_Armstrong3 May 2019 14:09 UTC
22 points
0 comments3 min readLW link

Self-con­firm­ing prophe­cies, and sim­plified Or­a­cle designs

Stuart_Armstrong28 Jun 2019 9:57 UTC
6 points
1 comment5 min readLW link

In­vest­ment idea: bas­ket of tech stocks weighted to­wards AI

ioannes12 Aug 2020 21:30 UTC
14 points
7 comments3 min readLW link

Con­cep­tual is­sues in AI safety: the paradig­matic gap

vedevazz24 Jun 2018 15:09 UTC
33 points
0 comments1 min readLW link
(www.foldl.me)

Disagree­ment with Paul: al­ign­ment induction

Stuart_Armstrong10 Sep 2018 13:54 UTC
31 points
6 comments1 min readLW link

Largest open col­lec­tion quotes about AI

teradimich12 Jul 2019 17:18 UTC
35 points
2 comments3 min readLW link
(drive.google.com)

S.E.A.R.L.E’s COBOL room

Stuart_Armstrong1 Feb 2013 20:29 UTC
52 points
36 comments2 min readLW link

In­tro­duc­ing Cor­rigi­bil­ity (an FAI re­search sub­field)

So8res20 Oct 2014 21:09 UTC
52 points
28 comments3 min readLW link

NES-game play­ing AI [video link and AI-box­ing-re­lated com­ment]

Dr_Manhattan12 Apr 2013 13:11 UTC
42 points
22 comments1 min readLW link

On un­fix­ably un­safe AGI architectures

Steven Byrnes19 Feb 2020 21:16 UTC
33 points
8 comments5 min readLW link

To con­tribute to AI safety, con­sider do­ing AI research

Vika16 Jan 2016 20:42 UTC
39 points
39 comments2 min readLW link

Ghosts in the Machine

Eliezer Yudkowsky17 Jun 2008 23:29 UTC
54 points
30 comments4 min readLW link

Tech­ni­cal AGI safety re­search out­side AI

Richard_Ngo18 Oct 2019 15:00 UTC
43 points
3 comments3 min readLW link

De­ci­pher­ing China’s AI Dream

Qiaochu_Yuan18 Mar 2018 3:26 UTC
12 points
2 comments1 min readLW link
(www.fhi.ox.ac.uk)

Above-Aver­age AI Scientists

Eliezer Yudkowsky28 Sep 2008 11:04 UTC
57 points
97 comments8 min readLW link

The Na­ture of Logic

Eliezer Yudkowsky15 Nov 2008 6:20 UTC
37 points
12 comments10 min readLW link

Or­a­cle paper

Stuart_Armstrong13 Dec 2017 14:59 UTC
12 points
7 comments1 min readLW link

AI Align­ment Writ­ing Day Roundup #1

Ben Pace30 Aug 2019 1:26 UTC
32 points
12 comments1 min readLW link

Notes on the Safety in Ar­tifi­cial In­tel­li­gence conference

UmamiSalami1 Jul 2016 0:36 UTC
40 points
15 comments13 min readLW link

Rein­ter­pret­ing “AI and Com­pute”

habryka25 Dec 2018 21:12 UTC
30 points
10 comments1 min readLW link
(aiimpacts.org)

AI Safety Pr­ereq­ui­sites Course: Re­vamp and New Lessons

philip_b3 Feb 2019 21:04 UTC
24 points
5 comments1 min readLW link

An an­gle of at­tack on Open Prob­lem #1

Benya18 Aug 2012 12:08 UTC
47 points
85 comments7 min readLW link

Eval­u­at­ing the fea­si­bil­ity of SI’s plan

JoshuaFox10 Jan 2013 8:17 UTC
38 points
188 comments4 min readLW link

Only hu­mans can have hu­man values

PhilGoetz26 Apr 2010 18:57 UTC
51 points
161 comments17 min readLW link

Mas­ter­ing Chess and Shogi by Self-Play with a Gen­eral Re­in­force­ment Learn­ing Algorithm

DragonGod6 Dec 2017 6:01 UTC
13 points
4 comments1 min readLW link
(arxiv.org)

Cake, or death!

Stuart_Armstrong25 Oct 2012 10:33 UTC
46 points
13 comments4 min readLW link

Self-reg­u­la­tion of safety in AI research

Gordon Seidoh Worley25 Feb 2018 23:17 UTC
12 points
6 comments2 min readLW link

How safe “safe” AI de­vel­op­ment?

Gordon Seidoh Worley28 Feb 2018 23:21 UTC
9 points
1 comment1 min readLW link

Stan­ford In­tro to AI course to be taught for free online

Psy-Kosh30 Jul 2011 16:22 UTC
38 points
39 comments1 min readLW link

Bayesian Utility: Rep­re­sent­ing Prefer­ence by Prob­a­bil­ity Measures

Vladimir_Nesov27 Jul 2009 14:28 UTC
45 points
37 comments2 min readLW link

Gains from trade: Slug ver­sus Galaxy—how much would I give up to con­trol you?

Stuart_Armstrong23 Jul 2013 19:06 UTC
55 points
67 comments7 min readLW link

Defeat­ing Mun­dane Holo­causts With Robots

lsparrish30 May 2011 22:34 UTC
34 points
28 comments2 min readLW link

As­sum­ing we’ve solved X, could we do Y...

Stuart_Armstrong11 Dec 2018 18:13 UTC
31 points
16 comments2 min readLW link

The Stamp Collector

So8res1 May 2015 23:11 UTC
45 points
14 comments6 min readLW link

Sav­ing the world in 80 days: Prologue

Logan Riggs9 May 2018 21:16 UTC
12 points
16 comments2 min readLW link

Pro­ject Pro­posal: Con­sid­er­a­tions for trad­ing off ca­pa­bil­ities and safety im­pacts of AI research

David Scott Krueger (formerly: capybaralet)6 Aug 2019 22:22 UTC
25 points
11 comments2 min readLW link

AI Safety Pr­ereq­ui­sites Course: Ba­sic ab­stract rep­re­sen­ta­tions of computation

RAISE13 Mar 2019 19:38 UTC
28 points
2 comments1 min readLW link

What I Think, If Not Why

Eliezer Yudkowsky11 Dec 2008 17:41 UTC
41 points
103 comments4 min readLW link

RFC: Philo­soph­i­cal Con­ser­vatism in AI Align­ment Research

Gordon Seidoh Worley15 May 2018 3:29 UTC
17 points
13 comments1 min readLW link

Pre­dicted AI al­ign­ment event/​meet­ing calendar

rmoehn14 Aug 2019 7:14 UTC
29 points
14 comments1 min readLW link

Sim­plified prefer­ences needed; sim­plified prefer­ences sufficient

Stuart_Armstrong5 Mar 2019 19:39 UTC
29 points
6 comments3 min readLW link

Re­ward func­tion learn­ing: the value function

Stuart_Armstrong24 Apr 2018 16:29 UTC
9 points
0 comments11 min readLW link

Re­ward func­tion learn­ing: the learn­ing process

Stuart_Armstrong24 Apr 2018 12:56 UTC
6 points
11 comments8 min readLW link

Utility ver­sus Re­ward func­tion: par­tial equivalence

Stuart_Armstrong13 Apr 2018 14:58 UTC
17 points
5 comments5 min readLW link

Full toy model for prefer­ence learning

Stuart_Armstrong16 Oct 2019 11:06 UTC
20 points
2 comments12 min readLW link

New(ish) AI con­trol ideas

Stuart_Armstrong31 Oct 2017 12:52 UTC
0 points
0 comments4 min readLW link

Rig­ging is a form of wireheading

Stuart_Armstrong3 May 2018 12:50 UTC
11 points
2 comments1 min readLW link

The re­ward en­g­ineer­ing prob­lem

paulfchristiano16 Jan 2019 18:47 UTC
26 points
3 comments7 min readLW link

AI co­op­er­a­tion in practice

cousin_it30 Jul 2010 16:21 UTC
37 points
166 comments1 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
37 comments1 min readLW link

Con­trol­ling Con­stant Programs

Vladimir_Nesov5 Sep 2010 13:45 UTC
34 points
33 comments5 min readLW link

Recom­mended Read­ing for Friendly AI Research

Vladimir_Nesov9 Oct 2010 13:46 UTC
36 points
30 comments2 min readLW link

Autism, Wat­son, the Tur­ing test, and Gen­eral Intelligence

Stuart_Armstrong24 Sep 2013 11:00 UTC
11 points
22 comments1 min readLW link

Pes­simism About Un­known Un­knowns In­spires Conservatism

michaelcohen3 Feb 2020 14:48 UTC
31 points
2 comments5 min readLW link

The Na­tional Se­cu­rity Com­mis­sion on Ar­tifi­cial In­tel­li­gence Wants You (to sub­mit es­says and ar­ti­cles on the fu­ture of gov­ern­ment AI policy)

quanticle18 Jul 2019 17:21 UTC
30 points
0 comments1 min readLW link
(warontherocks.com)

Sys­tems Eng­ineer­ing and the META Program

ryan_b20 Dec 2018 20:19 UTC
30 points
3 comments1 min readLW link

Hu­man er­rors, hu­man values

PhilGoetz9 Apr 2011 2:50 UTC
45 points
138 comments1 min readLW link

ISO: Name of Problem

johnswentworth24 Jul 2018 17:15 UTC
28 points
15 comments1 min readLW link

Muehlhauser-Go­ertzel Dialogue, Part 1

lukeprog16 Mar 2012 17:12 UTC
42 points
161 comments33 min readLW link

Speci­fi­ca­tion gam­ing ex­am­ples in AI

Samuel Rødal10 Nov 2018 12:00 UTC
24 points
6 comments1 min readLW link
(docs.google.com)

Su­per­in­tel­li­gence Read­ing Group—Sec­tion 1: Past Devel­op­ments and Pre­sent Capabilities

KatjaGrace16 Sep 2014 1:00 UTC
43 points
233 comments7 min readLW link

[Question] What are the differ­ences be­tween all the iter­a­tive/​re­cur­sive ap­proaches to AI al­ign­ment?

riceissa21 Sep 2019 2:09 UTC
30 points
14 comments2 min readLW link

Al­gorith­mic Similarity

LukasM23 Aug 2019 16:39 UTC
27 points
10 comments11 min readLW link

Direc­tions and desider­ata for AI alignment

paulfchristiano13 Jan 2019 7:47 UTC
47 points
1 comment14 min readLW link

Friendly AI Re­search and Taskification

multifoliaterose14 Dec 2010 6:30 UTC
30 points
47 comments5 min readLW link

Against easy su­per­in­tel­li­gence: the un­fore­seen fric­tion argument

Stuart_Armstrong10 Jul 2013 13:47 UTC
39 points
48 comments5 min readLW link

[Question] Why are the peo­ple who could be do­ing safety re­search, but aren’t, do­ing some­thing else?

Adam Scholl29 Aug 2019 8:51 UTC
27 points
19 comments1 min readLW link

TV’s “Ele­men­tary” Tack­les Friendly AI and X-Risk—“Bella” (Pos­si­ble Spoilers)

pjeby22 Nov 2014 19:51 UTC
48 points
18 comments2 min readLW link

Univer­sal­ity Unwrapped

adamShimi21 Aug 2020 18:53 UTC
28 points
2 comments18 min readLW link

AI Risk and Op­por­tu­nity: Hu­man­ity’s Efforts So Far

lukeprog21 Mar 2012 2:49 UTC
53 points
49 comments23 min readLW link

Learn­ing with catastrophes

paulfchristiano23 Jan 2019 3:01 UTC
27 points
9 comments4 min readLW link

[Question] De­gree of du­pli­ca­tion and co­or­di­na­tion in pro­jects that ex­am­ine com­put­ing prices, AI progress, and re­lated top­ics?

riceissa23 Apr 2019 12:27 UTC
26 points
1 comment2 min readLW link

Solv­ing the AI Race Finalists

Gordon Seidoh Worley19 Jul 2018 21:04 UTC
24 points
0 comments1 min readLW link
(medium.com)

An Agent is a Wor­ldline in Teg­mark V

komponisto12 Jul 2018 5:12 UTC
24 points
12 comments2 min readLW link

Towards for­mal­iz­ing universality

paulfchristiano13 Jan 2019 20:39 UTC
27 points
19 comments18 min readLW link

Con­cep­tual Anal­y­sis for AI Align­ment

David Scott Krueger (formerly: capybaralet)30 Dec 2018 0:46 UTC
26 points
3 comments2 min readLW link

Gw­ern’s “Why Tool AIs Want to Be Agent AIs: The Power of Agency”

habryka5 May 2019 5:11 UTC
26 points
3 comments1 min readLW link
(www.gwern.net)

[Question] Why not tool AI?

smithee19 Jan 2019 22:18 UTC
19 points
10 comments1 min readLW link

Su­per­in­tel­li­gence 16: Tool AIs

KatjaGrace30 Dec 2014 2:00 UTC
12 points
37 comments7 min readLW link

Think­ing of tool AIs

Michele Campolo20 Nov 2019 21:47 UTC
6 points
2 comments4 min readLW link

Re­ply to Holden on ‘Tool AI’

Eliezer Yudkowsky12 Jun 2012 18:00 UTC
152 points
357 comments17 min readLW link

Re­ply to Holden on The Sin­gu­lar­ity Institute

lukeprog10 Jul 2012 23:20 UTC
69 points
215 comments26 min readLW link

Levels of AI Self-Im­prove­ment

avturchin29 Apr 2018 11:45 UTC
11 points
0 comments39 min readLW link

AI: re­quire­ments for per­ni­cious policies

Stuart_Armstrong17 Jul 2015 14:18 UTC
11 points
3 comments3 min readLW link

Tools want to be­come agents

Stuart_Armstrong4 Jul 2014 10:12 UTC
24 points
81 comments1 min readLW link

Su­per­in­tel­li­gence read­ing group

KatjaGrace31 Aug 2014 14:59 UTC
31 points
2 comments2 min readLW link

Su­per­in­tel­li­gence Read­ing Group 2: Fore­cast­ing AI

KatjaGrace23 Sep 2014 1:00 UTC
17 points
109 comments11 min readLW link

Su­per­in­tel­li­gence Read­ing Group 3: AI and Uploads

KatjaGrace30 Sep 2014 1:00 UTC
17 points
139 comments6 min readLW link

SRG 4: Biolog­i­cal Cog­ni­tion, BCIs, Organizations

KatjaGrace7 Oct 2014 1:00 UTC
14 points
139 comments5 min readLW link

Su­per­in­tel­li­gence 5: Forms of Superintelligence

KatjaGrace14 Oct 2014 1:00 UTC
22 points
114 comments5 min readLW link

Su­per­in­tel­li­gence 6: In­tel­li­gence ex­plo­sion kinetics

KatjaGrace21 Oct 2014 1:00 UTC
15 points
68 comments8 min readLW link

Su­per­in­tel­li­gence 7: De­ci­sive strate­gic advantage

KatjaGrace28 Oct 2014 1:01 UTC
18 points
60 comments6 min readLW link

Su­per­in­tel­li­gence 8: Cog­ni­tive superpowers

KatjaGrace4 Nov 2014 2:01 UTC
14 points
96 comments6 min readLW link

Su­per­in­tel­li­gence 9: The or­thog­o­nal­ity of in­tel­li­gence and goals

KatjaGrace11 Nov 2014 2:00 UTC
13 points
80 comments7 min readLW link

Su­per­in­tel­li­gence 10: In­stru­men­tally con­ver­gent goals

KatjaGrace18 Nov 2014 2:00 UTC
13 points
33 comments5 min readLW link

Su­per­in­tel­li­gence 11: The treach­er­ous turn

KatjaGrace25 Nov 2014 2:00 UTC
16 points
50 comments6 min readLW link

Su­per­in­tel­li­gence 12: Mal­ig­nant failure modes

KatjaGrace2 Dec 2014 2:02 UTC
15 points
51 comments5 min readLW link

Su­per­in­tel­li­gence 13: Ca­pa­bil­ity con­trol methods

KatjaGrace9 Dec 2014 2:00 UTC
14 points
48 comments6 min readLW link

Su­per­in­tel­li­gence 14: Mo­ti­va­tion se­lec­tion methods

KatjaGrace16 Dec 2014 2:00 UTC
9 points
28 comments5 min readLW link

Su­per­in­tel­li­gence 15: Or­a­cles, ge­nies and sovereigns

KatjaGrace23 Dec 2014 2:01 UTC
11 points
30 comments7 min readLW link

Su­per­in­tel­li­gence 17: Mul­tipo­lar scenarios

KatjaGrace6 Jan 2015 6:44 UTC
9 points
38 comments6 min readLW link

Su­per­in­tel­li­gence 18: Life in an al­gorith­mic economy

KatjaGrace13 Jan 2015 2:00 UTC
10 points
52 comments6 min readLW link

Su­per­in­tel­li­gence 19: Post-tran­si­tion for­ma­tion of a singleton

KatjaGrace20 Jan 2015 2:00 UTC
12 points
35 comments7 min readLW link

Su­per­in­tel­li­gence 20: The value-load­ing problem

KatjaGrace27 Jan 2015 2:00 UTC
8 points
21 comments6 min readLW link

Su­per­in­tel­li­gence 21: Value learning

KatjaGrace3 Feb 2015 2:01 UTC
12 points
33 comments4 min readLW link

Su­per­in­tel­li­gence 22: Emu­la­tion mod­u­la­tion and in­sti­tu­tional design

KatjaGrace10 Feb 2015 2:06 UTC
13 points
11 comments6 min readLW link

Su­per­in­tel­li­gence 23: Co­her­ent ex­trap­o­lated volition

KatjaGrace17 Feb 2015 2:00 UTC
11 points
97 comments7 min readLW link

Su­per­in­tel­li­gence 24: Mo­ral­ity mod­els and “do what I mean”

KatjaGrace24 Feb 2015 2:00 UTC
13 points
47 comments6 min readLW link

Ob­jec­tions to Co­her­ent Ex­trap­o­lated Volition

XiXiDu22 Nov 2011 10:32 UTC
12 points
56 comments3 min readLW link

CEV: co­her­ence ver­sus extrapolation

Stuart_Armstrong22 Sep 2014 11:24 UTC
21 points
17 comments2 min readLW link

What if AI doesn’t quite go FOOM?

Mass_Driver20 Jun 2010 0:03 UTC
16 points
191 comments5 min readLW link

Su­per­in­tel­li­gence 25: Com­po­nents list for ac­quiring values

KatjaGrace3 Mar 2015 2:01 UTC
11 points
12 comments8 min readLW link

Su­per­in­tel­li­gence 26: Science and tech­nol­ogy strategy

KatjaGrace10 Mar 2015 1:43 UTC
14 points
21 comments6 min readLW link

Su­per­in­tel­li­gence 27: Path­ways and enablers

KatjaGrace17 Mar 2015 1:00 UTC
15 points
21 comments8 min readLW link

Su­per­in­tel­li­gence 28: Collaboration

KatjaGrace24 Mar 2015 1:29 UTC
13 points
21 comments6 min readLW link

Su­per­in­tel­li­gence 29: Crunch time

KatjaGrace31 Mar 2015 4:24 UTC
14 points
27 comments6 min readLW link

Univer­sal agents and util­ity functions

Anja14 Nov 2012 4:05 UTC
43 points
38 comments6 min readLW link

Look­ing for re­mote writ­ing part­ners (for AI al­ign­ment re­search)

rmoehn1 Oct 2019 2:16 UTC
23 points
4 comments2 min readLW link

Self-Su­per­vised Learn­ing and AGI Safety

Steven Byrnes7 Aug 2019 14:21 UTC
29 points
9 comments12 min readLW link

Which of these five AI al­ign­ment re­search pro­jects ideas are no good?

rmoehn8 Aug 2019 7:17 UTC
25 points
13 comments1 min readLW link

Un­der­stand­ing understanding

mthq23 Aug 2019 18:10 UTC
24 points
1 comment2 min readLW link

Eval­u­at­ing Ex­ist­ing Ap­proaches to AGI Alignment

Gordon Seidoh Worley27 Mar 2018 19:57 UTC
12 points
0 comments4 min readLW link
(mapandterritory.org)

CEV: a util­i­tar­ian critique

Pablo26 Jan 2013 16:12 UTC
32 points
94 comments5 min readLW link

Vingean Reflec­tion: Reli­able Rea­son­ing for Self-Im­prov­ing Agents

So8res15 Jan 2015 22:47 UTC
37 points
5 comments9 min readLW link

Slide deck: In­tro­duc­tion to AI Safety

Aryeh Englander29 Jan 2020 15:57 UTC
22 points
0 comments1 min readLW link
(drive.google.com)

The Self-Unaware AI Oracle

Steven Byrnes22 Jul 2019 19:04 UTC
21 points
38 comments8 min readLW link

May Gw­ern.net newslet­ter (w/​GPT-3 com­men­tary)

gwern2 Jun 2020 15:40 UTC
32 points
7 comments1 min readLW link
(www.gwern.net)

Build a Causal De­ci­sion Theorist

michaelcohen23 Sep 2019 20:43 UTC
1 point
14 comments4 min readLW link

A trick for Safer GPT-N

Razied23 Aug 2020 0:39 UTC
7 points
1 comment2 min readLW link

In­tro­duc­tion To The In­fra-Bayesi­anism Sequence

26 Aug 2020 20:31 UTC
104 points
64 comments14 min readLW link2 reviews

Model splin­ter­ing: mov­ing from one im­perfect model to another

Stuart_Armstrong27 Aug 2020 11:53 UTC
74 points
10 comments33 min readLW link

Al­gorith­mic Progress in Six Domains

lukeprog3 Aug 2013 2:29 UTC
38 points
32 comments1 min readLW link

[Question] What are some good ex­am­ples of in­cor­rigi­bil­ity?

RyanCarey28 Apr 2019 0:22 UTC
23 points
17 comments1 min readLW link

Safely and use­fully spec­tat­ing on AIs op­ti­miz­ing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC
24 points
16 comments2 min readLW link

Up­dates and ad­di­tions to “Embed­ded Agency”

29 Aug 2020 4:22 UTC
73 points
1 comment3 min readLW link

[LINK] Ter­ror­ists tar­get AI researchers

RobertLumley15 Sep 2011 14:22 UTC
32 points
35 comments1 min readLW link

Analysing: Danger­ous mes­sages from fu­ture UFAI via Oracles

Stuart_Armstrong22 Nov 2019 14:17 UTC
22 points
16 comments4 min readLW link

Ex­plor­ing Botworld

So8res30 Apr 2014 22:29 UTC
34 points
2 comments6 min readLW link

in­ter­pret­ing GPT: the logit lens

nostalgebraist31 Aug 2020 2:47 UTC
158 points
32 comments11 min readLW link

From GPT to AGI

ChristianKl31 Aug 2020 13:28 UTC
6 points
7 comments1 min readLW link

Log­i­cal or Con­nec­tion­ist AI?

Eliezer Yudkowsky17 Nov 2008 8:03 UTC
39 points
26 comments9 min readLW link

Ar­tifi­cial In­tel­li­gence and Life Sciences (Why Big Data is not enough to cap­ture biolog­i­cal sys­tems?)

HansNauj15 Jan 2020 1:59 UTC
6 points
3 comments6 min readLW link

The Case against Killer Robots (link)

D_Alex20 Nov 2012 7:47 UTC
12 points
25 comments1 min readLW link

Near-Term Risk: Killer Robots a Threat to Free­dom and Democracy

Epiphany14 Jun 2013 6:28 UTC
15 points
105 comments2 min readLW link

Muehlhauser-Wang Dialogue

lukeprog22 Apr 2012 22:40 UTC
34 points
288 comments12 min readLW link

Google may be try­ing to take over the world

[deleted]27 Jan 2014 9:33 UTC
33 points
133 comments1 min readLW link

Gw­ern about cen­taurs: there is no chance that any use­ful man+ma­chine com­bi­na­tion will work to­gether for more than 10 years, as hu­mans soon will be only a liability

avturchin15 Dec 2018 21:32 UTC
31 points
4 comments1 min readLW link
(www.reddit.com)

Q&A with Abram Dem­ski on risks from AI

XiXiDu17 Jan 2012 9:43 UTC
33 points
71 comments9 min readLW link

Q&A with ex­perts on risks from AI #2

XiXiDu9 Jan 2012 19:40 UTC
22 points
29 comments7 min readLW link

Let the AI teach you how to flirt

DirectedEvolution17 Sep 2020 19:04 UTC
47 points
11 comments2 min readLW link

On­line AI Safety Dis­cus­sion Day

Linda Linsefors8 Oct 2020 12:11 UTC
5 points
0 comments1 min readLW link

New(ish) AI con­trol ideas

Stuart_Armstrong5 Mar 2015 17:03 UTC
34 points
14 comments3 min readLW link

Not Tak­ing Over the World

Eliezer Yudkowsky15 Dec 2008 22:18 UTC
35 points
97 comments4 min readLW link

Nat­u­ral­is­tic trust among AIs: The parable of the the­sis ad­vi­sor’s theorem

Benya15 Dec 2013 8:32 UTC
36 points
20 comments6 min readLW link

The Solomonoff Prior is Malign

Mark Xu14 Oct 2020 1:33 UTC
148 points
52 comments16 min readLW link3 reviews

Twenty-three AI al­ign­ment re­search pro­ject definitions

rmoehn3 Feb 2020 22:21 UTC
23 points
0 comments6 min readLW link

When Good­hart­ing is op­ti­mal: lin­ear vs diminish­ing re­turns, un­likely vs likely, and other factors

Stuart_Armstrong19 Dec 2019 13:55 UTC
24 points
18 comments7 min readLW link

[Question] As a Washed Up Former Data Scien­tist and Ma­chine Learn­ing Re­searcher What Direc­tion Should I Go In Now?

Darklight19 Oct 2020 20:13 UTC
13 points
7 comments3 min readLW link

Ar­tifi­cial Mys­te­ri­ous Intelligence

Eliezer Yudkowsky7 Dec 2008 20:05 UTC
29 points
24 comments5 min readLW link

A Pre­ma­ture Word on AI

Eliezer Yudkowsky31 May 2008 17:48 UTC
26 points
69 comments8 min readLW link

Let’s reim­ple­ment EURISKO!

cousin_it11 Jun 2009 16:28 UTC
23 points
162 comments1 min readLW link

Cor­rigi­bil­ity thoughts III: ma­nipu­lat­ing ver­sus deceiving

Stuart_Armstrong18 Jan 2017 15:57 UTC
3 points
0 comments1 min readLW link

[Question] [Meta] Do you want AIS We­bi­nars?

Linda Linsefors21 Mar 2020 16:01 UTC
18 points
7 comments1 min readLW link

New ar­ti­cle from Oren Etzioni

Aryeh Englander25 Feb 2020 15:25 UTC
19 points
19 comments2 min readLW link

Sin­gle­tons Rule OK

Eliezer Yudkowsky30 Nov 2008 16:45 UTC
20 points
47 comments5 min readLW link

“On the Im­pos­si­bil­ity of Su­per­sized Machines”

crmflynn31 Mar 2017 23:32 UTC
24 points
4 comments1 min readLW link
(philpapers.org)

Non­sen­tient Optimizers

Eliezer Yudkowsky27 Dec 2008 2:32 UTC
34 points
48 comments6 min readLW link

Build­ing Some­thing Smarter

Eliezer Yudkowsky2 Nov 2008 17:00 UTC
22 points
57 comments4 min readLW link

Let’s Read: an es­say on AI Theology

Yuxi_Liu4 Jul 2019 7:50 UTC
22 points
9 comments7 min readLW link

Wanted: Python open source volunteers

Eliezer Yudkowsky11 Mar 2009 4:59 UTC
16 points
13 comments1 min readLW link

Equil­ibrium and prior se­lec­tion prob­lems in mul­ti­po­lar deployment

JesseClifton2 Apr 2020 20:06 UTC
20 points
11 comments11 min readLW link

[Question] The Si­mu­la­tion Epiphany Problem

Koen.Holtman31 Oct 2019 22:12 UTC
15 points
13 comments4 min readLW link

Chang­ing ac­cepted pub­lic opinion and Skynet

Roko22 May 2009 11:05 UTC
17 points
71 comments2 min readLW link

In­tro­duc­ing CADIE

MBlume1 Apr 2009 7:32 UTC
0 points
8 comments1 min readLW link

Deep­mind Plans for Rat-Level AI

moridinamael18 Aug 2016 16:26 UTC
34 points
9 comments1 min readLW link

“Robot sci­en­tists can think for them­selves”

CronoDAS2 Apr 2009 21:16 UTC
−1 points
11 comments1 min readLW link

Au­tomat­ing rea­son­ing about the fu­ture at Ought

jungofthewon9 Nov 2020 21:51 UTC
17 points
0 comments1 min readLW link
(ought.org)

Neu­ral pro­gram syn­the­sis is a dan­ger­ous technology

syllogism12 Jan 2018 16:19 UTC
10 points
6 comments2 min readLW link

New, Brief Pop­u­lar-Level In­tro­duc­tion to AI Risks and Superintelligence

LyleN23 Jan 2015 15:43 UTC
33 points
3 comments1 min readLW link

In the be­gin­ning, Dart­mouth cre­ated the AI and the hype

Stuart_Armstrong24 Jan 2013 16:49 UTC
33 points
22 comments1 min readLW link

Fun­da­men­tal Philo­soph­i­cal Prob­lems In­her­ent in AI discourse

AlexSadler16 Sep 2018 21:03 UTC
23 points
1 comment17 min readLW link

Re­search Pri­ori­ties for Ar­tifi­cial In­tel­li­gence: An Open Letter

jimrandomh11 Jan 2015 19:52 UTC
38 points
11 comments1 min readLW link

[Question] How can I help re­search Friendly AI?

avichapman9 Jul 2019 0:15 UTC
22 points
3 comments1 min readLW link

FAI Re­search Con­straints and AGI Side Effects

JustinShovelain3 Jun 2015 19:25 UTC
26 points
59 comments7 min readLW link

[Question] How to deal with a mis­lead­ing con­fer­ence talk about AI risk?

rmoehn27 Jun 2019 21:04 UTC
21 points
13 comments4 min readLW link

Im­pli­ca­tions of Quan­tum Com­put­ing for Ar­tifi­cial In­tel­li­gence Align­ment Research

22 Aug 2019 10:33 UTC
24 points
3 comments13 min readLW link

[Question] How can labour pro­duc­tivity growth be an in­di­ca­tor of au­toma­tion?

Polytopos16 Nov 2020 21:16 UTC
2 points
5 comments1 min readLW link

[Question] Should I do it?

MrLight19 Nov 2020 1:08 UTC
−3 points
16 comments2 min readLW link

My in­tel­lec­tual influences

Richard_Ngo22 Nov 2020 18:00 UTC
92 points
1 comment5 min readLW link
(thinkingcomplete.blogspot.com)

Del­e­gated agents in prac­tice: How com­pa­nies might end up sel­l­ing AI ser­vices that act on be­half of con­sumers and coal­i­tions, and what this im­plies for safety research

Remmelt26 Nov 2020 11:17 UTC
7 points
5 comments4 min readLW link

SETI Predictions

hippke30 Nov 2020 20:09 UTC
23 points
8 comments1 min readLW link

What hap­pens when your be­liefs fully propagate

Alexei14 Feb 2012 7:53 UTC
29 points
79 comments7 min readLW link

In­ter­ac­tive ex­plo­ra­tion of LessWrong and other large col­lec­tions of documents

20 Dec 2020 19:06 UTC
49 points
9 comments10 min readLW link

[Question] Will AGI have “hu­man” flaws?

Agustinus Theodorus23 Dec 2020 3:43 UTC
1 point
2 comments1 min readLW link

Op­ti­mum num­ber of sin­gle points of failure

Douglas_Reay14 Mar 2018 13:30 UTC
7 points
4 comments4 min readLW link

Don’t put all your eggs in one basket

Douglas_Reay15 Mar 2018 8:07 UTC
5 points
0 comments7 min readLW link

Defect or Cooperate

Douglas_Reay16 Mar 2018 14:12 UTC
4 points
5 comments6 min readLW link

En­vi­ron­ments for kil­ling AIs

Douglas_Reay17 Mar 2018 15:23 UTC
3 points
1 comment9 min readLW link

The ad­van­tage of not be­ing open-ended

Douglas_Reay18 Mar 2018 13:50 UTC
7 points
2 comments6 min readLW link

Metamorphosis

Douglas_Reay12 Apr 2018 21:53 UTC
2 points
0 comments4 min readLW link

Believ­able Promises

Douglas_Reay16 Apr 2018 16:17 UTC
5 points
0 comments5 min readLW link

Trust­wor­thy Computing

Douglas_Reay10 Apr 2018 7:55 UTC
9 points
1 comment6 min readLW link

Edge of the Cliff

akaTrickster5 Jan 2021 17:21 UTC
1 point
0 comments5 min readLW link

[Question] How is re­in­force­ment learn­ing pos­si­ble in non-sen­tient agents?

SomeoneKind5 Jan 2021 20:57 UTC
3 points
5 comments1 min readLW link

AI Align­ment Us­ing Re­v­erse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC
1 point
0 comments1 min readLW link

A toy model of the con­trol problem

Stuart_Armstrong16 Sep 2015 14:59 UTC
36 points
24 comments3 min readLW link

On the na­ture of pur­pose

Nora_Ammann22 Jan 2021 8:30 UTC
28 points
15 comments9 min readLW link

Learn­ing Nor­ma­tivity: Language

Bunthut5 Feb 2021 22:26 UTC
14 points
4 comments8 min readLW link

Singularity & phase transition-2. A priori probability and ways to check.

Valentin20268 Feb 2021 2:21 UTC
1 point
0 comments3 min readLW link

Non­per­son Predicates

Eliezer Yudkowsky27 Dec 2008 1:47 UTC
52 points
176 comments6 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

2021-03-01 Na­tional Library of Medicine Pre­sen­ta­tion: “At­las of AI: Map­ping the so­cial and eco­nomic forces be­hind AI”

IrenicTruth17 Feb 2021 18:23 UTC
1 point
0 comments2 min readLW link

Chaotic era: avoid or sur­vive?

Valentin202622 Feb 2021 1:34 UTC
3 points
3 comments2 min readLW link

Suffer­ing-Fo­cused Ethics in the In­finite Uni­verse. How can we re­deem our­selves if Mul­ti­verse Im­mor­tal­ity is real and sub­jec­tive death is im­pos­si­ble.

Szymon Kucharski24 Feb 2021 21:02 UTC
−3 points
4 comments70 min readLW link

AIDun­geon 3.1

Yair Halberstadt1 Mar 2021 5:56 UTC
2 points
0 comments2 min readLW link

Phys­i­cal­ism im­plies ex­pe­rience never dies. So what am I go­ing to ex­pe­rience af­ter it does?

Szymon Kucharski14 Mar 2021 14:45 UTC
−2 points
1 comment30 min readLW link

An Anthropic Argument for Post-singularity Antinatalism

monkaap16 Mar 2021 17:40 UTC
3 points
4 comments3 min readLW link

[Question] Is a Self-Iter­at­ing AGI Vuln­er­a­ble to Thomp­son-style Tro­jans?

sxae25 Mar 2021 14:46 UTC
15 points
7 comments3 min readLW link

AI or­a­cles on blockchain

Caravaggio6 Apr 2021 20:13 UTC
5 points
0 comments3 min readLW link

What if AGI is near?

Wulky Wilkinsen14 Apr 2021 0:05 UTC
11 points
5 comments1 min readLW link

Re­view of “Why AI is Harder Than We Think”

electroswing30 Apr 2021 18:14 UTC
40 points
10 comments8 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogao2 Jun 2021 21:32 UTC
79 points
11 comments17 min readLW link

[Question] Sup­pose $1 billion is given to AI Safety. How should it be spent?

hunterglenn15 May 2021 23:24 UTC
23 points
2 comments1 min readLW link

Con­trol­ling In­tel­li­gent Agents The Only Way We Know How: Ideal Bureau­cratic Struc­ture (IBS)

Justin Bullock24 May 2021 12:53 UTC
11 points
11 comments6 min readLW link

Cu­rated con­ver­sa­tions with brilli­ant rationalists

spencerg28 May 2021 14:23 UTC
153 points
18 comments6 min readLW link

Se­cu­rity Mind­set and Or­di­nary Paranoia

Eliezer Yudkowsky25 Nov 2017 17:53 UTC
98 points
24 comments29 min readLW link

The Anti-Carter Basilisk

Jon Gilbert26 May 2021 22:56 UTC
0 points
0 comments2 min readLW link

Pa­ram­e­ter counts in Ma­chine Learning

19 Jun 2021 16:04 UTC
47 points
16 comments7 min readLW link

Ir­ra­tional Modesty

Tomás B.20 Jun 2021 19:38 UTC
132 points
7 comments1 min readLW link

[Question] Thoughts on a “Se­quences In­spired” PhD Topic

goose00017 Jun 2021 20:36 UTC
7 points
2 comments2 min readLW link

Some al­ter­na­tives to “Friendly AI”

lukeprog15 Jun 2014 19:53 UTC
30 points
44 comments2 min readLW link

In­tel­li­gence with­out Consciousness

Andrew Vlahos7 Jul 2021 5:27 UTC
13 points
5 comments1 min readLW link

[Question] What would it look like if it looked like AGI was very near?

Tomás B.12 Jul 2021 15:22 UTC
52 points
25 comments1 min readLW link

Is the ar­gu­ment that AI is an xrisk valid?

MACannon19 Jul 2021 13:20 UTC
5 points
62 comments1 min readLW link
(onlinelibrary.wiley.com)

[Question] Jay­ne­sian in­ter­pre­ta­tion—How does “es­ti­mat­ing prob­a­bil­ities” make sense?

Haziq Muhammad21 Jul 2021 21:36 UTC
4 points
40 comments1 min readLW link

The biolog­i­cal in­tel­li­gence explosion

Rob Lucas25 Jul 2021 13:08 UTC
8 points
6 comments4 min readLW link

[Question] Do Bayesi­ans like Bayesian model Aver­ag­ing?

Haziq Muhammad2 Aug 2021 12:24 UTC
4 points
13 comments1 min readLW link

[Question] Ques­tion about Test-sets and Bayesian ma­chine learn­ing

Haziq Muhammad9 Aug 2021 17:16 UTC
2 points
8 comments1 min readLW link

[Question] Halpern’s pa­per—A re­fu­ta­tion of Cox’s the­o­rem?

Haziq Muhammad11 Aug 2021 9:25 UTC
11 points
7 comments1 min readLW link

New GPT-3 competitor

Quintin Pope12 Aug 2021 7:05 UTC
32 points
10 comments1 min readLW link

[Question] Jaynes-Cox Prob­a­bil­ity: Are plau­si­bil­ities ob­jec­tive?

Haziq Muhammad12 Aug 2021 14:23 UTC
9 points
17 comments1 min readLW link

A gen­tle apoc­a­lypse

pchvykov16 Aug 2021 5:03 UTC
3 points
5 comments3 min readLW link

[Question] Is it worth mak­ing a database for moral pre­dic­tions?

Jonas Hallgren16 Aug 2021 14:51 UTC
1 point
0 comments2 min readLW link

Cyn­i­cal ex­pla­na­tions of FAI crit­ics (in­clud­ing my­self)

Wei Dai13 Aug 2012 21:19 UTC
31 points
49 comments1 min readLW link

[Question] Has Van Horn fixed Cox’s the­o­rem?

Haziq Muhammad29 Aug 2021 18:36 UTC
9 points
1 comment1 min readLW link

The Gover­nance Prob­lem and the “Pretty Good” X-Risk

Zach Stein-Perlman29 Aug 2021 18:00 UTC
5 points
2 comments11 min readLW link

Limits of and to (ar­tifi­cial) Intelligence

MoritzG25 Aug 2019 22:16 UTC
1 point
3 comments7 min readLW link

Grokking the In­ten­tional Stance

jbkjr31 Aug 2021 15:49 UTC
41 points
20 comments20 min readLW link

In­tel­li­gence, Fast and Slow

Mateusz Mazurkiewicz1 Sep 2021 19:52 UTC
−3 points
2 comments2 min readLW link

[Question] Is LessWrong dead with­out Cox’s the­o­rem?

Haziq Muhammad4 Sep 2021 5:45 UTC
−2 points
88 comments1 min readLW link

Align­ment via man­u­ally im­ple­ment­ing the util­ity function

Chantiel7 Sep 2021 20:20 UTC
1 point
6 comments2 min readLW link

Pivot!

Carlos Ramirez12 Sep 2021 20:39 UTC
−19 points
5 comments1 min readLW link

The Me­taethics and Nor­ma­tive Ethics of AGI Value Align­ment: Many Ques­tions, Some Implications

Eleos Arete Citrini16 Sep 2021 16:13 UTC
6 points
0 comments8 min readLW link

Why will AI be dan­ger­ous?

Legionnaire4 Feb 2022 23:41 UTC
37 points
14 comments1 min readLW link

Oc­cam’s Ra­zor and the Univer­sal Prior

Peter Chatain3 Oct 2021 3:23 UTC
22 points
5 comments21 min readLW link

We’re Red­wood Re­search, we do ap­plied al­ign­ment re­search, AMA

Nate Thomas6 Oct 2021 5:51 UTC
56 points
3 comments2 min readLW link
(forum.effectivealtruism.org)

[LINK] Wait But Why—The AI Revolu­tion Part 2

Adam Zerner4 Feb 2015 16:02 UTC
27 points
88 comments1 min readLW link

Slate Star Codex Notes on the Asilo­mar Con­fer­ence on Benefi­cial AI

Gunnar_Zarncke7 Feb 2017 12:14 UTC
24 points
8 comments1 min readLW link
(slatestarcodex.com)

Three Ap­proaches to “Friendli­ness”

Wei Dai17 Jul 2013 7:46 UTC
32 points
86 comments3 min readLW link

P₂B: Plan to P₂B Better

24 Oct 2021 15:21 UTC
33 points
14 comments6 min readLW link

A Roadmap to a Post-Scarcity Economy

lorepieri30 Oct 2021 9:04 UTC
3 points
3 comments1 min readLW link

What is the link be­tween al­tru­ism and in­tel­li­gence?

Ruralvisitor833 Nov 2021 23:59 UTC
3 points
13 comments1 min readLW link

Model­ing the im­pact of safety agendas

Ben Cottier5 Nov 2021 19:46 UTC
51 points
6 comments10 min readLW link

[Question] Does any­one know what Marvin Min­sky is talk­ing about here?

delton13719 Nov 2021 0:56 UTC
1 point
6 comments3 min readLW link

In­te­grat­ing Three Models of (Hu­man) Cognition

jbkjr23 Nov 2021 1:06 UTC
29 points
4 comments32 min readLW link

[Question] I cur­rently trans­late AGI-re­lated texts to Rus­sian. Is that use­ful?

Tapatakt27 Nov 2021 17:51 UTC
29 points
7 comments1 min readLW link

Ques­tion/​Is­sue with the 5/​10 Problem

acgt29 Nov 2021 10:45 UTC
6 points
3 comments3 min readLW link

Can solip­sism be dis­proven?

nx20594 Dec 2021 8:24 UTC
−2 points
5 comments2 min readLW link

[Question] Misc. ques­tions about EfficientZero

Daniel Kokotajlo4 Dec 2021 19:45 UTC
51 points
17 comments1 min readLW link

Fram­ing ap­proaches to al­ign­ment and the hard prob­lem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC
8 points
15 comments27 min readLW link

HIRING: In­form and shape a new pro­ject on AI safety at Part­ner­ship on AI

madhu_lika7 Dec 2021 19:37 UTC
1 point
0 comments1 min readLW link

What role should evolu­tion­ary analo­gies play in un­der­stand­ing AI take­off speeds?

anson.ho11 Dec 2021 1:19 UTC
14 points
0 comments42 min readLW link

Mo­ti­va­tions, Nat­u­ral Selec­tion, and Cur­ricu­lum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
16 points
0 comments42 min readLW link

Emer­gent mod­u­lar­ity and safety

Richard_Ngo21 Oct 2021 1:54 UTC
31 points
15 comments3 min readLW link

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kitten16 Dec 2021 22:41 UTC
22 points
10 comments21 min readLW link

Univer­sal­ity and the “Filter”

maggiehayes16 Dec 2021 0:47 UTC
10 points
3 comments11 min readLW link

[Question] Can you prove that 0 = 1?

purplelight4 Feb 2022 21:31 UTC
−10 points
4 comments1 min readLW link

Ex­pec­ta­tions In­fluence Real­ity (and AI)

purplelight4 Feb 2022 21:31 UTC
0 points
3 comments7 min readLW link

[Question] What ques­tions do you have about do­ing work on AI safety?

peterbarnett21 Dec 2021 16:36 UTC
13 points
8 comments1 min readLW link

Re­views of “Is power-seek­ing AI an ex­is­ten­tial risk?”

Joe Carlsmith16 Dec 2021 20:48 UTC
76 points
20 comments1 min readLW link

Elic­it­ing La­tent Knowl­edge Via Hy­po­thet­i­cal Sensors

John_Maxwell30 Dec 2021 15:53 UTC
38 points
2 comments6 min readLW link

Lat­eral Think­ing (AI safety HPMOR fan­fic)

SlytherinsMonster2 Jan 2022 23:50 UTC
75 points
9 comments5 min readLW link

SONN: What's Next?

D𝜋9 Jan 2022 8:15 UTC
−17 points
3 comments1 min readLW link

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMC11 Jan 2022 11:28 UTC
19 points
6 comments8 min readLW link

Ac­tion: Help ex­pand fund­ing for AI Safety by co­or­di­nat­ing on NSF response

Evan R. Murphy19 Jan 2022 22:47 UTC
23 points
8 comments3 min readLW link

Emo­tions = Re­ward Functions

jpyykko20 Jan 2022 18:46 UTC
16 points
10 comments5 min readLW link

[Question] Is AI Align­ment a pseu­do­science?

mocny-chlapik23 Jan 2022 10:32 UTC
21 points
41 comments1 min readLW link

De­con­fus­ing Deception

J Bostock29 Jan 2022 16:43 UTC
26 points
6 comments2 min readLW link

Re­vis­it­ing Brave New World Re­vis­ited (Chap­ter 3)

Justin Bullock1 Feb 2022 17:17 UTC
5 points
0 comments10 min readLW link

[Question] Do mesa-op­ti­miza­tion prob­lems cor­re­late with low-slack?

sudo4 Feb 2022 21:11 UTC
1 point
1 comment1 min readLW link

Can the laws of physics/​na­ture pre­vent hell?

superads916 Feb 2022 20:39 UTC
−7 points
10 comments2 min readLW link

Ngo and Yud­kowsky on sci­en­tific rea­son­ing and pivotal acts

21 Feb 2022 20:54 UTC
51 points
13 comments35 min readLW link

Bet­ter a Brave New World than a dead one

Yitz25 Feb 2022 23:11 UTC
8 points
5 comments4 min readLW link

Be­ing an in­di­vi­d­ual al­ign­ment grantmaker

A_donor28 Feb 2022 20:02 UTC
64 points
5 comments2 min readLW link

How to de­velop safe superintelligence

martillopart1 Mar 2022 21:57 UTC
−5 points
3 comments13 min readLW link

Deep Dives: My Ad­vice for Pur­su­ing Work in Re­search

scasper11 Mar 2022 17:56 UTC
21 points
2 comments3 min readLW link

One pos­si­ble ap­proach to de­velop the best pos­si­ble gen­eral learn­ing algorithm

martillopart14 Mar 2022 19:24 UTC
3 points
0 comments7 min readLW link

[Question] Our time in his­tory as ev­i­dence for simu­la­tion the­ory?

Garrett Garzonie18 Mar 2022 3:35 UTC
3 points
2 comments1 min readLW link

The weak­est ar­gu­ments for and against hu­man level AI

Stuart_Armstrong15 Aug 2012 11:04 UTC
22 points
34 comments1 min readLW link

Chris­ti­ano and Yud­kowsky on AI pre­dic­tions and hu­man intelligence

Eliezer Yudkowsky23 Feb 2022 21:34 UTC
69 points
35 comments42 min readLW link

Even more cu­rated con­ver­sa­tions with brilli­ant rationalists

spencerg21 Mar 2022 23:49 UTC
57 points
0 comments15 min readLW link

Man­hat­tan pro­ject for al­igned AI

Chris van Merwijk27 Mar 2022 11:41 UTC
34 points
6 comments2 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWang29 Mar 2022 20:09 UTC
56 points
4 comments6 min readLW link

Meta wants to use AI to write Wikipe­dia ar­ti­cles; I am Ner­vous™

Yitz30 Mar 2022 19:05 UTC
14 points
12 comments1 min readLW link

[Question] If AGI were com­ing in a year, what should we do?

MichaelStJules1 Apr 2022 0:41 UTC
20 points
16 comments1 min readLW link

On Agent In­cen­tives to Ma­nipu­late Hu­man Feed­back in Multi-Agent Re­ward Learn­ing Scenarios

Francis Rhys Ward3 Apr 2022 18:20 UTC
27 points
11 comments8 min readLW link

[Question] How to write a LW se­quence to learn a topic?

PabloAMC3 Apr 2022 20:09 UTC
3 points
2 comments1 min readLW link

Save Hu­man­ity! Breed Sapi­ent Oc­to­puses!

Yair Halberstadt5 Apr 2022 18:39 UTC
54 points
17 comments1 min readLW link

What Should We Op­ti­mize—A Conversation

Johannes C. Mayer7 Apr 2022 3:47 UTC
1 point
0 comments14 min readLW link

The Ex­plana­tory Gap of AI

David Valdman7 Apr 2022 18:28 UTC
1 point
0 comments4 min readLW link

Progress re­port 3: clus­ter­ing trans­former neurons

Nathan Helm-Burger5 Apr 2022 23:13 UTC
5 points
0 comments2 min readLW link

God­shat­ter Ver­sus Leg­i­bil­ity: A Fun­da­men­tally Differ­ent Ap­proach To AI Alignment

LukeOnline9 Apr 2022 21:43 UTC
11 points
14 comments7 min readLW link

Is Fish­e­rian Ru­n­away Gra­di­ent Hack­ing?

Ryan Kidd10 Apr 2022 13:47 UTC
15 points
7 comments4 min readLW link

The Glitch And Notes On Digi­tal Beings

Ghvst11 Apr 2022 19:46 UTC
−4 points
0 comments2 min readLW link
(ghvsted.com)

Post-his­tory is writ­ten by the martyrs

Veedrac11 Apr 2022 15:45 UTC
37 points
2 comments19 min readLW link
(www.royalroad.com)

An AI-in-a-box suc­cess model

azsantosk11 Apr 2022 22:28 UTC
16 points
1 comment10 min readLW link

Ra­tion­al­ist Should Win. Not Dy­ing with Dig­nity and Fund­ing WBE.

CitizenTen12 Apr 2022 2:14 UTC
23 points
15 comments5 min readLW link

Re­ward model hack­ing as a challenge for re­ward learning

Erik Jenner12 Apr 2022 9:39 UTC
25 points
1 comment9 min readLW link

Is tech­ni­cal AI al­ign­ment re­search a net pos­i­tive?

cranberry_bear12 Apr 2022 13:07 UTC
4 points
2 comments2 min readLW link

Another list of the­o­ries of im­pact for interpretability

Beth Barnes13 Apr 2022 13:29 UTC
32 points
1 comment5 min readLW link

Some rea­sons why a pre­dic­tor wants to be a consequentialist

Lauro Langosco15 Apr 2022 15:02 UTC
23 points
16 comments5 min readLW link

Red­wood Re­search is hiring for sev­eral roles (Oper­a­tions and Tech­ni­cal)

14 Apr 2022 16:57 UTC
29 points
0 comments1 min readLW link

[Question] Con­vince me that hu­man­ity *isn’t* doomed by AGI

Yitz15 Apr 2022 17:26 UTC
60 points
53 comments1 min readLW link

Another ar­gu­ment that you will let the AI out of the box

Garrett Baker19 Apr 2022 21:54 UTC
8 points
16 comments2 min readLW link

For ev­ery choice of AGI difficulty, con­di­tion­ing on grad­ual take-off im­plies shorter timelines.

Francis Rhys Ward21 Apr 2022 7:44 UTC
29 points
13 comments3 min readLW link

Reflec­tions on My Own Miss­ing Mood

Lone Pine21 Apr 2022 16:19 UTC
51 points
25 comments5 min readLW link

Key ques­tions about ar­tifi­cial sen­tience: an opinionated guide

Robbo25 Apr 2022 12:09 UTC
45 points
31 comments18 min readLW link

[Question] What is be­ing im­proved in re­cur­sive self im­prove­ment?

Lone Pine25 Apr 2022 18:30 UTC
7 points
7 comments1 min readLW link

Why Copi­lot Ac­cel­er­ates Timelines

Michaël Trazzi26 Apr 2022 22:06 UTC
35 points
14 comments7 min readLW link

[Question] Is it de­sir­able for the first AGI to be con­scious?

Charbel-Raphaël1 May 2022 21:29 UTC
5 points
12 comments1 min readLW link

[Question] What Was Your Best /​ Most Suc­cess­ful DALL-E 2 Prompt?

Evidential4 May 2022 3:16 UTC
1 point
0 comments1 min readLW link

Ne­go­ti­at­ing Up and Down the Si­mu­la­tion Hier­ar­chy: Why We Might Sur­vive the Unal­igned Singularity

David Udell4 May 2022 4:21 UTC
24 points
16 comments2 min readLW link

High-stakes al­ign­ment via ad­ver­sar­ial train­ing [Red­wood Re­search re­port]

5 May 2022 0:59 UTC
136 points
29 comments9 min readLW link

Deriv­ing Con­di­tional Ex­pected Utility from Pareto-Effi­cient Decisions

Thomas Kwa5 May 2022 3:21 UTC
23 points
1 comment6 min readLW link

Tran­scripts of in­ter­views with AI researchers

Vael Gates9 May 2022 5:57 UTC
160 points
8 comments2 min readLW link

Agency As a Nat­u­ral Abstraction

Thane Ruthenis13 May 2022 18:02 UTC
55 points
9 comments13 min readLW link

Pre­dict­ing the Elec­tions with Deep Learn­ing—Part 1 - Results

Quentin Chenevier14 May 2022 12:54 UTC
0 points
0 comments1 min readLW link

On sav­ing one’s world

Rob Bensinger17 May 2022 19:53 UTC
190 points
5 comments1 min readLW link

In defence of flailing

acylhalide18 Jun 2022 5:26 UTC
10 points
14 comments4 min readLW link

Re­shap­ing the AI Industry

Thane Ruthenis29 May 2022 22:54 UTC
143 points
34 comments21 min readLW link

Science for the Pos­si­ble World

Zechen Zhang23 May 2022 14:01 UTC
7 points
0 comments3 min readLW link

Syn­thetic Me­dia and The Fu­ture of Film

ifalpha24 May 2022 5:54 UTC
35 points
13 comments8 min readLW link

Ex­plain­ing in­ner al­ign­ment to myself

Jeremy Gillen24 May 2022 23:10 UTC
9 points
2 comments10 min readLW link

A dis­cus­sion of the pa­per, “Large Lan­guage Models are Zero-Shot Rea­son­ers”

HiroSakuraba26 May 2022 15:55 UTC
7 points
0 comments4 min readLW link

On in­ner and outer al­ign­ment, and their confusion

Nina Panickssery26 May 2022 21:56 UTC
6 points
7 comments4 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

25 May 2022 9:23 UTC
90 points
15 comments12 min readLW link

Bits of Op­ti­miza­tion Can Only Be Lost Over A Distance

johnswentworth23 May 2022 18:55 UTC
26 points
15 comments2 min readLW link

Gra­da­tions of Agency

Daniel Kokotajlo23 May 2022 1:10 UTC
40 points
6 comments5 min readLW link

Utilitarianism

C S SRUTHI28 May 2022 19:35 UTC
0 points
1 comment1 min readLW link

Distil­led—AGI Safety from First Principles

Harrison G29 May 2022 0:57 UTC
8 points
1 comment14 min readLW link

Mul­ti­ple AIs in boxes, eval­u­at­ing each other’s alignment

Moebius31429 May 2022 8:36 UTC
7 points
0 comments14 min readLW link

The im­pact you might have work­ing on AI safety

Fabien Roger29 May 2022 16:31 UTC
5 points
1 comment4 min readLW link

My SERI MATS Application

Daniel Paleka30 May 2022 2:04 UTC
16 points
0 comments8 min readLW link

[Question] A ter­rify­ing var­i­ant of Boltz­mann’s brains problem

Zeruel01730 May 2022 20:08 UTC
5 points
12 comments4 min readLW link

The Re­v­erse Basilisk

Dunning K.30 May 2022 23:10 UTC
15 points
23 comments2 min readLW link

The Hard In­tel­li­gence Hy­poth­e­sis and Its Bear­ing on Suc­ces­sion In­duced Foom

DragonGod31 May 2022 19:04 UTC
10 points
7 comments4 min readLW link

Machines vs Memes Part 1: AI Align­ment and Memetics

Harriet Farlow31 May 2022 22:03 UTC
16 points
0 comments6 min readLW link

[Question] What will hap­pen when an all-reach­ing AGI starts at­tempt­ing to fix hu­man char­ac­ter flaws?

Michael Bright1 Jun 2022 18:45 UTC
1 point
6 comments1 min readLW link

New co­op­er­a­tion mechanism—quadratic fund­ing with­out a match­ing pool

Filip Sondej5 Jun 2022 13:55 UTC
11 points
0 comments5 min readLW link

Miriam Ye­vick on why both sym­bols and net­works are nec­es­sary for ar­tifi­cial minds

Bill Benzon6 Jun 2022 8:34 UTC
1 point
0 comments4 min readLW link

Six Di­men­sions of Oper­a­tional Ad­e­quacy in AGI Projects

Eliezer Yudkowsky30 May 2022 17:00 UTC
270 points
65 comments13 min readLW link

Grokking “Fore­cast­ing TAI with biolog­i­cal an­chors”

anson.ho6 Jun 2022 18:58 UTC
34 points
0 comments14 min readLW link

Who mod­els the mod­els that model mod­els? An ex­plo­ra­tion of GPT-3′s in-con­text model fit­ting ability

Lovre7 Jun 2022 19:37 UTC
112 points
14 comments9 min readLW link

Pitch­ing an Align­ment Softball

mu_(negative)7 Jun 2022 4:10 UTC
47 points
13 comments10 min readLW link

[Question] Con­fused Thoughts on AI After­life (se­ri­ously)

Epirito7 Jun 2022 14:37 UTC
−6 points
6 comments1 min readLW link

Trans­former Re­search Ques­tions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC
4 points
0 comments2 min readLW link

Elic­it­ing La­tent Knowl­edge (ELK) - Distil­la­tion/​Summary

Marius Hobbhahn8 Jun 2022 13:18 UTC
49 points
2 comments21 min readLW link

Towards Gears-Level Un­der­stand­ing of Agency

Thane Ruthenis16 Jun 2022 22:00 UTC
24 points
4 comments18 min readLW link

Vael Gates: Risks from Ad­vanced AI (June 2022)

Vael Gates14 Jun 2022 0:54 UTC
38 points
2 comments30 min readLW link

Ex­plor­ing Mild Be­havi­our in Embed­ded Agents

Megan Kinniment27 Jun 2022 18:56 UTC
21 points
3 comments18 min readLW link

Oper­a­tional­iz­ing two tasks in Gary Mar­cus’s AGI challenge

Bill Benzon9 Jun 2022 18:31 UTC
10 points
3 comments8 min readLW link

A plau­si­ble story about AI risk.

DeLesley Hutchins10 Jun 2022 2:08 UTC
14 points
1 comment4 min readLW link

I No Longer Believe In­tel­li­gence to be “Mag­i­cal”

DragonGod10 Jun 2022 8:58 UTC
31 points
34 comments6 min readLW link

[Question] Why don’t you in­tro­duce re­ally im­pres­sive peo­ple you per­son­ally know to AI al­ign­ment (more of­ten)?

Verden11 Jun 2022 15:59 UTC
33 points
15 comments1 min readLW link

Godzilla Strategies

johnswentworth11 Jun 2022 15:44 UTC
151 points
65 comments3 min readLW link

In­tu­itive Ex­pla­na­tion of AIXI

Thomas Larsen12 Jun 2022 21:41 UTC
13 points
0 comments5 min readLW link

Train­ing Trace Priors

Adam Jermyn13 Jun 2022 14:22 UTC
12 points
17 comments4 min readLW link

Why multi-agent safety is im­por­tant

Akbir Khan14 Jun 2022 9:23 UTC
8 points
2 comments10 min readLW link

Con­tra EY: Can AGI de­stroy us with­out trial & er­ror?

Nikita Sokolsky13 Jun 2022 18:26 UTC
124 points
76 comments15 min readLW link

A Modest Pivotal Act

anonymousaisafety13 Jun 2022 19:24 UTC
−15 points
1 comment5 min readLW link

OpenAI: GPT-based LLMs show abil­ity to dis­crim­i­nate be­tween its own wrong an­swers, but in­abil­ity to ex­plain how/​why it makes that dis­crim­i­na­tion, even as model scales

Aditya Jain13 Jun 2022 23:33 UTC
14 points
5 comments1 min readLW link
(openai.com)

Re­sources I send to AI re­searchers about AI safety

Vael Gates14 Jun 2022 2:24 UTC
62 points
12 comments10 min readLW link

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

14 Jun 2022 13:57 UTC
28 points
4 comments13 min readLW link

[Question] How Do You Quan­tify [Physics In­ter­fac­ing] Real World Ca­pa­bil­ities?

DragonGod14 Jun 2022 14:49 UTC
17 points
1 comment4 min readLW link

Cryp­to­graphic Life: How to tran­scend in a sub-light­speed world via Ho­mo­mor­phic encryption

Golol14 Jun 2022 19:22 UTC
1 point
0 comments3 min readLW link

Align­ment Risk Doesn’t Re­quire Superintelligence

JustisMills15 Jun 2022 3:12 UTC
35 points
4 comments2 min readLW link

Multi­gate Priors

Adam Jermyn15 Jun 2022 19:30 UTC
4 points
0 comments3 min readLW link

In­fo­haz­ards and in­fer­en­tial distances

acylhalide16 Jun 2022 7:59 UTC
8 points
0 comments6 min readLW link

Ap­ply to the Ma­chine Learn­ing For Good boot­camp in France

Alexandre Variengien17 Jun 2022 7:32 UTC
10 points
0 comments1 min readLW link

Adap­ta­tion Ex­ecu­tors and the Telos Margin

Plinthist20 Jun 2022 13:06 UTC
2 points
8 comments5 min readLW link

Causal con­fu­sion as an ar­gu­ment against the scal­ing hypothesis

20 Jun 2022 10:54 UTC
83 points
30 comments18 min readLW link

[Question] What is the most prob­a­ble AI?

Zeruel01720 Jun 2022 23:26 UTC
−2 points
0 comments3 min readLW link

Reflec­tion Mechanisms as an Align­ment tar­get: A survey

22 Jun 2022 15:05 UTC
28 points
1 comment14 min readLW link

The Limits of Automation

milkandcigarettes23 Jun 2022 18:03 UTC
5 points
1 comment5 min readLW link
(milkandcigarettes.com)

Con­ver­sa­tion with Eliezer: What do you want the sys­tem to do?

Akash25 Jun 2022 17:36 UTC
112 points
38 comments2 min readLW link

[Yann Le­cun] A Path Towards Au­tonomous Ma­chine In­tel­li­gence

DragonGod27 Jun 2022 19:24 UTC
38 points
12 comments1 min readLW link
(openreview.net)

Yann LeCun, A Path Towards Au­tonomous Ma­chine In­tel­li­gence [link]

Bill Benzon27 Jun 2022 23:29 UTC
5 points
1 comment1 min readLW link

Doom doubts—is in­ner al­ign­ment a likely prob­lem?

Crissman28 Jun 2022 12:42 UTC
6 points
7 comments1 min readLW link

What suc­cess looks like

28 Jun 2022 14:38 UTC
19 points
4 comments1 min readLW link
(forum.effectivealtruism.org)

La­tent Ad­ver­sar­ial Training

Adam Jermyn29 Jun 2022 20:04 UTC
24 points
9 comments5 min readLW link

He­donis­tic Iso­topes:

Trozxzr30 Jun 2022 16:49 UTC
1 point
0 comments1 min readLW link

[Question] What about tran­shu­mans and be­yond?

AlignmentMirror2 Jul 2022 13:58 UTC
7 points
6 comments1 min readLW link

New US Se­nate Bill on X-Risk Miti­ga­tion [Linkpost]

Evan R. Murphy4 Jul 2022 1:25 UTC
35 points
12 comments1 min readLW link
(www.hsgac.senate.gov)

When is it ap­pro­pri­ate to use statis­ti­cal mod­els and prob­a­bil­ities for de­ci­sion mak­ing ?

Younes Kamel5 Jul 2022 12:34 UTC
10 points
7 comments4 min readLW link
(youneskamel.substack.com)

How hu­man­ity would re­spond to slow take­off, with take­aways from the en­tire COVID-19 pan­demic

Noosphere896 Jul 2022 17:52 UTC
4 points
1 comment2 min readLW link

Four So­cietal In­ter­ven­tions to Im­prove our AGI Position

Rafael Cosman6 Jul 2022 18:32 UTC
−6 points
2 comments6 min readLW link
(rafaelcosman.com)

Deep neu­ral net­works are not opaque.

jem-mosig6 Jul 2022 18:03 UTC
22 points
14 comments3 min readLW link

Cooperation with and between AGI's

PeterMcCluskey7 Jul 2022 16:45 UTC
10 points
3 comments10 min readLW link
(www.bayesianinvestor.com)

Mak­ing it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad9 Jul 2022 14:42 UTC
14 points
5 comments22 min readLW link

Grouped Loss may dis­fa­vor dis­con­tin­u­ous capabilities

Adam Jermyn9 Jul 2022 17:22 UTC
14 points
2 comments4 min readLW link

We are now at the point of deep­fake job interviews

trevor10 Jul 2022 3:37 UTC
6 points
0 comments1 min readLW link
(www.businessinsider.com)

Ac­cept­abil­ity Ver­ifi­ca­tion: A Re­search Agenda

12 Jul 2022 20:11 UTC
43 points
0 comments1 min readLW link
(docs.google.com)

Find­ing Skele­tons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments7 min readLW link

A note about differ­en­tial tech­nolog­i­cal development

So8res15 Jul 2022 4:46 UTC
178 points
31 comments6 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC
18 points
0 comments37 min readLW link

AI Hiroshima (Does A Vivid Ex­am­ple Of Destruc­tion Fore­stall Apoca­lypse?)

Sable18 Jul 2022 12:06 UTC
4 points
4 comments2 min readLW link

Bounded com­plex­ity of solv­ing ELK and its implications

Rubi J. Hudson19 Jul 2022 6:56 UTC
10 points
4 comments18 min readLW link

Abram Dem­ski’s ELK thoughts and pro­posal—distillation

Rubi J. Hudson19 Jul 2022 6:57 UTC
15 points
4 comments16 min readLW link

Help ARC eval­u­ate ca­pa­bil­ities of cur­rent lan­guage mod­els (still need peo­ple)

Beth Barnes19 Jul 2022 4:55 UTC
94 points
6 comments2 min readLW link

A Cri­tique of AI Align­ment Pessimism

ExCeph19 Jul 2022 2:28 UTC
8 points
1 comment9 min readLW link

Model­ling Deception

Garrett Baker18 Jul 2022 21:21 UTC
15 points
0 comments7 min readLW link

En­light­en­ment Values in a Vuln­er­a­ble World

Maxwell Tabarrok20 Jul 2022 19:52 UTC
15 points
6 comments31 min readLW link
(maximumprogress.substack.com)

AI Safety Cheat­sheet /​ Quick Reference

Zohar Jackson20 Jul 2022 9:39 UTC
3 points
0 comments1 min readLW link
(github.com)

Coun­ter­ing ar­gu­ments against work­ing on AI safety

Rauno Arike20 Jul 2022 18:23 UTC
6 points
2 comments7 min readLW link

Why AGI Timeline Re­search/​Dis­course Might Be Overrated

Noosphere8920 Jul 2022 20:26 UTC
5 points
0 comments1 min readLW link
(forum.effectivealtruism.org)

Con­nor Leahy on Dy­ing with Dig­nity, EleutherAI and Conjecture

Michaël Trazzi22 Jul 2022 18:44 UTC
176 points
29 comments14 min readLW link
(theinsideview.ai)

Brain­storm of things that could force an AI team to burn their lead

So8res24 Jul 2022 23:58 UTC
103 points
4 comments13 min readLW link

Align­ment be­ing im­pos­si­ble might be bet­ter than it be­ing re­ally difficult

Martín Soto25 Jul 2022 23:57 UTC
12 points
2 comments2 min readLW link

AI ethics vs AI alignment

Wei Dai26 Jul 2022 13:08 UTC
4 points
1 comment1 min readLW link

NeurIPS ML Safety Work­shop 2022

Dan H26 Jul 2022 15:28 UTC
72 points
2 comments1 min readLW link
(neurips2022.mlsafety.org)

Quan­tum Ad­van­tage in Learn­ing from Experiments

Dennis Towne27 Jul 2022 15:49 UTC
5 points
5 comments1 min readLW link
(ai.googleblog.com)

AGI ruin sce­nar­ios are likely (and dis­junc­tive)

So8res27 Jul 2022 3:21 UTC
148 points
37 comments6 min readLW link

A Quick Note on AI Scal­ing Asymptotes

alyssavance25 May 2022 2:55 UTC
43 points
6 comments1 min readLW link

[Question] How likely do you think worse-than-ex­tinc­tion type fates to be?

span11 Aug 2022 4:08 UTC
3 points
3 comments1 min readLW link

[Question] I want to donate some money (not much, just what I can af­ford) to AGI Align­ment re­search, to what­ever or­ga­ni­za­tion has the best chance of mak­ing sure that AGI goes well and doesn’t kill us all. What are my best op­tions, where can I make the most differ­ence per dol­lar?

lumenwrites2 Aug 2022 12:08 UTC
15 points
9 comments1 min readLW link

Law-Fol­low­ing AI 4: Don’t Rely on Vi­car­i­ous Liability

Cullen2 Aug 2022 23:26 UTC
5 points
2 comments3 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tamera3 Aug 2022 12:03 UTC
103 points
22 comments6 min readLW link

Trans­former lan­guage mod­els are do­ing some­thing more general

Numendil3 Aug 2022 21:13 UTC
44 points
6 comments2 min readLW link

Three pillars for avoid­ing AGI catas­tro­phe: Tech­ni­cal al­ign­ment, de­ploy­ment de­ci­sions, and coordination

Alex Lintz3 Aug 2022 23:15 UTC
17 points
0 comments12 min readLW link

Sur­prised by ELK re­port’s coun­terex­am­ple to De­bate, IDA

Evan R. Murphy4 Aug 2022 2:12 UTC
18 points
0 comments5 min readLW link

Bias to­wards sim­ple func­tions; ap­pli­ca­tion to al­ign­ment?

DavidHolmes18 Aug 2022 16:15 UTC
3 points
7 comments2 min readLW link

What do ML re­searchers think about AI in 2022?

KatjaGrace4 Aug 2022 15:40 UTC
217 points
33 comments3 min readLW link
(aiimpacts.org)

Deon­tol­ogy and Tool AI

Nathan11235 Aug 2022 5:20 UTC
4 points
5 comments6 min readLW link

Bridg­ing Ex­pected Utility Max­i­miza­tion and Optimization

Whispermute5 Aug 2022 8:18 UTC
23 points
5 comments14 min readLW link

Coun­ter­fac­tu­als are Con­fus­ing be­cause of an On­tolog­i­cal Shift

Chris_Leong5 Aug 2022 19:03 UTC
17 points
35 comments2 min readLW link

A Data limited future

Donald Hobson6 Aug 2022 14:56 UTC
52 points
25 comments2 min readLW link

A Com­mu­nity for Un­der­stand­ing Con­scious­ness: Rais­ing r/​MathPie

Navjotツ7 Aug 2022 8:17 UTC
−12 points
0 comments3 min readLW link
(www.reddit.com)

Com­plex­ity No Bar to AI (Or, why Com­pu­ta­tional Com­plex­ity mat­ters less than you think for real life prob­lems)

Noosphere897 Aug 2022 19:55 UTC
17 points
14 comments3 min readLW link
(www.gwern.net)

A suffi­ciently para­noid pa­per­clip maximizer

RomanS8 Aug 2022 11:17 UTC
17 points
10 comments2 min readLW link

Steganog­ra­phy in Chain of Thought Reasoning

A Ray8 Aug 2022 3:47 UTC
49 points
13 comments6 min readLW link

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
111 points
8 comments3 min readLW link

How (not) to choose a re­search project

9 Aug 2022 0:26 UTC
76 points
11 comments7 min readLW link

Team Shard Sta­tus Report

David Udell9 Aug 2022 5:33 UTC
38 points
8 comments3 min readLW link

[Question] How would two su­per­in­tel­li­gent AIs in­ter­act, if they are un­al­igned with each other?

Nathan11239 Aug 2022 18:58 UTC
4 points
6 comments1 min readLW link

The Host Minds of HBO’s West­world.

Nerret12 Aug 2022 18:53 UTC
1 point
0 comments3 min readLW link

Anti-squat­ted AI x-risk do­mains index

plex12 Aug 2022 12:01 UTC
50 points
3 comments1 min readLW link

The Dumbest Pos­si­ble Gets There First

Artaxerxes13 Aug 2022 10:20 UTC
35 points
7 comments2 min readLW link

[Question] The OpenAI play­ground for GPT-3 is a ter­rible in­ter­face. Is there any great lo­cal (or web) app for ex­plor­ing/​learn­ing with lan­guage mod­els?

aviv13 Aug 2022 16:34 UTC
2 points
1 comment1 min readLW link

I missed the crux of the al­ign­ment prob­lem the whole time

zeshen13 Aug 2022 10:11 UTC
53 points
7 comments3 min readLW link

An Un­canny Prison

Nathan112313 Aug 2022 21:40 UTC
3 points
3 comments2 min readLW link

[Question] What is the prob­a­bil­ity that a su­per­in­tel­li­gent, sen­tient AGI is ac­tu­ally in­fea­si­ble?

Nathan112314 Aug 2022 22:41 UTC
−3 points
6 comments1 min readLW link

Re­in­force­ment Learn­ing Goal Mis­gen­er­al­iza­tion: Can we guess what kind of goals are se­lected by de­fault?

25 Oct 2022 20:48 UTC
9 points
1 comment4 min readLW link

What’s Gen­eral-Pur­pose Search, And Why Might We Ex­pect To See It In Trained ML Sys­tems?

johnswentworth15 Aug 2022 22:48 UTC
103 points
15 comments10 min readLW link

Dis­cov­er­ing Agents

zac_kenton18 Aug 2022 17:33 UTC
56 points
8 comments6 min readLW link

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC
42 points
22 comments1 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam Jermyn17 Aug 2022 20:52 UTC
32 points
9 comments4 min readLW link

De­bate AI and the De­ci­sion to Re­lease an AI

Chris_Leong17 Jan 2019 14:36 UTC
9 points
18 comments3 min readLW link

What’s the Least Im­pres­sive Thing GPT-4 Won’t be Able to Do

Algon20 Aug 2022 19:48 UTC
75 points
80 comments1 min readLW link

The Align­ment Prob­lem Needs More Pos­i­tive Fiction

Netcentrica21 Aug 2022 22:01 UTC
4 points
2 comments5 min readLW link

AI al­ign­ment as “nav­i­gat­ing the space of in­tel­li­gent be­havi­our”

Nora_Ammann23 Aug 2022 13:28 UTC
18 points
0 comments6 min readLW link

AGI Timelines Are Mostly Not Strate­gi­cally Rele­vant To Alignment

johnswentworth23 Aug 2022 20:15 UTC
44 points
35 comments1 min readLW link

[Question] Would you ask a ge­nie to give you the solu­tion to al­ign­ment?

sudo24 Aug 2022 1:29 UTC
6 points
1 comment1 min readLW link

Ethan Perez on the In­verse Scal­ing Prize, Lan­guage Feed­back and Red Teaming

Michaël Trazzi24 Aug 2022 16:35 UTC
25 points
0 comments3 min readLW link
(theinsideview.ai)

Prepar­ing for the apoc­a­lypse might help pre­vent it

Ocracoke25 Aug 2022 0:18 UTC
1 point
1 comment1 min readLW link

Your posts should be on arXiv

JanB25 Aug 2022 10:35 UTC
136 points
39 comments3 min readLW link

The Solomonoff prior is ma­lign. It’s not a big deal.

Charlie Steiner25 Aug 2022 8:25 UTC
38 points
9 comments7 min readLW link

AI strat­egy nearcasting

HoldenKarnofsky25 Aug 2022 17:26 UTC
79 points
3 comments9 min readLW link

Com­mon mis­con­cep­tions about OpenAI

Jacob_Hilton25 Aug 2022 14:02 UTC
226 points
138 comments5 min readLW link

AI Risk in Terms of Un­sta­ble Nu­clear Software

Thane Ruthenis26 Aug 2022 18:49 UTC
29 points
1 comment6 min readLW link

What’s the Most Im­pres­sive Thing That GPT-4 Could Plau­si­bly Do?

bayesed26 Aug 2022 15:34 UTC
23 points
24 comments1 min readLW link

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett Baker26 Aug 2022 18:26 UTC
117 points
48 comments1 min readLW link

An­nual AGI Bench­mark­ing Event

Lawrence Phillips27 Aug 2022 0:06 UTC
24 points
3 comments2 min readLW link
(www.metaculus.com)

Is there a benefit in low ca­pa­bil­ity AI Align­ment re­search?

Letti26 Aug 2022 23:51 UTC
1 point
1 comment2 min readLW link

Help Un­der­stand­ing Prefer­ences And Evil

Netcentrica27 Aug 2022 3:42 UTC
6 points
7 comments2 min readLW link

Solv­ing Align­ment by “solv­ing” semantics

Q Home27 Aug 2022 4:17 UTC
15 points
10 comments26 min readLW link

An In­tro­duc­tion to Cur­rent The­o­ries of Consciousness

hohenheim28 Aug 2022 17:55 UTC
59 points
44 comments49 min readLW link

*New* Canada AI Safety & Gover­nance community

Wyatt Tessari L'Allié29 Aug 2022 18:45 UTC
21 points
0 comments1 min readLW link

Are Gen­er­a­tive World Models a Mesa-Op­ti­miza­tion Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC
12 points
2 comments3 min readLW link

How might we al­ign trans­for­ma­tive AI if it’s de­vel­oped very soon?

HoldenKarnofsky29 Aug 2022 15:42 UTC
107 points
17 comments45 min readLW link

Wor­lds Where Iter­a­tive De­sign Fails

johnswentworth30 Aug 2022 20:48 UTC
144 points
26 comments10 min readLW link

[Question] How might we make bet­ter use of AI ca­pa­bil­ities re­search for al­ign­ment pur­poses?

Jemal Young31 Aug 2022 4:19 UTC
11 points
4 comments1 min readLW link

ML Model At­tri­bu­tion Challenge [Linkpost]

aogara30 Aug 2022 19:34 UTC
11 points
0 comments1 min readLW link
(mlmac.io)

I Tripped and Be­came GPT! (And How This Up­dated My Timelines)

Frankophone1 Sep 2022 17:56 UTC
31 points
0 comments4 min readLW link

[Question] Can some­one ex­plain to me why most re­searchers think al­ign­ment is prob­a­bly some­thing that is hu­manly tractable?

iamthouthouarti3 Sep 2022 1:12 UTC
32 points
11 comments1 min readLW link

An Up­date on Academia vs. In­dus­try (one year into my fac­ulty job)

David Scott Krueger (formerly: capybaralet)3 Sep 2022 20:43 UTC
118 points
18 comments4 min readLW link

Fram­ing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

A Game About AI Align­ment (& Meta-Ethics): What Are the Must Haves?

JonathanErhardt5 Sep 2022 7:55 UTC
18 points
13 comments2 min readLW link

Is train­ing data go­ing to be diluted by AI-gen­er­ated con­tent?

Hannes Thurnherr7 Sep 2022 18:13 UTC
10 points
7 comments1 min readLW link

Turn­ing What­sApp Chat Data into Prompt-Re­sponse Form for Fine-Tuning

casualphysicsenjoyer8 Sep 2022 20:05 UTC
1 point
0 comments1 min readLW link

[An email with a bunch of links I sent an ex­pe­rienced ML re­searcher in­ter­ested in learn­ing about Align­ment /​ x-safety.]

David Scott Krueger (formerly: capybaralet)8 Sep 2022 22:28 UTC
46 points
1 comment5 min readLW link

Mon­i­tor­ing for de­cep­tive alignment

evhub8 Sep 2022 23:07 UTC
118 points
7 comments9 min readLW link

Samotsvety’s AI risk forecasts

elifland9 Sep 2022 4:01 UTC
44 points
0 comments4 min readLW link

Ought will host a fac­tored cog­ni­tion “Lab Meet­ing”

9 Sep 2022 23:46 UTC
35 points
1 comment1 min readLW link

AI Risk In­tro 1: Ad­vanced AI Might Be Very Bad

11 Sep 2022 10:57 UTC
43 points
13 comments30 min readLW link

An in­ves­ti­ga­tion into when agents may be in­cen­tivized to ma­nipu­late our be­liefs.

Felix Hofstätter13 Sep 2022 17:08 UTC
15 points
0 comments14 min readLW link

Risk aver­sion and GPT-3

casualphysicsenjoyer13 Sep 2022 20:50 UTC
1 point
0 comments1 min readLW link

[Question] Would a Misal­igned SSI Really Kill Us All?

DragonGod14 Sep 2022 12:15 UTC
6 points
7 comments6 min readLW link

[Question] Why Do Peo­ple Think Hu­mans Are Stupid?

DragonGod14 Sep 2022 13:55 UTC
21 points
39 comments3 min readLW link

Pre­cise P(doom) isn’t very im­por­tant for pri­ori­ti­za­tion or strategy

harsimony14 Sep 2022 17:19 UTC
18 points
6 comments1 min readLW link

Co­or­di­nate-Free In­ter­pretabil­ity Theory

johnswentworth14 Sep 2022 23:33 UTC
41 points
14 comments5 min readLW link

Ca­pa­bil­ity and Agency as Corner­stones of AI risk — My cur­rent model

wilm15 Sep 2022 8:25 UTC
10 points
4 comments12 min readLW link

[Question] Are Hu­man Brains Univer­sal?

DragonGod15 Sep 2022 15:15 UTC
16 points
28 comments5 min readLW link

Should AI learn hu­man val­ues, hu­man norms or some­thing else?

Q Home17 Sep 2022 6:19 UTC
5 points
2 comments4 min readLW link

The ELK Fram­ing I’ve Used

sudo19 Sep 2022 10:28 UTC
4 points
1 comment1 min readLW link

[Question] If we have Hu­man-level chat­bots, won’t we end up be­ing ruled by pos­si­ble peo­ple?

Erlja Jkdf.20 Sep 2022 13:59 UTC
5 points
13 comments1 min readLW link

Char­ac­ter alignment

p.b.20 Sep 2022 8:27 UTC
22 points
0 comments2 min readLW link

Cryp­tocur­rency Ex­ploits Show the Im­por­tance of Proac­tive Poli­cies for AI X-Risk

eSpencer20 Sep 2022 17:53 UTC
1 point
0 comments4 min readLW link

Do­ing over­sight from the very start of train­ing seems hard

peterbarnett20 Sep 2022 17:21 UTC
14 points
3 comments3 min readLW link

Trends in Train­ing Dataset Sizes

Pablo Villalobos21 Sep 2022 15:47 UTC
24 points
2 comments5 min readLW link
(epochai.org)

Two rea­sons we might be closer to solv­ing al­ign­ment than it seems

24 Sep 2022 20:00 UTC
56 points
9 comments4 min readLW link

Fund­ing is All You Need: Get­ting into Grad School by Hack­ing the NSF GRFP Fellowship

hapanin22 Sep 2022 21:39 UTC
93 points
9 comments12 min readLW link

[Question] Papers to start get­ting into NLP-fo­cused al­ign­ment research

Feraidoon24 Sep 2022 23:53 UTC
6 points
0 comments1 min readLW link

How to Study Un­safe AGI’s safely (and why we might have no choice)

Punoxysm7 Mar 2014 7:24 UTC
10 points
47 comments5 min readLW link

On Generality

Eris Discordia26 Sep 2022 4:06 UTC
2 points
0 comments5 min readLW link

Oren’s Field Guide of Bad AGI Outcomes

Eris Discordia26 Sep 2022 4:06 UTC
0 points
0 comments1 min readLW link

Sum­mary of ML Safety Course

zeshen27 Sep 2022 13:05 UTC
6 points
0 comments6 min readLW link

My Thoughts on the ML Safety Course

zeshen27 Sep 2022 13:15 UTC
49 points
3 comments17 min readLW link

Re­ward IS the Op­ti­miza­tion Target

Carn28 Sep 2022 17:59 UTC
−1 points
3 comments5 min readLW link

A Library and Tu­to­rial for Fac­tored Cog­ni­tion with Lan­guage Models

28 Sep 2022 18:15 UTC
47 points
0 comments1 min readLW link

Will Values and Com­pe­ti­tion De­cou­ple?

interstice28 Sep 2022 16:27 UTC
15 points
11 comments17 min readLW link

Make-A-Video by Meta AI

P.29 Sep 2022 17:07 UTC
9 points
4 comments1 min readLW link
(makeavideo.studio)

Open ap­pli­ca­tion to be­come an AI safety pro­ject mentor

Charbel-Raphaël29 Sep 2022 11:27 UTC
7 points
0 comments1 min readLW link
(docs.google.com)

It mat­ters when the first sharp left turn happens

Adam Jermyn29 Sep 2022 20:12 UTC
35 points
9 comments4 min readLW link

Eli’s re­view of “Is power-seek­ing AI an ex­is­ten­tial risk?”

elifland30 Sep 2022 12:21 UTC
58 points
0 comments3 min readLW link
(docs.google.com)

[Question] Rank the fol­low­ing based on like­li­hood to nul­lify AI-risk

Aorou30 Sep 2022 11:15 UTC
3 points
1 comment4 min readLW link

Distri­bu­tion Shifts and The Im­por­tance of AI Safety

Leon Lang29 Sep 2022 22:38 UTC
17 points
2 comments12 min readLW link

[Question] What Is the Idea Be­hind (Un-)Su­per­vised Learn­ing and Re­in­force­ment Learn­ing?

Morpheus30 Sep 2022 16:48 UTC
9 points
6 comments2 min readLW link

(Struc­tural) Sta­bil­ity of Cou­pled Optimizers

Paul Bricman30 Sep 2022 11:28 UTC
25 points
0 comments10 min readLW link

Where I cur­rently dis­agree with Ryan Green­blatt’s ver­sion of the ELK approach

So8res29 Sep 2022 21:18 UTC
63 points
7 comments5 min readLW link

Paper: Large Lan­guage Models Can Self-im­prove [Linkpost]

Evan R. Murphy2 Oct 2022 1:29 UTC
52 points
14 comments1 min readLW link
(openreview.net)

[Question] Is there a cul­ture over­hang?

Aleksi Liimatainen3 Oct 2022 7:26 UTC
18 points
4 comments1 min readLW link

Vi­su­al­iz­ing Learned Rep­re­sen­ta­tions of Rice Disease

muhia_bee3 Oct 2022 9:09 UTC
7 points
0 comments4 min readLW link
(indecisive-sand-24a.notion.site)

If you want to learn tech­ni­cal AI safety, here’s a list of AI safety courses, read­ing lists, and resources

KatWoods3 Oct 2022 12:43 UTC
12 points
3 comments1 min readLW link

Frontline of AGI Align­ment

SD Marlow4 Oct 2022 3:47 UTC
−10 points
0 comments1 min readLW link
(robothouse.substack.com)

Hu­mans aren’t fit­ness maximizers

So8res4 Oct 2022 1:31 UTC
52 points
45 comments5 min readLW link

Smoke with­out fire is scary

Adam Jermyn4 Oct 2022 21:08 UTC
49 points
22 comments4 min readLW link

CHAI, As­sis­tance Games, And Fully-Up­dated Defer­ence [Scott Alexan­der]

lberglund4 Oct 2022 17:04 UTC
21 points
1 comment17 min readLW link
(astralcodexten.substack.com)

Gen­er­a­tive, Epi­sodic Ob­jec­tives for Safe AI

Michael Glass5 Oct 2022 23:18 UTC
11 points
3 comments8 min readLW link

[Linkpost] “Blueprint for an AI Bill of Rights”—Office of Science and Tech­nol­ogy Policy, USA (2022)

Fer32dwt34r3dfsz5 Oct 2022 16:42 UTC
8 points
4 comments2 min readLW link
(www.whitehouse.gov)

The Answer

Alex Beyman5 Oct 2022 21:23 UTC
−3 points
0 comments4 min readLW link

The prob­a­bil­ity that Ar­tifi­cial Gen­eral In­tel­li­gence will be de­vel­oped by 2043 is ex­tremely low.

cveres6 Oct 2022 18:05 UTC
−14 points
8 comments1 min readLW link

The Shape of Things to Come

Alex Beyman7 Oct 2022 16:11 UTC
12 points
3 comments8 min readLW link

The Slow Reveal

Alex Beyman9 Oct 2022 3:16 UTC
3 points
0 comments24 min readLW link

What does it mean for an AGI to be ‘safe’?

So8res7 Oct 2022 4:13 UTC
72 points
32 comments3 min readLW link

Boolean Prim­i­tives for Cou­pled Optimizers

Paul Bricman7 Oct 2022 18:02 UTC
9 points
0 comments8 min readLW link

Anal­y­sis: US re­stricts GPU sales to China

aogara7 Oct 2022 18:38 UTC
94 points
58 comments5 min readLW link

[Question] Bro­ken Links for the Au­dio Ver­sion of 2021 MIRI Conversations

Krieger8 Oct 2022 16:16 UTC
1 point
1 comment1 min readLW link

Don’t leave your finger­prints on the future

So8res8 Oct 2022 0:35 UTC
93 points
32 comments5 min readLW link

Let’s talk about un­con­trol­lable AI

Karl von Wendt9 Oct 2022 10:34 UTC
12 points
6 comments3 min readLW link

Les­sons learned from talk­ing to >100 aca­demics about AI safety

Marius Hobbhahn10 Oct 2022 13:16 UTC
207 points
16 comments12 min readLW link

When re­port­ing AI timelines, be clear who you’re (not) defer­ring to

Sam Clarke10 Oct 2022 14:24 UTC
37 points
3 comments1 min readLW link

Nat­u­ral Cat­e­gories Update

Logan Zoellner10 Oct 2022 15:19 UTC
29 points
6 comments2 min readLW link

Up­dates and Clarifications

SD Marlow11 Oct 2022 5:34 UTC
−5 points
1 comment1 min readLW link

My ar­gu­ment against AGI

cveres12 Oct 2022 6:33 UTC
3 points
5 comments1 min readLW link

In­stru­men­tal con­ver­gence in sin­gle-agent systems

12 Oct 2022 12:24 UTC
27 points
4 comments8 min readLW link
(www.gladstone.ai)

A strange twist on the road to AGI

cveres12 Oct 2022 23:27 UTC
−8 points
0 comments1 min readLW link

Perfect Enemy

Alex Beyman13 Oct 2022 8:23 UTC
−2 points
0 comments46 min readLW link

A stub­born un­be­liever fi­nally gets the depth of the AI al­ign­ment problem

aelwood13 Oct 2022 15:16 UTC
17 points
8 comments3 min readLW link
(pursuingreality.substack.com)

Misal­ign­ment-by-de­fault in multi-agent systems

13 Oct 2022 15:38 UTC
17 points
8 comments20 min readLW link
(www.gladstone.ai)

Nice­ness is unnatural

So8res13 Oct 2022 1:30 UTC
98 points
18 comments8 min readLW link

The Vi­talik Bu­terin Fel­low­ship in AI Ex­is­ten­tial Safety is open for ap­pli­ca­tions!

Cynthia Chen13 Oct 2022 18:32 UTC
21 points
0 comments1 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
21 points
4 comments8 min readLW link

Con­tra shard the­ory, in the con­text of the di­a­mond max­i­mizer problem

So8res13 Oct 2022 23:51 UTC
84 points
16 comments2 min readLW link

An­thro­po­mor­phic AI and Sand­boxed Vir­tual Uni­verses

jacob_cannell3 Sep 2010 19:02 UTC
4 points
124 comments5 min readLW link

In­stru­men­tal con­ver­gence: scale and phys­i­cal interactions

14 Oct 2022 15:50 UTC
15 points
0 comments17 min readLW link
(www.gladstone.ai)

Prov­ably Hon­est—A First Step

Srijanak De5 Nov 2022 19:18 UTC
10 points
2 comments8 min readLW link

They gave LLMs ac­cess to physics simulators

ryan_b17 Oct 2022 21:21 UTC
50 points
18 comments1 min readLW link
(arxiv.org)

De­ci­sion the­ory does not im­ply that we get to have nice things

So8res18 Oct 2022 3:04 UTC
142 points
53 comments26 min readLW link

[Question] How easy is it to su­per­vise pro­cesses vs out­comes?

Noosphere8918 Oct 2022 17:48 UTC
3 points
0 comments1 min readLW link

How To Make Pre­dic­tion Mar­kets Use­ful For Align­ment Work

johnswentworth18 Oct 2022 19:01 UTC
86 points
18 comments2 min readLW link

The re­ward func­tion is already how well you ma­nipu­late humans

Kerry19 Oct 2022 1:52 UTC
20 points
9 comments2 min readLW link

Co­op­er­a­tors are more pow­er­ful than agents

Ivan Vendrov21 Oct 2022 20:02 UTC
14 points
7 comments3 min readLW link

Log­i­cal De­ci­sion The­o­ries: Our fi­nal failsafe?

Noosphere8925 Oct 2022 12:51 UTC
−6 points
8 comments1 min readLW link
(www.lesswrong.com)

[Question] Sim­ple ques­tion about cor­rigi­bil­ity and val­ues in AI.

jmh22 Oct 2022 2:59 UTC
6 points
1 comment1 min readLW link

Newslet­ter for Align­ment Re­search: The ML Safety Updates

Esben Kran22 Oct 2022 16:17 UTC
14 points
0 comments1 min readLW link

“Origi­nal­ity is noth­ing but ju­di­cious imi­ta­tion”—Voltaire

Vestozia23 Oct 2022 19:00 UTC
0 points
0 comments13 min readLW link

AI re­searchers an­nounce Neu­roAI agenda

Cameron Berg24 Oct 2022 0:14 UTC
37 points
12 comments6 min readLW link
(arxiv.org)

AGI in our life­times is wish­ful thinking

niknoble24 Oct 2022 11:53 UTC
−4 points
21 comments8 min readLW link

ques­tion-an­swer coun­ter­fac­tual intervals

Tamsin Leake24 Oct 2022 13:08 UTC
8 points
0 comments4 min readLW link
(carado.moe)

Why some peo­ple be­lieve in AGI, but I don’t.

cveres26 Oct 2022 3:09 UTC
−15 points
6 comments1 min readLW link

[Question] Is the Orthog­o­nal­ity Th­e­sis true for hu­mans?

Noosphere8927 Oct 2022 14:41 UTC
12 points
18 comments1 min readLW link

Wor­ld­view iPeo­ple—Fu­ture Fund’s AI Wor­ld­view Prize

Toni MUENDEL28 Oct 2022 1:53 UTC
−22 points
4 comments9 min readLW link

Causal scrub­bing: Appendix

3 Dec 2022 0:58 UTC
16 points
0 comments20 min readLW link

Beyond Kol­mogorov and Shannon

25 Oct 2022 15:13 UTC
60 points
14 comments5 min readLW link

Method of state­ments: an al­ter­na­tive to taboo

Q Home16 Nov 2022 10:57 UTC
7 points
0 comments41 min readLW link

Causal Scrub­bing: a method for rigor­ously test­ing in­ter­pretabil­ity hy­pothe­ses [Red­wood Re­search]

3 Dec 2022 0:58 UTC
130 points
9 comments20 min readLW link

Some Les­sons Learned from Study­ing Indi­rect Ob­ject Iden­ti­fi­ca­tion in GPT-2 small

28 Oct 2022 23:55 UTC
86 points
5 comments9 min readLW link
(arxiv.org)

Causal scrub­bing: re­sults on a paren bal­ance checker

3 Dec 2022 0:59 UTC
26 points
0 comments30 min readLW link

AI as a Civ­i­liza­tional Risk Part 1/​6: His­tor­i­cal Priors

PashaKamyshev29 Oct 2022 21:59 UTC
2 points
2 comments7 min readLW link

AI as a Civ­i­liza­tional Risk Part 2/​6: Be­hav­ioral Modification

PashaKamyshev30 Oct 2022 16:57 UTC
9 points
0 comments10 min readLW link

AI as a Civ­i­liza­tional Risk Part 3/​6: Anti-econ­omy and Sig­nal Pollution

PashaKamyshev31 Oct 2022 17:03 UTC
7 points
4 comments14 min readLW link

AI as a Civ­i­liza­tional Risk Part 4/​6: Bioweapons and Philos­o­phy of Modification

PashaKamyshev1 Nov 2022 20:50 UTC
7 points
1 comment8 min readLW link

AI as a Civ­i­liza­tional Risk Part 5/​6: Re­la­tion­ship be­tween C-risk and X-risk

PashaKamyshev3 Nov 2022 2:19 UTC
2 points
0 comments7 min readLW link

AI as a Civ­i­liza­tional Risk Part 6/​6: What can be done

PashaKamyshev3 Nov 2022 19:48 UTC
2 points
3 comments4 min readLW link

Am I se­cretly ex­cited for AI get­ting weird?

porby29 Oct 2022 22:16 UTC
98 points
4 comments4 min readLW link

“Nor­mal” is the equil­ibrium state of past op­ti­miza­tion processes

Alex_Altair30 Oct 2022 19:03 UTC
77 points
5 comments5 min readLW link

love, not competition

Tamsin Leake30 Oct 2022 19:44 UTC
31 points
20 comments1 min readLW link
(carado.moe)

My (naive) take on Risks from Learned Optimization

Artyom Karpov31 Oct 2022 10:59 UTC
7 points
0 comments5 min readLW link

Embed­ding safety in ML development

zeshen31 Oct 2022 12:27 UTC
24 points
1 comment18 min readLW link

Au­dit­ing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC
28 points
1 comment7 min readLW link

pub­lish­ing al­ign­ment re­search and infohazards

Tamsin Leake31 Oct 2022 18:02 UTC
69 points
10 comments1 min readLW link
(carado.moe)

Cau­tion when in­ter­pret­ing Deep­mind’s In-con­text RL paper

Sam Marks1 Nov 2022 2:42 UTC
104 points
6 comments4 min readLW link

AGI and the fu­ture: Is a fu­ture with AGI and hu­mans al­ive ev­i­dence that AGI is not a threat to our ex­is­tence?

LetUsTalk1 Nov 2022 7:37 UTC
4 points
8 comments1 min readLW link

Threat Model Liter­a­ture Review

1 Nov 2022 11:03 UTC
55 points
4 comments25 min readLW link

Clar­ify­ing AI X-risk

1 Nov 2022 11:03 UTC
102 points
23 comments4 min readLW link

a ca­sual in­tro to AI doom and alignment

Tamsin Leake1 Nov 2022 16:38 UTC
12 points
0 comments4 min readLW link
(carado.moe)

[Question] Which Is­sues in Con­cep­tual Align­ment have been For­mal­ised or Ob­served (or not)?

ojorgensen1 Nov 2022 22:32 UTC
4 points
0 comments1 min readLW link

Ques­tions about Value Lock-in, Pa­ter­nal­ism, and Empowerment

Sam F. Brown16 Nov 2022 15:33 UTC
12 points
2 comments12 min readLW link
(sambrown.eu)

Why do we post our AI safety plans on the In­ter­net?

Peter S. Park3 Nov 2022 16:02 UTC
3 points
4 comments11 min readLW link

Mechanis­tic In­ter­pretabil­ity as Re­v­erse Eng­ineer­ing (fol­low-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)3 Nov 2022 23:19 UTC
28 points
3 comments1 min readLW link

[Question] Are al­ign­ment re­searchers de­vot­ing enough time to im­prov­ing their re­search ca­pac­ity?

Carson Jones4 Nov 2022 0:58 UTC
13 points
3 comments3 min readLW link

[Question] Don’t you think RLHF solves outer al­ign­ment?

Charbel-Raphaël4 Nov 2022 0:36 UTC
2 points
19 comments1 min readLW link

A new­comer’s guide to the tech­ni­cal AI safety field

zeshen4 Nov 2022 14:29 UTC
30 points
1 comment10 min readLW link

Toy Models and Tegum Products

Adam Jermyn4 Nov 2022 18:51 UTC
27 points
7 comments5 min readLW link

For ELK truth is mostly a distraction

c.trout4 Nov 2022 21:14 UTC
32 points
0 comments21 min readLW link

In­ter­pret­ing sys­tems as solv­ing POMDPs: a step to­wards a for­mal un­der­stand­ing of agency [pa­per link]

the gears to ascension5 Nov 2022 1:06 UTC
12 points
2 comments1 min readLW link
(www.semanticscholar.org)

When can a mimic sur­prise you? Why gen­er­a­tive mod­els han­dle seem­ingly ill-posed problems

David Johnston5 Nov 2022 13:19 UTC
8 points
4 comments16 min readLW link

The Slip­pery Slope from DALLE-2 to Deep­fake Anarchy

scasper5 Nov 2022 14:53 UTC
16 points
9 comments11 min readLW link

[Question] Can we get around Godel’s In­com­plete­ness the­o­rems and Tur­ing un­de­cid­able prob­lems via in­finite com­put­ers?

Noosphere895 Nov 2022 18:01 UTC
−10 points
12 comments1 min readLW link

Recom­mend HAIST re­sources for as­sess­ing the value of RLHF-re­lated al­ign­ment research

5 Nov 2022 20:58 UTC
26 points
9 comments3 min readLW link

[Question] Has any­one in­creased their AGI timelines?

Darren McKee6 Nov 2022 0:03 UTC
38 points
13 comments1 min readLW link

Ap­ply­ing su­per­in­tel­li­gence with­out col­lu­sion

Eric Drexler8 Nov 2022 18:08 UTC
88 points
57 comments4 min readLW link

A philoso­pher’s cri­tique of RLHF

TW1237 Nov 2022 2:42 UTC
55 points
8 comments2 min readLW link

4 Key As­sump­tions in AI Safety

Prometheus7 Nov 2022 10:50 UTC
20 points
5 comments7 min readLW link

Hacker-AI – Does it already ex­ist?

Erland Wittkotter7 Nov 2022 14:01 UTC
3 points
11 comments11 min readLW link

Loss of con­trol of AI is not a likely source of AI x-risk

squek7 Nov 2022 18:44 UTC
−6 points
0 comments5 min readLW link

Mys­ter­ies of mode collapse

janus8 Nov 2022 10:37 UTC
213 points
35 comments14 min readLW link

Some ad­vice on in­de­pen­dent research

Marius Hobbhahn8 Nov 2022 14:46 UTC
41 points
4 comments10 min readLW link

A first suc­cess story for Outer Align­ment: In­struc­tGPT

Noosphere898 Nov 2022 22:52 UTC
6 points
1 comment1 min readLW link
(openai.com)

A caveat to the Orthog­o­nal­ity Thesis

Wuschel Schulz9 Nov 2022 15:06 UTC
36 points
10 comments2 min readLW link

Try­ing to Make a Treach­er­ous Mesa-Optimizer

MadHatter9 Nov 2022 18:07 UTC
87 points
13 comments4 min readLW link
(attentionspan.blog)

Is full self-driv­ing an AGI-com­plete prob­lem?

kraemahz10 Nov 2022 2:04 UTC
5 points
5 comments1 min readLW link

The har­ness­ing of complexity

geduardo10 Nov 2022 18:44 UTC
6 points
2 comments3 min readLW link

[Question] Is there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 18:41 UTC
8 points
9 comments1 min readLW link

LessWrong Poll on AGI

Niclas Kupper10 Nov 2022 13:13 UTC
12 points
6 comments1 min readLW link

Value For­ma­tion: An Over­ar­ch­ing Model

Thane Ruthenis15 Nov 2022 17:16 UTC
27 points
9 comments34 min readLW link

[simu­la­tion] 4chan user claiming to be the at­tor­ney hired by Google’s sen­tient chat­bot LaMDA shares wild de­tails of encounter

janus10 Nov 2022 21:39 UTC
11 points
1 comment13 min readLW link
(generative.ink)

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC
28 points
9 comments2 min readLW link

Are fund­ing op­tions for AI Safety threat­ened? W45

11 Nov 2022 13:00 UTC
7 points
0 comments3 min readLW link
(newsletter.apartresearch.com)

How likely are ma­lign pri­ors over ob­jec­tives? [aborted WIP]

David Johnston11 Nov 2022 5:36 UTC
−2 points
0 comments8 min readLW link

Is AI Gain-of-Func­tion re­search a thing?

MadHatter12 Nov 2022 2:33 UTC
8 points
2 comments2 min readLW link

Vanessa Kosoy’s PreDCA, distilled

Martín Soto12 Nov 2022 11:38 UTC
16 points
17 comments5 min readLW link

fully al­igned sin­gle­ton as a solu­tion to everything

Tamsin Leake12 Nov 2022 18:19 UTC
6 points
2 comments2 min readLW link
(carado.moe)

Ways to buy time

12 Nov 2022 19:31 UTC
26 points
21 comments12 min readLW link

Char­ac­ter­iz­ing In­trin­sic Com­po­si­tion­al­ity in Trans­form­ers with Tree Projections

Ulisse Mini13 Nov 2022 9:46 UTC
12 points
2 comments1 min readLW link
(arxiv.org)

I (with the help of a few more peo­ple) am plan­ning to cre­ate an in­tro­duc­tion to AI Safety that a smart teenager can un­der­stand. What am I miss­ing?

Tapatakt14 Nov 2022 16:12 UTC
3 points
5 comments1 min readLW link

Will we run out of ML data? Ev­i­dence from pro­ject­ing dataset size trends

Pablo Villalobos14 Nov 2022 16:42 UTC
74 points
12 comments2 min readLW link
(epochai.org)

The limited up­side of interpretability

Peter S. Park15 Nov 2022 18:46 UTC
13 points
11 comments1 min readLW link

[Question] Is the speed of train­ing large mod­els go­ing to in­crease sig­nifi­cantly in the near fu­ture due to Cere­bras An­dromeda?

Amal 15 Nov 2022 22:50 UTC
11 points
11 comments1 min readLW link

Un­pack­ing “Shard The­ory” as Hunch, Ques­tion, The­ory, and Insight

Jacy Reese Anthis16 Nov 2022 13:54 UTC
29 points
9 comments2 min readLW link

The two con­cep­tions of Ac­tive In­fer­ence: an in­tel­li­gence ar­chi­tec­ture and a the­ory of agency

Roman Leventov16 Nov 2022 9:30 UTC
7 points
0 comments4 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

18 Nov 2022 1:43 UTC
72 points
6 comments3 min readLW link
(arxiv.org)

[Question] Is there any policy for a fair treat­ment of AIs whose friendli­ness is in doubt?

nahoj18 Nov 2022 19:01 UTC
15 points
9 comments1 min readLW link

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC
26 points
2 comments2 min readLW link

Mas­sive Scal­ing Should be Frowned Upon

harsimony17 Nov 2022 8:43 UTC
7 points
6 comments5 min readLW link

How AI Fails Us: A non-tech­ni­cal view of the Align­ment Problem

testingthewaters18 Nov 2022 19:02 UTC
7 points
0 comments2 min readLW link
(ethics.harvard.edu)

LLMs may cap­ture key com­po­nents of hu­man agency

catubc17 Nov 2022 20:14 UTC
21 points
0 comments4 min readLW link

AGIs may value in­trin­sic re­wards more than ex­trin­sic ones

catubc17 Nov 2022 21:49 UTC
8 points
6 comments4 min readLW link

The econ­omy as an anal­ogy for ad­vanced AI systems

15 Nov 2022 11:16 UTC
26 points
0 comments5 min readLW link

Cog­ni­tive sci­ence and failed AI fore­casts

Eleni Angelou24 Nov 2022 21:02 UTC
0 points
0 comments2 min readLW link

A Short Dialogue on the Mean­ing of Re­ward Functions

19 Nov 2022 21:04 UTC
40 points
0 comments3 min readLW link

[Question] Up­dates on scal­ing laws for foun­da­tion mod­els from ′ Tran­scend­ing Scal­ing Laws with 0.1% Ex­tra Com­pute’

Nick_Greig18 Nov 2022 12:46 UTC
15 points
2 comments1 min readLW link

Distil­la­tion of “How Likely Is De­cep­tive Align­ment?”

NickGabs18 Nov 2022 16:31 UTC
20 points
3 comments10 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

gen­er­al­ized wireheading

Tamsin Leake18 Nov 2022 20:18 UTC
21 points
7 comments2 min readLW link
(carado.moe)

By De­fault, GPTs Think In Plain Sight

Fabien Roger19 Nov 2022 19:15 UTC
60 points
16 comments9 min readLW link

ARC pa­per: For­mal­iz­ing the pre­sump­tion of independence

Erik Jenner20 Nov 2022 1:22 UTC
88 points
2 comments2 min readLW link
(arxiv.org)

Planes are still decades away from dis­plac­ing most bird jobs

guzey25 Nov 2022 16:49 UTC
156 points
13 comments3 min readLW link

Scott Aaron­son on “Re­form AI Align­ment”

Shmi20 Nov 2022 22:20 UTC
39 points
17 comments1 min readLW link
(scottaaronson.blog)

How Should AIS Re­late To Its Fun­ders? W46

21 Nov 2022 15:58 UTC
6 points
1 comment3 min readLW link
(newsletter.apartresearch.com)

Benefits/​Risks of Scott Aaron­son’s Ortho­dox/​Re­form Fram­ing for AI Alignment

Jeremyy21 Nov 2022 17:54 UTC
2 points
1 comment1 min readLW link

[Heb­bian Nat­u­ral Ab­strac­tions] Introduction

21 Nov 2022 20:34 UTC
34 points
3 comments4 min readLW link
(www.snellessen.com)

Mis­cel­la­neous First-Pass Align­ment Thoughts

NickGabs21 Nov 2022 21:23 UTC
12 points
4 comments10 min readLW link

Meta AI an­nounces Cicero: Hu­man-Level Di­plo­macy play (with di­alogue)

Jacy Reese Anthis22 Nov 2022 16:50 UTC
95 points
64 comments1 min readLW link
(www.science.org)

An­nounc­ing AI Align­ment Awards: $100k re­search con­tests about goal mis­gen­er­al­iza­tion & corrigibility

22 Nov 2022 22:19 UTC
69 points
20 comments4 min readLW link

Brute-forc­ing the uni­verse: a non-stan­dard shot at di­a­mond alignment

Martín Soto22 Nov 2022 22:36 UTC
6 points
0 comments20 min readLW link

Si­mu­la­tors, con­straints, and goal ag­nos­ti­cism: por­bynotes vol. 1

porby23 Nov 2022 4:22 UTC
36 points
2 comments35 min readLW link

Sets of ob­jec­tives for a multi-ob­jec­tive RL agent to optimize

23 Nov 2022 6:49 UTC
11 points
0 comments8 min readLW link

Hu­man-level Di­plo­macy was my fire alarm

Lao Mein23 Nov 2022 10:05 UTC
51 points
15 comments3 min readLW link

Ex nihilo

Hopkins Stanley23 Nov 2022 14:38 UTC
1 point
0 comments1 min readLW link

Cor­rigi­bil­ity Via Thought-Pro­cess Deference

Thane Ruthenis24 Nov 2022 17:06 UTC
13 points
5 comments9 min readLW link

Con­jec­ture: a ret­ro­spec­tive af­ter 8 months of work

23 Nov 2022 17:10 UTC
183 points
9 comments8 min readLW link

Con­jec­ture Se­cond Hiring Round

23 Nov 2022 17:11 UTC
85 points
0 comments1 min readLW link

In­ject­ing some num­bers into the AGI de­bate—by Boaz Barak

Jsevillamol23 Nov 2022 16:10 UTC
12 points
0 comments3 min readLW link
(windowsontheory.org)

Hu­man-level Full-Press Di­plo­macy (some bare facts).

Cleo Nardo22 Nov 2022 20:59 UTC
50 points
7 comments3 min readLW link

When AI solves a game, fo­cus on the game’s me­chan­ics, not its theme.

Cleo Nardo23 Nov 2022 19:16 UTC
81 points
7 comments2 min readLW link

[Question] What is the best source to ex­plain short AI timelines to a skep­ti­cal per­son?

trevor23 Nov 2022 5:19 UTC
4 points
4 comments1 min readLW link

Steer­ing Be­havi­our: Test­ing for (Non-)My­opia in Lan­guage Models

5 Dec 2022 20:28 UTC
37 points
16 comments10 min readLW link

The man and the tool

pedroalvarado25 Nov 2022 19:51 UTC
1 point
0 comments4 min readLW link

Gliders in Lan­guage Models

Alexandre Variengien25 Nov 2022 0:38 UTC
27 points
11 comments10 min readLW link

The AI Safety com­mu­nity has four main work groups, Strat­egy, Gover­nance, Tech­ni­cal and Move­ment Building

peterslattery25 Nov 2022 3:45 UTC
0 points
0 comments6 min readLW link

Us­ing mechanis­tic in­ter­pretabil­ity to find in-dis­tri­bu­tion failure in toy transformers

Charlie George28 Nov 2022 19:39 UTC
6 points
0 comments4 min readLW link

In­tu­itions by ML re­searchers may get pro­gres­sively worse con­cern­ing likely can­di­dates for trans­for­ma­tive AI

Viktor Rehnberg25 Nov 2022 15:49 UTC
7 points
0 comments2 min readLW link

Guardian AI (Misal­igned sys­tems are all around us.)

Jessica Rumbelow25 Nov 2022 15:55 UTC
15 points
6 comments2 min readLW link

Three Align­ment Schemas & Their Problems

Shoshannah Tekofsky26 Nov 2022 4:25 UTC
16 points
1 comment6 min readLW link

Re­ward Is Not Ne­c­es­sary: How To Create A Com­po­si­tional Self-Pre­serv­ing Agent For Life-Long Learning

Capybasilisk27 Nov 2022 14:05 UTC
3 points
0 comments1 min readLW link
(arxiv.org)

Re­view: LOVE in a simbox

PeterMcCluskey27 Nov 2022 17:41 UTC
32 points
4 comments9 min readLW link
(bayesianinvestor.com)

Su­per­in­tel­li­gent AI is nec­es­sary for an amaz­ing fu­ture, but far from sufficient

So8res31 Oct 2022 21:16 UTC
115 points
46 comments34 min readLW link

[Question] How to cor­rect for mul­ti­plic­ity with AI-gen­er­ated mod­els?

Lao Mein28 Nov 2022 3:51 UTC
4 points
0 comments1 min readLW link

Is Con­struc­tor The­ory a use­ful tool for AI al­ign­ment?

A.H.29 Nov 2022 12:35 UTC
11 points
8 comments26 min readLW link

Multi-Com­po­nent Learn­ing and S-Curves

30 Nov 2022 1:37 UTC
57 points
24 comments7 min readLW link

Sub­sets and quo­tients in interpretability

Erik Jenner2 Dec 2022 23:13 UTC
24 points
1 comment7 min readLW link

Ne­glected cause: au­to­mated fraud de­tec­tion in academia through image analysis

Lao Mein30 Nov 2022 5:52 UTC
10 points
1 comment2 min readLW link

AGI Im­pos­si­ble due to En­ergy Constraints

TheKlaus30 Nov 2022 18:48 UTC
−8 points
13 comments1 min readLW link

Master plan spec: needs au­dit (logic and co­op­er­a­tive AI)

Quinn30 Nov 2022 6:10 UTC
12 points
5 comments7 min readLW link

AI takeover table­top RPG: “The Treach­er­ous Turn”

Daniel Kokotajlo30 Nov 2022 7:16 UTC
51 points
3 comments1 min readLW link

Has AI gone too far?

Boston Anderson30 Nov 2022 18:49 UTC
−15 points
3 comments1 min readLW link

Seek­ing sub­mis­sions for short AI-safety course proposals

Sergio1 Dec 2022 0:32 UTC
3 points
0 comments1 min readLW link

Did ChatGPT just gaslight me?

TW1231 Dec 2022 5:41 UTC
123 points
45 comments9 min readLW link
(equonc.substack.com)

Safe Devel­op­ment of Hacker-AI Coun­ter­mea­sures – What if we are too late?

Erland Wittkotter1 Dec 2022 7:59 UTC
3 points
0 comments14 min readLW link

Re­search re­quest (al­ign­ment strat­egy): Deep dive on “mak­ing AI solve al­ign­ment for us”

JanB1 Dec 2022 14:55 UTC
16 points
3 comments1 min readLW link

[LINK] - ChatGPT discussion

JanB1 Dec 2022 15:04 UTC
13 points
7 comments1 min readLW link
(openai.com)

ChatGPT: First Impressions

specbug1 Dec 2022 16:36 UTC
18 points
2 comments13 min readLW link
(sixeleven.in)

Re-Ex­am­in­ing LayerNorm

Eric Winsor1 Dec 2022 22:20 UTC
100 points
8 comments5 min readLW link

Up­date on Har­vard AI Safety Team and MIT AI Alignment

2 Dec 2022 0:56 UTC
56 points
4 comments8 min readLW link

De­con­fus­ing Direct vs Amor­tised Optimization

beren2 Dec 2022 11:30 UTC
48 points
6 comments10 min readLW link

[ASoT] Fine­tun­ing, RL, and GPT’s world prior

Jozdien2 Dec 2022 16:33 UTC
31 points
8 comments5 min readLW link

Take­off speeds, the chimps anal­ogy, and the Cul­tural In­tel­li­gence Hypothesis

NickGabs2 Dec 2022 19:14 UTC
14 points
2 comments4 min readLW link

Non-Tech­ni­cal Prepa­ra­tion for Hacker-AI and Cy­ber­war 2.0+

Erland Wittkotter19 Dec 2022 11:42 UTC
2 points
0 comments25 min readLW link

Ap­ply for the ML Up­skil­ling Win­ter Camp in Cam­bridge, UK [2-10 Jan]

hannah wing-yee2 Dec 2022 20:45 UTC
3 points
0 comments2 min readLW link

Re­search Prin­ci­ples for 6 Months of AI Align­ment Studies

Shoshannah Tekofsky2 Dec 2022 22:55 UTC
22 points
3 comments6 min readLW link

Chat GPT’s views on Me­ta­physics and Ethics

Cole Killian3 Dec 2022 18:12 UTC
5 points
3 comments1 min readLW link
(twitter.com)

[Question] Will the first AGI agent have been de­signed as an agent (in ad­di­tion to an AGI)?

nahoj3 Dec 2022 20:32 UTC
1 point
8 comments1 min readLW link

Could an AI be Reli­gious?

mk544 Dec 2022 5:00 UTC
−12 points
14 comments1 min readLW link

ChatGPT seems over­con­fi­dent to me

qbolec4 Dec 2022 8:03 UTC
19 points
3 comments16 min readLW link

AI can ex­ploit safety plans posted on the Internet

Peter S. Park4 Dec 2022 12:17 UTC
−19 points
4 comments1 min readLW link

Race to the Top: Bench­marks for AI Safety

Isabella Duan4 Dec 2022 18:48 UTC
12 points
2 comments1 min readLW link

Take 3: No in­de­scrib­able heav­en­wor­lds.

Charlie Steiner4 Dec 2022 2:48 UTC
21 points
12 comments2 min readLW link

ChatGPT is set­tling the Chi­nese Room argument

averros4 Dec 2022 20:25 UTC
−7 points
4 comments1 min readLW link

AGI as a Black Swan Event

Stephen McAleese4 Dec 2022 23:00 UTC
8 points
8 comments7 min readLW link

Prob­a­bly good pro­jects for the AI safety ecosystem

Ryan Kidd5 Dec 2022 2:26 UTC
73 points
15 comments2 min readLW link

A ChatGPT story about ChatGPT doom

SurfingOrca5 Dec 2022 5:40 UTC
6 points
3 comments4 min readLW link

Aligned Be­hav­ior is not Ev­i­dence of Align­ment Past a Cer­tain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC
19 points
5 comments7 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
15 points
9 comments2 min readLW link

Anal­y­sis of AI Safety sur­veys for field-build­ing insights

Ash Jafari5 Dec 2022 19:21 UTC
10 points
2 comments4 min readLW link

Test­ing Ways to By­pass ChatGPT’s Safety Features

Robert_AIZI5 Dec 2022 18:50 UTC
6 points
2 comments5 min readLW link
(aizi.substack.com)

ChatGPT on Spielberg’s A.I. and AI Alignment

Bill Benzon5 Dec 2022 21:10 UTC
5 points
0 comments4 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterush6 Dec 2022 3:35 UTC
19 points
9 comments1 min readLW link

Neu­ral net­works bi­ased to­wards ge­o­met­ri­cally sim­ple func­tions?

DavidHolmes8 Dec 2022 16:16 UTC
16 points
2 comments3 min readLW link

Things roll downhill

awenonian6 Dec 2022 15:27 UTC
19 points
0 comments1 min readLW link

ChatGPT and the Hu­man Race

Ben Reilly6 Dec 2022 21:38 UTC
6 points
1 comment3 min readLW link

AI Safety in a Vuln­er­a­ble World: Re­quest­ing Feed­back on Pre­limi­nary Thoughts

Jordan Arel6 Dec 2022 22:35 UTC
3 points
2 comments3 min readLW link

In defense of prob­a­bly wrong mechanis­tic models

evhub6 Dec 2022 23:24 UTC
41 points
10 comments2 min readLW link

ChatGPT: “An er­ror oc­curred. If this is­sue per­sists...”

Bill Benzon7 Dec 2022 15:41 UTC
5 points
11 comments3 min readLW link

Where to be an AI Safety Pro­fes­sor

scasper7 Dec 2022 7:09 UTC
30 points
12 comments2 min readLW link

Thoughts on AGI or­ga­ni­za­tions and ca­pa­bil­ities work

7 Dec 2022 19:46 UTC
94 points
17 comments5 min readLW link

Riffing on the agent type

Quinn8 Dec 2022 0:19 UTC
16 points
0 comments4 min readLW link

Of pump­kins, the Fal­con Heavy, and Grou­cho Marx: High-Level dis­course struc­ture in ChatGPT

Bill Benzon8 Dec 2022 22:25 UTC
2 points
0 comments8 min readLW link

Why I’m Scep­ti­cal of Foom

DragonGod8 Dec 2022 10:01 UTC
19 points
26 comments3 min readLW link

If Went­worth is right about nat­u­ral ab­strac­tions, it would be bad for alignment

Wuschel Schulz8 Dec 2022 15:19 UTC
27 points
5 comments4 min readLW link

Take 7: You should talk about “the hu­man’s util­ity func­tion” less.

Charlie Steiner8 Dec 2022 8:14 UTC
47 points
22 comments2 min readLW link

Notes on OpenAI’s al­ign­ment plan

Alex Flint8 Dec 2022 19:13 UTC
47 points
5 comments7 min readLW link

We need to make scary AIs

Igor Ivanov9 Dec 2022 10:04 UTC
3 points
8 comments5 min readLW link

I Believe we are in a Hard­ware Overhang

nem8 Dec 2022 23:18 UTC
8 points
0 comments1 min readLW link

[Question] What are your thoughts on the fu­ture of AI-as­sisted soft­ware de­vel­op­ment?

RomanHauksson9 Dec 2022 10:04 UTC
4 points
2 comments1 min readLW link

ChatGPT’s Misal­ign­ment Isn’t What You Think

stavros9 Dec 2022 11:11 UTC
3 points
12 comments1 min readLW link

Si­mu­la­tors and Mindcrime

DragonGod9 Dec 2022 15:20 UTC
0 points
4 comments3 min readLW link

Work­ing to­wards AI al­ign­ment is better

Johannes C. Mayer9 Dec 2022 15:39 UTC
7 points
2 comments2 min readLW link

[Question] How would you im­prove ChatGPT’s fil­ter­ing?

Noah Scales10 Dec 2022 8:05 UTC
9 points
6 comments1 min readLW link

In­spira­tion as a Scarce Resource

zenbu zenbu zenbu zenbu10 Dec 2022 15:23 UTC
7 points
0 comments4 min readLW link
(inflorescence.substack.com)

Poll Re­sults on AGI

Niclas Kupper10 Dec 2022 21:25 UTC
10 points
0 comments2 min readLW link

The Op­por­tu­nity and Risks of Learn­ing Hu­man Values In-Context

Past Account10 Dec 2022 21:40 UTC
1 point
4 comments5 min readLW link

High level dis­course struc­ture in ChatGPT: Part 2 [Quasi-sym­bolic?]

Bill Benzon10 Dec 2022 22:26 UTC
7 points
0 comments6 min readLW link

ChatGPT goes through a worm­hole hole in our Shandyesque uni­verse [vir­tual wacky weed]

Bill Benzon11 Dec 2022 11:59 UTC
−1 points
2 comments3 min readLW link

Ques­tions about AI that bother me

Eleni Angelou11 Dec 2022 18:14 UTC
11 points
2 comments2 min readLW link

Reflec­tions on the PIBBSS Fel­low­ship 2022

11 Dec 2022 21:53 UTC
31 points
0 comments18 min readLW link

Bench­marks for Com­par­ing Hu­man and AI Intelligence

MrThink11 Dec 2022 22:06 UTC
8 points
4 comments2 min readLW link

a rough sketch of for­mal al­igned AI us­ing QACI

Tamsin Leake11 Dec 2022 23:40 UTC
14 points
0 comments4 min readLW link
(carado.moe)

Triv­ial GPT-3.5 limi­ta­tion workaround

Dave Lindbergh12 Dec 2022 8:42 UTC
5 points
4 comments1 min readLW link

[Question] Thought ex­per­i­ment. If hu­man minds could be har­nessed into one uni­ver­sal con­scious­ness of hu­man­ity, would we dis­cover things that have been quite difficult to reach with the means of mod­ern sci­ence? And would the con­scious­ness of hu­man­ity be more com­pre­hen­sive than the fu­ture power of ar­tifi­cial in­tel­li­gence?

lotta liedes12 Dec 2022 14:43 UTC
−1 points
0 comments1 min readLW link

Mean­ingful things are those the uni­verse pos­sesses a se­man­tics for

Abhimanyu Pallavi Sudhir12 Dec 2022 16:03 UTC
7 points
14 comments14 min readLW link

Let’s go meta: Gram­mat­i­cal knowl­edge and self-refer­en­tial sen­tences [ChatGPT]

Bill Benzon12 Dec 2022 21:50 UTC
5 points
0 comments9 min readLW link

[Question] Are law­suits against AGI com­pa­nies ex­tend­ing AGI timelines?

SlowingAGI13 Dec 2022 6:00 UTC
1 point
1 comment1 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam Scherlis13 Dec 2022 0:46 UTC
26 points
2 comments10 min readLW link

Re­vis­it­ing al­gorith­mic progress

13 Dec 2022 1:39 UTC
92 points
8 comments2 min readLW link
(arxiv.org)

Align­ment with ar­gu­ment-net­works and as­sess­ment-predictions

Tor Økland Barstad13 Dec 2022 2:17 UTC
7 points
3 comments45 min readLW link

Limits of Superintelligence

Aleksei Petrenko13 Dec 2022 12:19 UTC
1 point
0 comments1 min readLW link

[Question] Best in­tro­duc­tory overviews of AGI safety?

JakubK13 Dec 2022 19:01 UTC
14 points
5 comments2 min readLW link
(forum.effectivealtruism.org)

Seek­ing par­ti­ci­pants for study of AI safety researchers

joelegardner13 Dec 2022 21:58 UTC
2 points
0 comments1 min readLW link

Assess­ing the Ca­pa­bil­ities of ChatGPT through Suc­cess Rates

Past Account13 Dec 2022 21:16 UTC
5 points
0 comments2 min readLW link

Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Supervision

Xodarap14 Dec 2022 12:32 UTC
45 points
1 comment1 min readLW link
(arxiv.org)

all claw, no world — and other thoughts on the uni­ver­sal distribution

Tamsin Leake14 Dec 2022 18:55 UTC
14 points
0 comments7 min readLW link
(carado.moe)

Con­trary to List of Lethal­ity’s point 22, al­ign­ment’s door num­ber 2

False Name14 Dec 2022 22:01 UTC
0 points
1 comment22 min readLW link

ChatGPT has a HAL Problem

Paul Anderson14 Dec 2022 21:31 UTC
1 point
0 comments1 min readLW link

How “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion” Fits Into a Broader Align­ment Scheme

Collin15 Dec 2022 18:22 UTC
124 points
18 comments16 min readLW link

Avoid­ing Psy­cho­pathic AI

Cameron Berg19 Dec 2022 17:01 UTC
28 points
3 comments20 min readLW link

We’ve stepped over the thresh­old into the Fourth Arena, but don’t rec­og­nize it

Bill Benzon15 Dec 2022 20:22 UTC
2 points
0 comments7 min readLW link

AI Safety Move­ment Builders should help the com­mu­nity to op­ti­mise three fac­tors: con­trib­u­tors, con­tri­bu­tions and coordination

peterslattery15 Dec 2022 22:50 UTC
4 points
0 comments6 min readLW link

Proper scor­ing rules don’t guaran­tee pre­dict­ing fixed points

16 Dec 2022 18:22 UTC
55 points
5 comments21 min readLW link

A learned agent is not the same as a learn­ing agent

Ben Amitay16 Dec 2022 17:27 UTC
4 points
4 comments2 min readLW link

Ab­stract con­cepts and met­al­in­gual defi­ni­tion: Does ChatGPT un­der­stand jus­tice and char­ity?

Bill Benzon16 Dec 2022 21:01 UTC
2 points
0 comments13 min readLW link

Us­ing In­for­ma­tion The­ory to tackle AI Align­ment: A Prac­ti­cal Approach

Daniel Salami17 Dec 2022 1:37 UTC
6 points
4 comments8 min readLW link

Look­ing for an al­ign­ment tutor

JanB17 Dec 2022 19:08 UTC
15 points
2 comments1 min readLW link

What we owe the microbiome

weverka17 Dec 2022 19:40 UTC
2 points
0 comments1 min readLW link
(forum.effectivealtruism.org)

Bad at Arith­metic, Promis­ing at Math

cohenmacaulay18 Dec 2022 5:40 UTC
91 points
17 comments20 min readLW link

AGI is here, but no­body wants it. Why should we even care?

MGow20 Dec 2022 19:14 UTC
−20 points
0 comments17 min readLW link

Hacker-AI and Cy­ber­war 2.0+

Erland Wittkotter19 Dec 2022 11:46 UTC
2 points
0 comments15 min readLW link

Does ChatGPT’s perfor­mance war­rant work­ing on a tu­tor for chil­dren? [It’s time to take it to the lab.]

Bill Benzon19 Dec 2022 15:12 UTC
13 points
2 comments4 min readLW link
(new-savanna.blogspot.com)

Re­sults from a sur­vey on tool use and work­flows in al­ign­ment research

19 Dec 2022 15:19 UTC
50 points
2 comments19 min readLW link

Pro­lifer­at­ing Education

Haris Rashid20 Dec 2022 19:22 UTC
−1 points
2 comments5 min readLW link
(www.harisrab.com)

[Question] Will re­search in AI risk jinx it? Con­se­quences of train­ing AI on AI risk arguments

Yann Dubois19 Dec 2022 22:42 UTC
5 points
6 comments1 min readLW link

AGI Timelines in Gover­nance: Differ­ent Strate­gies for Differ­ent Timeframes

19 Dec 2022 21:31 UTC
47 points
15 comments10 min readLW link

(Ex­tremely) Naive Gra­di­ent Hack­ing Doesn’t Work

ojorgensen20 Dec 2022 14:35 UTC
6 points
0 comments6 min readLW link

An Open Agency Ar­chi­tec­ture for Safe Trans­for­ma­tive AI

davidad20 Dec 2022 13:04 UTC
18 points
12 comments4 min readLW link

Prop­er­ties of cur­rent AIs and some pre­dic­tions of the evolu­tion of AI from the per­spec­tive of scale-free the­o­ries of agency and reg­u­la­tive development

Roman Leventov20 Dec 2022 17:13 UTC
7 points
0 comments36 min readLW link

I be­lieve some AI doomers are overconfident

FTPickle20 Dec 2022 17:09 UTC
10 points
14 comments2 min readLW link

Perform­ing an SVD on a time-se­ries ma­trix of gra­di­ent up­dates on an MNIST net­work pro­duces 92.5 sin­gu­lar values

Garrett Baker21 Dec 2022 0:44 UTC
8 points
10 comments5 min readLW link

CIRL Cor­rigi­bil­ity is Fragile

21 Dec 2022 1:40 UTC
21 points
1 comment12 min readLW link

New AI risk in­tro from Vox [link post]

JakubK21 Dec 2022 6:00 UTC
5 points
1 comment2 min readLW link
(www.vox.com)