Against Almost Every Theory of Impact of Interpretability

Charbel-Raphaël · Aug 17, 2023, 6:44 PM
329 points
90 comments · 26 min read · LW link · 2 reviews

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
30 comments · 18 min read · LW link · 1 review

Dear Self; we need to talk about ambition

Elizabeth · Aug 27, 2023, 11:10 PM
270 points
28 comments · 8 min read · LW link · 2 reviews
(acesounderglass.com)

My current LK99 questions

Eliezer Yudkowsky · Aug 1, 2023, 10:48 PM
206 points
38 comments · 5 min read · LW link

Feedbackloop-first Rationality

Raemon · Aug 7, 2023, 5:58 PM
203 points
67 comments · 8 min read · LW link · 2 reviews

Large Language Models will be Great for Censorship

Ethan Edwards · Aug 21, 2023, 7:03 PM
185 points
14 comments · 8 min read · LW link
(ethanedwards.substack.com)

OpenAI API base models are not sycophantic, at any size

nostalgebraist · Aug 29, 2023, 12:58 AM
183 points
20 comments · 2 min read · LW link
(colab.research.google.com)

A list of core AI safety problems and how I hope to solve them

davidad · Aug 26, 2023, 3:12 PM
165 points
29 comments · 5 min read · LW link

Password-locked models: a stress case for capabilities evaluation

Fabien Roger · Aug 3, 2023, 2:53 PM
156 points
14 comments · 6 min read · LW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes · Aug 1, 2023, 6:30 PM
153 points
12 comments · 5 min read · LW link
(evals.alignment.org)

Assume Bad Faith

Zack_M_Davis · Aug 25, 2023, 5:36 PM
151 points
63 comments · 7 min read · LW link · 3 reviews

The U.S. is becoming less stable

lc · Aug 18, 2023, 9:13 PM
147 points
68 comments · 2 min read · LW link

6 non-obvious mental health issues specific to AI safety

Igor Ivanov · Aug 18, 2023, 3:46 PM
147 points
24 comments · 4 min read · LW link

The “public debate” about AI is confusing for the general public and for policymakers because it is a three-sided debate

Adam David Long · Aug 1, 2023, 12:08 AM
146 points
30 comments · 4 min read · LW link

Responses to apparent rationalist confusions about game / decision theory

Anthony DiGiovanni · Aug 30, 2023, 10:02 PM
142 points
20 comments · 12 min read · LW link · 1 review

Inflection.ai is a major AGI lab

Nikola Jurkovic · Aug 9, 2023, 1:05 AM
137 points
13 comments · 2 min read · LW link

Ten Thousand Years of Solitude

agp · Aug 15, 2023, 5:45 PM
136 points
19 comments · 4 min read · LW link
(www.discovermagazine.com)

Invulnerable Incomplete Preferences: A Formal Statement

SCP · Aug 30, 2023, 9:59 PM
134 points
39 comments · 35 min read · LW link

Book Launch: “The Carving of Reality,” Best of LessWrong vol. III

Raemon · Aug 16, 2023, 11:52 PM
131 points
22 comments · 5 min read · LW link

When discussing AI risks, talk about capabilities, not intelligence

Vika · Aug 11, 2023, 1:38 PM
124 points
7 comments · 3 min read · LW link
(vkrakovna.wordpress.com)

Introducing the Center for AI Policy (& we’re hiring!)

Thomas Larsen · Aug 28, 2023, 9:17 PM
123 points
50 comments · 2 min read · LW link
(www.aipolicy.us)

Report on Frontier Model Training

YafahEdelman · Aug 30, 2023, 8:02 PM
122 points
21 comments · 21 min read · LW link
(docs.google.com)

Summary of and Thoughts on the Hotz/Yudkowsky Debate

Zvi · Aug 16, 2023, 4:50 PM
105 points
47 comments · 9 min read · LW link
(thezvi.wordpress.com)

Biosecurity Culture, Computer Security Culture

jefftk · Aug 30, 2023, 4:40 PM
103 points
11 comments · 2 min read · LW link
(www.jefftk.com)

A Theory of Laughter

Steven Byrnes · Aug 23, 2023, 3:05 PM
102 points
14 comments · 28 min read · LW link

[Question] Exercise: Solve “Thinking Physics”

Raemon · Aug 1, 2023, 12:44 AM
101 points
30 comments · 5 min read · LW link · 1 review

What’s A “Market”?

johnswentworth · Aug 8, 2023, 11:29 PM
94 points
16 comments · 10 min read · LW link

Biological Anchors: The Trick that Might or Might Not Work

Scott Alexander · Aug 12, 2023, 12:53 AM
91 points
3 comments · 33 min read · LW link
(astralcodexten.substack.com)

LTFF and EAIF are unusually funding-constrained right now

Aug 30, 2023, 1:03 AM
90 points
24 comments · 15 min read · LW link
(forum.effectivealtruism.org)

We Should Prepare for a Larger Representation of Academia in AI Safety

Leon Lang · Aug 13, 2023, 6:03 PM
90 points
14 comments · 5 min read · LW link

Problems with Robin Hanson’s Quillette Article On AI

DaemonicSigil · Aug 6, 2023, 10:13 PM
89 points
33 comments · 8 min read · LW link

Dating Roundup #1: This is Why You’re Single

Zvi · Aug 29, 2023, 12:50 PM
87 points
28 comments · 38 min read · LW link
(thezvi.wordpress.com)

Decomposing independent generalizations in neural networks via Hessian analysis

Aug 14, 2023, 5:04 PM
84 points
4 comments · 1 min read · LW link

My checklist for publishing a blog post

Steven Byrnes · Aug 15, 2023, 3:04 PM
84 points
6 comments · 3 min read · LW link

Stepping down as moderator on LW

Kaj_Sotala · Aug 14, 2023, 10:46 AM
82 points
1 comment · 1 min read · LW link

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

Aug 23, 2023, 9:12 PM
82 points
1 comment · 13 min read · LW link

Long-Term Future Fund: April 2023 grant recommendations

Aug 2, 2023, 7:54 AM
81 points
3 comments · 50 min read · LW link

The God of Humanity, and the God of the Robot Utilitarians

Raemon · Aug 24, 2023, 8:27 AM
79 points
13 comments · 2 min read · LW link · 1 review

The Economics of the Asteroid Deflection Problem (Dominant Assurance Contracts)

moyamo · Aug 29, 2023, 6:28 PM
78 points
71 comments · 15 min read · LW link

Digital brains beat biological ones because diffusion is too slow

GeneSmith · Aug 26, 2023, 2:22 AM
78 points
21 comments · 5 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Aug 29, 2023, 1:04 AM UTC
77 points
4 comments · 1 min read · LW link

A Proof of Löb’s Theorem using Computability Theory

jessicata · Aug 16, 2023, 6:57 PM UTC
76 points
0 comments · 17 min read · LW link
(unstableontology.com)

Computational Thread Art

CallumMcDougall · Aug 6, 2023, 9:42 PM UTC
76 points
2 comments · 6 min read · LW link

A plea for more funding shortfall transparency

porby · Aug 7, 2023, 9:33 PM UTC
73 points
4 comments · 2 min read · LW link

AI Forecasting: Two Years In

jsteinhardt · Aug 19, 2023, 11:40 PM UTC
72 points
15 comments · 11 min read · LW link
(bounded-regret.ghost.io)

AI pause/governance advocacy might be net-negative, especially without a focus on explaining x-risk

Mikhail Samin · Aug 27, 2023, 11:05 PM UTC
72 points
9 comments · 6 min read · LW link

Aumann-agreement is common

tailcalled · Aug 26, 2023, 8:22 PM UTC
71 points
33 comments · 7 min read · LW link · 1 review

When Omnipotence is Not Enough

lsusr · Aug 25, 2023, 7:50 PM UTC
71 points
4 comments · 2 min read · LW link · 1 review

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery · Aug 9, 2023, 7:06 AM UTC
69 points
20 comments · 12 min read · LW link

3 levels of threat obfuscation

HoldenKarnofsky · Aug 2, 2023, 2:58 PM UTC
69 points
14 comments · 7 min read · LW link