Killswitch

Junio
18 Nov 2023 22:53 UTC
2 points
0 comments
3 min read
LW link

Superalignment

Douglas_Reay
18 Nov 2023 22:37 UTC
−4 points
4 comments
1 min read
LW link
(openai.com)

Predictable Defect-Cooperate?

quetzal_rainbow
18 Nov 2023 15:38 UTC
7 points
1 comment
2 min read
LW link

I think I’m just confused. Once a model exists, how do you “red-team” it to see whether it’s safe. Isn’t it already dangerous?

FTPickle
18 Nov 2023 14:16 UTC
21 points
13 comments
1 min read
LW link

AI Safety Camp 2024

Linda Linsefors
18 Nov 2023 10:37 UTC
15 points
1 comment
4 min read
LW link
(aisafety.camp)

Post-EAG Music Party

jefftk
18 Nov 2023 3:00 UTC
14 points
2 comments
2 min read
LW link
(www.jefftk.com)

Letter to a Sonoma County Jail Cell

MadHatter
18 Nov 2023 2:24 UTC
11 points
1 comment
1 min read
LW link
(open.substack.com)

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley
17 Nov 2023 20:55 UTC
16 points
8 comments
15 min read
LW link

Sam Altman fired from OpenAI

LawrenceC
17 Nov 2023 20:42 UTC
192 points
75 comments
1 min read
LW link
(openai.com)

On the lethality of biased human reward ratings

17 Nov 2023 18:59 UTC
48 points
10 comments
37 min read
LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger
17 Nov 2023 17:58 UTC
85 points
9 comments
11 min read
LW link
1 review

On Lies and Liars

Gabriel Alfour
17 Nov 2023 17:13 UTC
33 points
4 comments
14 min read
LW link
(cognition.cafe)

Classifying representations of sparse autoencoders (SAEs)

Annah
17 Nov 2023 13:54 UTC
15 points
6 comments
2 min read
LW link

R&D is a Huge Externality, So Why Do Markets Do So Much of it?

Maxwell Tabarrok
17 Nov 2023 13:14 UTC
15 points
14 comments
3 min read
LW link
(maximumprogress.substack.com)

On excluding dangerous information from training

ShayBenMoshe
17 Nov 2023 11:14 UTC
23 points
5 comments
3 min read
LW link

The dangers of reproducing while old

garymm
17 Nov 2023 5:55 UTC
23 points
6 comments
1 min read
LW link
(www.garymm.org)

I put odds on ends with Nathan Young

KatjaGrace
17 Nov 2023 5:40 UTC
8 points
0 comments
1 min read
LW link
(worldspiritsockpuppet.com)

Debate helps supervise human experts [Paper]

habryka
17 Nov 2023 5:25 UTC
29 points
6 comments
1 min read
LW link
(github.com)

A to Z of things

KatjaGrace
17 Nov 2023 5:20 UTC
71 points
8 comments
1 min read
LW link
1 review
(worldspiritsockpuppet.com)

On Tapping Out

Screwtape
17 Nov 2023 3:23 UTC
50 points
14 comments
8 min read
LW link
1 review

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi
17 Nov 2023 2:36 UTC
6 points
0 comments
5 min read
LW link

Some Rules for an Algebra of Bayes Nets

16 Nov 2023 23:53 UTC
77 points
38 comments
14 min read
LW link
1 review

How much to update on recent AI governance moves?

16 Nov 2023 23:46 UTC
112 points
5 comments
29 min read
LW link

New LessWrong feature: Dialogue Matching

jacobjacob
16 Nov 2023 21:27 UTC
106 points
22 comments
3 min read
LW link

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments
1 min read
LW link
(arxiv.org)

Social Dark Matter

Duncan Sabien (Deactivated)
16 Nov 2023 20:00 UTC
326 points
116 comments
34 min read
LW link
1 review

AI #38: Let’s Make a Deal

Zvi
16 Nov 2023 19:50 UTC
44 points
2 comments
55 min read
LW link
(thezvi.wordpress.com)

Forecasting AI (Overview)

jsteinhardt
16 Nov 2023 19:00 UTC
35 points
0 comments
2 min read
LW link
(bounded-regret.ghost.io)

We Should Talk About This More. Epistemic World Collapse as Imminent Safety Risk of Generative AI.

Joerg Weiss
16 Nov 2023 18:46 UTC
11 points
2 comments
29 min read
LW link

Intelligence in systems (human, AI) can be conceptualized as the resolution and throughput at which a system can process and affect Shannon information.

AiresJL
16 Nov 2023 17:46 UTC
0 points
0 comments
2 min read
LW link

Life on the Grid (Part 2)

rogersbacon
16 Nov 2023 17:22 UTC
7 points
0 comments
15 min read
LW link
(www.secretorum.life)

The impossibility of rationally analyzing partisan news

RationalDino
16 Nov 2023 16:19 UTC
4 points
4 comments
1 min read
LW link

We are Peacecraft.ai!

MadHatter
16 Nov 2023 14:15 UTC
15 points
20 comments
2 min read
LW link

A dialectical view of the history of AI, Part 1: We’re only in the antithesis phase. [A synthesis is in the future.]

Bill Benzon
16 Nov 2023 12:34 UTC
6 points
0 comments
12 min read
LW link

[Question] How much fraud is there in academia?

ChristianKl
16 Nov 2023 11:50 UTC
23 points
10 comments
1 min read
LW link

Learning coefficient estimation: the details

Zach Furman
16 Nov 2023 3:19 UTC
36 points
0 comments
2 min read
LW link
(colab.research.google.com)

[Question] AI Safety orgs- what’s your biggest bottleneck right now?

Kabir Kumar
16 Nov 2023 2:02 UTC
1 point
0 comments
1 min read
LW link

My critique of Eliezer’s deeply irrational beliefs

Jorterder
16 Nov 2023 0:34 UTC
−33 points
1 comment
9 min read
LW link
(docs.google.com)

Extrapolating from Five Words

Gordon Seidoh Worley
15 Nov 2023 23:21 UTC
40 points
11 comments
2 min read
LW link

In Defense of Parselmouths

Screwtape
15 Nov 2023 23:02 UTC
48 points
10 comments
10 min read
LW link

Life on the Grid (Part 1)

rogersbacon
15 Nov 2023 22:37 UTC
12 points
4 comments
9 min read
LW link
(www.secretorum.life)

Glomarization FAQ

Zane
15 Nov 2023 20:20 UTC
30 points
5 comments
5 min read
LW link

Testbed evals: evaluating AI safety even when it can’t be directly measured

joshc
15 Nov 2023 19:00 UTC
71 points
2 comments
4 min read
LW link

EA/ACX/LW November Santa Cruz Meetup

madmail
15 Nov 2023 18:39 UTC
1 point
0 comments
1 min read
LW link

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe Carlsmith
15 Nov 2023 17:16 UTC
80 points
26 comments
30 min read
LW link

Large Language Models can Strategically Deceive their Users when Put Under Pressure.

ReaderM
15 Nov 2023 16:36 UTC
89 points
9 comments
2 min read
LW link
1 review
(arxiv.org)

AISN #26: National Institutions for AI Safety, Results From the UK Summit, and New Releases From OpenAI and xAI

15 Nov 2023 16:07 UTC
13 points
0 comments
6 min read
LW link
(newsletter.safe.ai)

‘Theories of Values’ and ‘Theories of Agents’: confusions, musings and desiderata

15 Nov 2023 16:00 UTC
35 points
8 comments
24 min read
LW link

Experiences and learnings from both sides of the AI safety job market

Marius Hobbhahn
15 Nov 2023 15:40 UTC
110 points
4 comments
18 min read
LW link

Good businesses create epistemic monopolies

Logan Kieller
15 Nov 2023 14:04 UTC
−2 points
2 comments
4 min read
LW link
(logankieller.substack.com)