How I’d like alignment to get done (as of 2024-10-18)

TristanTrim · 18 Oct 2024 23:39 UTC
11 points
4 comments · 4 min read · LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
93 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

D&D Sci Coliseum: Arena of Data

aphyer · 18 Oct 2024 22:02 UTC
41 points
23 comments · 4 min read · LW link

the Daydication technique

chaosmage · 18 Oct 2024 21:47 UTC
27 points
0 comments · 2 min read · LW link

[Linkpost] Hawkish nationalism vs international AI power and benefit sharing

18 Oct 2024 18:13 UTC
7 points
5 comments · 1 min read · LW link
(nacicankaya.substack.com)

LLM Psychometrics and Prompt-Induced Psychopathy

Korbinian K. · 18 Oct 2024 18:11 UTC
12 points
2 comments · 10 min read · LW link

A short project on Mamba: grokking & interpretability

Alejandro Tlaie · 18 Oct 2024 16:59 UTC
21 points
0 comments · 6 min read · LW link

LLMs can learn about themselves by introspection

18 Oct 2024 16:12 UTC
102 points
38 comments · 9 min read · LW link

[Question] Are there more than 12 paths to Superintelligence?

p4rziv4l · 18 Oct 2024 16:05 UTC
−3 points
0 comments · 1 min read · LW link

Low Probability Estimation in Language Models

Gabriel Wu · 18 Oct 2024 15:50 UTC
50 points
0 comments · 10 min read · LW link
(www.alignment.org)

The Mysterious Trump Buyers on Polymarket

Annapurna · 18 Oct 2024 13:26 UTC
52 points
10 comments · 2 min read · LW link
(jorgevelez.substack.com)

On Intentionality, or: Towards a More Inclusive Concept of Lying

Cornelius Dybdahl · 18 Oct 2024 10:37 UTC
8 points
0 comments · 4 min read · LW link

Species as Canonical Referents of Super-Organisms

Yudhister Kumar · 18 Oct 2024 7:49 UTC
9 points
8 comments · 2 min read · LW link
(www.yudhister.me)

NAO Updates, Fall 2024

jefftk · 18 Oct 2024 0:00 UTC
32 points
2 comments · 1 min read · LW link
(naobservatory.org)

You’re Playing a Rough Game

jefftk · 17 Oct 2024 19:20 UTC
25 points
2 comments · 2 min read · LW link
(www.jefftk.com)

P=NP

OnePolynomial · 17 Oct 2024 17:56 UTC
−25 points
0 comments · 8 min read · LW link

Factoring P(doom) into a Bayesian network

Joseph Gardi · 17 Oct 2024 17:55 UTC
1 point
0 comments · 1 min read · LW link

understanding bureaucracy

dhruvmethi · 17 Oct 2024 17:55 UTC
1 point
2 comments · 8 min read · LW link

AI #86: Just Think of the Potential

Zvi · 17 Oct 2024 15:10 UTC
58 points
8 comments · 57 min read · LW link
(thezvi.wordpress.com)

Concrete benefits of making predictions

17 Oct 2024 14:23 UTC
32 points
5 comments · 6 min read · LW link
(fatebook.io)

Arithmetic is an underrated world-modeling technology

dynomight · 17 Oct 2024 14:00 UTC
146 points
32 comments · 6 min read · LW link
(dynomight.net)

The Computational Complexity of Circuit Discovery for Inner Interpretability

Bogdan Ionut Cirstea · 17 Oct 2024 13:18 UTC
11 points
2 comments · 1 min read · LW link
(arxiv.org)

[Question] is there a big dictionary somewhere with all your jargon and acronyms and whatnot?

KvmanThinking · 17 Oct 2024 11:30 UTC
4 points
7 comments · 1 min read · LW link

[Question] Is there a known method to find others who came across the same potential infohazard without spoiling it to the public?

hive · 17 Oct 2024 10:47 UTC
4 points
6 comments · 1 min read · LW link

It is time to start war gaming for AGI

yanni kyriacos · 17 Oct 2024 5:14 UTC
4 points
1 comment · 1 min read · LW link

[Question] Reinforcement Learning: Essential Step Towards AGI or Irrelevant?

Double · 17 Oct 2024 3:37 UTC
1 point
0 comments · 1 min read · LW link

[Question] EndeavorOTC legit?

FinalFormal2 · 17 Oct 2024 1:33 UTC
3 points
0 comments · 1 min read · LW link

The Cognitive Bootcamp Agreement

Raemon · 16 Oct 2024 23:24 UTC
34 points
0 comments · 9 min read · LW link

Bitter lessons about lucid dreaming

avturchin · 16 Oct 2024 21:27 UTC
77 points
62 comments · 2 min read · LW link

Towards Quantitative AI Risk Management

16 Oct 2024 19:26 UTC
28 points
1 comment · 6 min read · LW link

Why Academia is Mostly Not Truth-Seeking

Zero Contradictions · 16 Oct 2024 19:14 UTC
−6 points
6 comments · 1 min read · LW link
(thewaywardaxolotl.blogspot.com)

Launching Adjacent News

Lucas Kohorst · 16 Oct 2024 17:58 UTC
23 points
0 comments · 4 min read · LW link

[Question] Interest in Leetcode, but for Rationality?

Gregory · 16 Oct 2024 17:54 UTC
74 points
20 comments · 2 min read · LW link

Request for advice: Research for Conversational Game Theory for LLMs

Rome Viharo · 16 Oct 2024 17:53 UTC
10 points
0 comments · 1 min read · LW link

Why humans won’t control superhuman AIs.

Spiritus Dei · 16 Oct 2024 16:48 UTC
−11 points
1 comment · 6 min read · LW link

Against empathy-by-default

Steven Byrnes · 16 Oct 2024 16:38 UTC
60 points
24 comments · 7 min read · LW link

cancer rates after gene therapy

bhauth · 16 Oct 2024 15:32 UTC
49 points
0 comments · 3 min read · LW link
(bhauth.com)

Monthly Roundup #23: October 2024

Zvi · 16 Oct 2024 13:50 UTC
39 points
13 comments · 50 min read · LW link
(thezvi.wordpress.com)

[Question] Change My Mind: Thirders in “Sleeping Beauty” are Just Doing Epistemology Wrong

DragonGod · 16 Oct 2024 10:20 UTC
8 points
67 comments · 6 min read · LW link

[Question] After uploading your consciousness...

Jinge Wang · 16 Oct 2024 3:52 UTC
−2 points
0 comments · 1 min read · LW link

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko · 16 Oct 2024 1:24 UTC
10 points
18 comments · 1 min read · LW link
(transhumanaxiology.substack.com)

Bellevue Meetup

Cedar · 16 Oct 2024 1:07 UTC
3 points
0 comments · 1 min read · LW link

Singular Learning Theory for Dummies

Rahul Chand · 15 Oct 2024 21:13 UTC
2 points
0 comments · 8 min read · LW link

Distillation Of DeepSeek-Prover V1.5

IvanLin · 15 Oct 2024 18:53 UTC
4 points
1 comment · 3 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
27 points
0 comments · 18 min read · LW link

Taking nonlogical concepts seriously

Kris Brown · 15 Oct 2024 18:16 UTC
7 points
5 comments · 18 min read · LW link
(topos.site)

Rashomon—A newsbetting site

ideasthete · 15 Oct 2024 18:15 UTC
23 points
8 comments · 1 min read · LW link

On the Practical Applications of Interpretability

Nick Jiang · 15 Oct 2024 17:18 UTC
3 points
0 comments · 7 min read · LW link

Anthropic’s updated Responsible Scaling Policy

Zac Hatfield-Dodds · 15 Oct 2024 16:46 UTC
51 points
3 comments · 3 min read · LW link
(www.anthropic.com)

[Question] When is reward ever the optimization target?

Noosphere89 · 15 Oct 2024 15:09 UTC
35 points
12 comments · 1 min read · LW link