Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout · 19 Nov 2024 18:36 UTC
37 points
3 comments · 1 min read · LW link
(turntrout.com)

Training AI agents to solve hard problems could lead to Scheming

19 Nov 2024 0:10 UTC
58 points
12 comments · 28 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

18 Nov 2024 16:05 UTC
60 points
23 comments · 2 min read · LW link

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data

Sohaib Imran · 16 Nov 2024 23:22 UTC
34 points
5 comments · 14 min read · LW link

Which evals resources would be good?

Marius Hobbhahn · 16 Nov 2024 14:24 UTC
46 points
4 comments · 5 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · 15 Nov 2024 15:47 UTC
54 points
2 comments · 7 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
16 points
0 comments · 9 min read · LW link

AXRP Episode 38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems

DanielFilan · 14 Nov 2024 7:00 UTC
14 points
0 comments · 12 min read · LW link

o1 is a bad idea

abramdemski · 11 Nov 2024 21:20 UTC
152 points
36 comments · 2 min read · LW link

The Evals Gap

Marius Hobbhahn · 11 Nov 2024 16:42 UTC
55 points
7 comments · 7 min read · LW link
(www.apolloresearch.ai)

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization

Sahil · 7 Nov 2024 5:27 UTC
20 points
1 comment · 12 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
62 points
4 comments · 14 min read · LW link

Anthropic: Three Sketches of ASL-4 Safety Case Components

Zach Stein-Perlman · 6 Nov 2024 16:00 UTC
93 points
33 comments · 1 min read · LW link
(alignment.anthropic.com)

SAE Probing: What is it good for? Absolutely something!

1 Nov 2024 19:23 UTC
31 points
0 comments · 11 min read · LW link

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures

Sahil · 1 Nov 2024 17:24 UTC
36 points
2 comments · 35 min read · LW link

Seeking Collaborators

abramdemski · 1 Nov 2024 17:13 UTC
55 points
14 comments · 7 min read · LW link

Complete Feedback

abramdemski · 1 Nov 2024 16:58 UTC
23 points
7 comments · 3 min read · LW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

1 Nov 2024 0:10 UTC
17 points
0 comments · 6 min read · LW link
(far.ai)

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment · 2 min read · LW link

The Compendium, A full argument about extinction risk from AGI

31 Oct 2024 12:01 UTC
188 points
49 comments · 2 min read · LW link
(www.thecompendium.ai)