Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing

jacquesallen 31 Jan 2025 23:00 UTC
3 points
0 comments 8 min read LW link

The Failed Strategy of Artificial Intelligence Doomers

Ben Pace 31 Jan 2025 18:56 UTC
76 points
21 comments 4 min read LW link
(www.palladiummag.com)

Safe Search is off: root causes of AI catastrophic risks

Jemal Young 31 Jan 2025 18:22 UTC
2 points
0 comments 3 min read LW link

5,000 calories of peanut butter every week for 3 years straight

Declan Molony 31 Jan 2025 17:29 UTC
8 points
4 comments 1 min read LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

31 Jan 2025 16:49 UTC
127 points
7 comments 12 min read LW link

Some articles in “International Security” that I enjoyed

Buck 31 Jan 2025 16:23 UTC
32 points
1 comment 4 min read LW link

[Question] How do biological or spiking neural networks learn?

Dom Polsinelli 31 Jan 2025 16:03 UTC
1 point
0 comments 2 min read LW link

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

31 Jan 2025 15:36 UTC
18 points
1 comment 2 min read LW link

Catastrophe through Chaos

Marius Hobbhahn 31 Jan 2025 14:19 UTC
80 points
6 comments 12 min read LW link

Review: The Lathe of Heaven

dr_s 31 Jan 2025 8:10 UTC
21 points
0 comments 8 min read LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud 31 Jan 2025 7:13 UTC
12 points
0 comments 2 min read LW link

Takeaways from sketching a control safety case

joshc 31 Jan 2025 4:43 UTC
28 points
0 comments 3 min read LW link
(redwoodresearch.substack.com)

Steering Gemini with BiDPO

TurnTrout 31 Jan 2025 2:37 UTC
79 points
3 comments 1 min read LW link
(turntrout.com)

In response to critiques of Guaranteed Safe AI

Nora_Ammann 31 Jan 2025 1:43 UTC
33 points
2 comments 26 min read LW link

Proposal for a Form of Conditional Supplemental Income (CSI) in a Post-Work World

sweenesm 31 Jan 2025 1:00 UTC
2 points
0 comments 3 min read LW link

Outlaw Code

scarcegreengrass 30 Jan 2025 23:41 UTC
6 points
0 comments 2 min read LW link

Can someone, anyone, make superintelligence a more concrete concept?

Ori Nagel 30 Jan 2025 23:25 UTC
2 points
4 comments 4 min read LW link

What’s Behind the SynBio Bust?

sarahconstantin 30 Jan 2025 22:30 UTC
46 points
5 comments 6 min read LW link
(sarahconstantin.substack.com)

The future of humanity is in management

jasoncrawford 30 Jan 2025 22:14 UTC
1 point
2 comments 13 min read LW link
(newsletter.rootsofprogress.org)

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
50 points
0 comments 5 min read LW link