Jailbreaking language models with user roleplay

loops
28 Sep 2024 23:43 UTC
8 points
0 comments
3 min read
LW link
(iter.ca)

“Slow” takeoff is a terrible term for “maybe even faster takeoff, actually”

Raemon
28 Sep 2024 23:38 UTC
214 points
69 comments
1 min read
LW link

Contextual Constitutional AI

aksh-n
28 Sep 2024 23:24 UTC
12 points
2 comments
12 min read
LW link

Explore More: A Bag of Tricks to Keep Your Life on the Rails

Shoshannah Tekofsky
28 Sep 2024 21:38 UTC
232 points
15 comments
11 min read
LW link
(shoshanigans.substack.com)

2024 Petrov Day Retrospective

28 Sep 2024 21:30 UTC
93 points
25 comments
10 min read
LW link

[Question] Any Trump Supporters Want to Dialogue?

k64
28 Sep 2024 19:41 UTC
14 points
80 comments
1 min read
LW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga
28 Sep 2024 19:02 UTC
2 points
2 comments
6 min read
LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga
28 Sep 2024 18:29 UTC
8 points
0 comments
9 min read
LW link

COT Scaling implies slower takeoff speeds

Logan Zoellner
28 Sep 2024 16:20 UTC
37 points
56 comments
1 min read
LW link

Thoughts on Evo-Bio Math and Mesa-Optimization: Maybe We Need To Think Harder About “Relative” Fitness?

Lorec
28 Sep 2024 14:07 UTC
6 points
6 comments
1 min read
LW link

Steering LLMs’ Behavior with Concept Activation Vectors

Ruixuan Huang
28 Sep 2024 9:53 UTC
8 points
0 comments
10 min read
LW link

An Interactive Shapley Value Explainer

James Stephen Brown
28 Sep 2024 5:01 UTC
42 points
9 comments
1 min read
LW link
(nonzerosum.games)

[Question] Implications of China’s recession on AGI development?

Eric Neyman
28 Sep 2024 1:12 UTC
40 points
3 comments
1 min read
LW link

The Compute Conundrum: AI Governance in a Shifting Geopolitical Era

octavo
28 Sep 2024 1:05 UTC
−3 points
1 comment
17 min read
LW link

‘Chat with impactful research & evaluations’ (Unjournal NotebookLMs)

david reinstein
28 Sep 2024 0:32 UTC
6 points
0 comments
2 min read
LW link

Eye contact is effortless when you’re no longer emotionally blocked on it

Chipmonk
27 Sep 2024 21:47 UTC
37 points
24 comments
4 min read
LW link

Where is the Learn Everything System?

Shoshannah Tekofsky
27 Sep 2024 21:30 UTC
15 points
8 comments
4 min read
LW link
(thinkfeelplay.substack.com)

An “Observatory” For a Shy Super AI?

Sherrinford
27 Sep 2024 21:22 UTC
5 points
0 comments
1 min read
LW link
(robreid.substack.com)

[Question] Searching for Impossibility Results or No-Go Theorems for provable safety.

Maelstrom
27 Sep 2024 20:12 UTC
2 points
1 comment
1 min read
LW link

What is Randomness?

martinkunev
27 Sep 2024 17:49 UTC
11 points
2 comments
10 min read
LW link

The Geometry of Feelings and Nonsense in Large Language Models

27 Sep 2024 17:49 UTC
58 points
10 comments
4 min read
LW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman
27 Sep 2024 17:49 UTC
6 points
2 comments
9 min read
LW link

[Question] Why is o1 so deceptive?

abramdemski
27 Sep 2024 17:27 UTC
177 points
24 comments
3 min read
LW link

The Offense-Defense Balance of Gene Drives

Maxwell Tabarrok
27 Sep 2024 16:47 UTC
23 points
1 comment
4 min read
LW link
(www.maximum-progress.com)

Book Review: On the Edge: The Future

Zvi
27 Sep 2024 14:00 UTC
62 points
1 comment
49 min read
LW link
(thezvi.wordpress.com)

[Question] Is cybercrime really costing trillions per year?

Fabien Roger
27 Sep 2024 8:44 UTC
63 points
28 comments
1 min read
LW link

Australian AI Safety Forum 2024

27 Sep 2024 0:40 UTC
42 points
0 comments
2 min read
LW link

Gell-Mann checks

Cleo Scrolls
26 Sep 2024 22:45 UTC
20 points
7 comments
3 min read
LW link

[Question] Doing Nothing Utility Function

k64
26 Sep 2024 22:05 UTC
9 points
9 comments
1 min read
LW link

Stanislav Petrov Quarterly Performance Review

Ricki Heicklen
26 Sep 2024 21:20 UTC
145 points
3 comments
5 min read
LW link
(bayesshammai.substack.com)

Self location for LLMs by LLMs: Self-Assessment Checklist.

weightt an
26 Sep 2024 19:57 UTC
11 points
0 comments
5 min read
LW link

Four Levels of Voting Methods

hive
26 Sep 2024 18:15 UTC
17 points
3 comments
9 min read
LW link
(hiveism.substack.com)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
38 points
4 comments
1 min read
LW link
(arxiv.org)

Chevy Bolt Review

jefftk
26 Sep 2024 13:40 UTC
13 points
2 comments
1 min read
LW link
(www.jefftk.com)

AI #83: The Mask Comes Off

Zvi
26 Sep 2024 12:00 UTC
82 points
20 comments
36 min read
LW link
(thezvi.wordpress.com)

The Existential Dread of Being a Powerful AI System

testingthewaters
26 Sep 2024 10:56 UTC
6 points
1 comment
2 min read
LW link

[Question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?

ChristianKl
26 Sep 2024 9:17 UTC
27 points
21 comments
1 min read
LW link

[Completed] The 2024 Petrov Day Scenario

26 Sep 2024 8:08 UTC
136 points
114 comments
5 min read
LW link

Source Control for Prototyping and Analysis

jefftk
26 Sep 2024 1:50 UTC
12 points
0 comments
1 min read
LW link
(www.jefftk.com)

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments
1 min read
LW link

Mira Murati leaves OpenAI/ OpenAI to remove non-profit control

Sodium
25 Sep 2024 21:15 UTC
58 points
4 comments
2 min read
LW link

Comparing Forecasting Track Records for AI Benchmarking and Beyond

ChristianWilliams
25 Sep 2024 21:01 UTC
11 points
0 comments
1 min read
LW link
(www.metaculus.com)

Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility

OwenChen
25 Sep 2024 20:38 UTC
3 points
0 comments
4 min read
LW link

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
27 points
0 comments
3 min read
LW link
(arxiv.org)

Climate Change And Global Warming

Zero Contradictions
25 Sep 2024 19:13 UTC
−7 points
0 comments
1 min read
LW link
(zerocontradictions.net)

How to prevent collusion when using untrusted models to monitor each other

Buck
25 Sep 2024 18:58 UTC
81 points
8 comments
22 min read
LW link

Alignment by default: the simulation hypothesis

gb
25 Sep 2024 16:26 UTC
21 points
39 comments
1 min read
LW link

A Dialogue on Deceptive Alignment Risks

Rauno Arike
25 Sep 2024 16:10 UTC
11 points
0 comments
18 min read
LW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

25 Sep 2024 14:52 UTC
30 points
2 comments
4 min read
LW link
(arxiv.org)

AIS Hungary Operations Officer role, Deadline: 2024 October 6th

gergogaspar
25 Sep 2024 13:54 UTC
1 point
0 comments
1 min read
LW link