Jailbreaking language models with user roleplay

loops

28 Sep 2024 23:43 UTC

8 points

0 comments · 3 min read · LW link

(iter.ca)

“Slow” takeoff is a terrible term for “maybe even faster takeoff, actually”

Raemon

28 Sep 2024 23:38 UTC

214 points

69 comments · 1 min read · LW link

Contextual Constitutional AI

aksh-n

28 Sep 2024 23:24 UTC

12 points

2 comments · 12 min read · LW link

Explore More: A Bag of Tricks to Keep Your Life on the Rails

Shoshannah Tekofsky

28 Sep 2024 21:38 UTC

232 points

15 comments · 11 min read · LW link

(shoshanigans.substack.com)

2024 Petrov Day Retrospective

Ben Pace and Raemon

28 Sep 2024 21:30 UTC

93 points

25 comments · 10 min read · LW link

[Question] Any Trump Supporters Want to Dialogue?

k64

28 Sep 2024 19:41 UTC

14 points

80 comments · 1 min read · LW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga

28 Sep 2024 19:02 UTC

2 points

2 comments · 6 min read · LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga

28 Sep 2024 18:29 UTC

8 points

0 comments · 9 min read · LW link

COT Scaling implies slower takeoff speeds

Logan Zoellner

28 Sep 2024 16:20 UTC

37 points

56 comments · 1 min read · LW link

Thoughts on Evo-Bio Math and Mesa-Optimization: Maybe We Need To Think Harder About “Relative” Fitness?

Lorec

28 Sep 2024 14:07 UTC

6 points

6 comments · 1 min read · LW link

Steering LLMs’ Behavior with Concept Activation Vectors

Ruixuan Huang

28 Sep 2024 9:53 UTC

8 points

0 comments · 10 min read · LW link

An Interactive Shapley Value Explainer

James Stephen Brown

28 Sep 2024 5:01 UTC

42 points

9 comments · 1 min read · LW link

(nonzerosum.games)

[Question] Implications of China’s recession on AGI development?

Eric Neyman

28 Sep 2024 1:12 UTC

40 points

3 comments · 1 min read · LW link

The Compute Conundrum: AI Governance in a Shifting Geopolitical Era

octavo

28 Sep 2024 1:05 UTC

−3 points

1 comment · 17 min read · LW link

‘Chat with impactful research & evaluations’ (Unjournal NotebookLMs)

david reinstein

28 Sep 2024 0:32 UTC

6 points

0 comments · 2 min read · LW link

Eye contact is effortless when you’re no longer emotionally blocked on it

Chipmonk

27 Sep 2024 21:47 UTC

37 points

24 comments · 4 min read · LW link

Where is the Learn Everything System?

Shoshannah Tekofsky

27 Sep 2024 21:30 UTC

15 points

8 comments · 4 min read · LW link

(thinkfeelplay.substack.com)

An “Observatory” For a Shy Super AI?

Sherrinford

27 Sep 2024 21:22 UTC

5 points

0 comments · 1 min read · LW link

(robreid.substack.com)

[Question] Searching for Impossibility Results or No-Go Theorems for provable safety.

Maelstrom

27 Sep 2024 20:12 UTC

2 points

1 comment · 1 min read · LW link

What is Randomness?

martinkunev

27 Sep 2024 17:49 UTC

11 points

2 comments · 10 min read · LW link

The Geometry of Feelings and Nonsense in Large Language Models

7vik and Nandi

27 Sep 2024 17:49 UTC

58 points

10 comments · 4 min read · LW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman

27 Sep 2024 17:49 UTC

6 points

2 comments · 9 min read · LW link

[Question] Why is o1 so deceptive?

abramdemski

27 Sep 2024 17:27 UTC

177 points

24 comments · 3 min read · LW link

The Offense-Defense Balance of Gene Drives

Maxwell Tabarrok

27 Sep 2024 16:47 UTC

23 points

1 comment · 4 min read · LW link

(www.maximum-progress.com)

Book Review: On the Edge: The Future

Zvi

27 Sep 2024 14:00 UTC

62 points

1 comment · 49 min read · LW link

(thezvi.wordpress.com)

[Question] Is cybercrime really costing trillions per year?

Fabien Roger

27 Sep 2024 8:44 UTC

63 points

28 comments · 1 min read · LW link

Australian AI Safety Forum 2024

Liam Carroll and Daniel Murfet

27 Sep 2024 0:40 UTC

42 points

0 comments · 2 min read · LW link

Gell-Mann checks

Cleo Scrolls

26 Sep 2024 22:45 UTC

20 points

7 comments · 3 min read · LW link

[Question] Doing Nothing Utility Function

k64

26 Sep 2024 22:05 UTC

9 points

9 comments · 1 min read · LW link

Stanislav Petrov Quarterly Performance Review

Ricki Heicklen

26 Sep 2024 21:20 UTC

145 points

3 comments · 5 min read · LW link

(bayesshammai.substack.com)

Self location for LLMs by LLMs: Self-Assessment Checklist.

weightt an

26 Sep 2024 19:57 UTC

11 points

0 comments · 5 min read · LW link

Four Levels of Voting Methods

hive

26 Sep 2024 18:15 UTC

17 points

3 comments · 9 min read · LW link

(hiveism.substack.com)

Characterizing stable regions in the residual stream of LLMs

Jett Janiak, jacek, Chatrik, Giorgi Giglemiani, nlpet and StefanHex

26 Sep 2024 13:44 UTC

38 points

4 comments · 1 min read · LW link

(arxiv.org)

Chevy Bolt Review

jefftk

26 Sep 2024 13:40 UTC

13 points

2 comments · 1 min read · LW link

(www.jefftk.com)

AI #83: The Mask Comes Off

Zvi

26 Sep 2024 12:00 UTC

82 points

20 comments · 36 min read · LW link

(thezvi.wordpress.com)

The Existential Dread of Being a Powerful AI System

testingthewaters

26 Sep 2024 10:56 UTC

6 points

1 comment · 2 min read · LW link

[Question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?

ChristianKl

26 Sep 2024 9:17 UTC

27 points

21 comments · 1 min read · LW link

[Completed] The 2024 Petrov Day Scenario

Ben Pace and Raemon

26 Sep 2024 8:08 UTC

136 points

114 comments · 5 min read · LW link

Source Control for Prototyping and Analysis

jefftk

26 Sep 2024 1:50 UTC

12 points

0 comments · 1 min read · LW link

(www.jefftk.com)

[Linkpost] Play with SAEs on Llama 3

Tom McGrath, Eric Ho and Dan Balsam

25 Sep 2024 22:35 UTC

40 points

2 comments · 1 min read · LW link

Mira Murati leaves OpenAI/ OpenAI to remove non-profit control

Sodium

25 Sep 2024 21:15 UTC

58 points

4 comments · 2 min read · LW link

Comparing Forecasting Track Records for AI Benchmarking and Beyond

ChristianWilliams

25 Sep 2024 21:01 UTC

11 points

0 comments · 1 min read · LW link

(www.metaculus.com)

Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility

OwenChen

25 Sep 2024 20:38 UTC

3 points

0 comments · 4 min read · LW link

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

Giorgi Giglemiani, nlpet, Chatrik, Jett Janiak and StefanHex

25 Sep 2024 20:37 UTC

27 points

0 comments · 3 min read · LW link

(arxiv.org)

Climate Change And Global Warming

Zero Contradictions

25 Sep 2024 19:13 UTC

−7 points

0 comments · 1 min read · LW link

(zerocontradictions.net)

How to prevent collusion when using untrusted models to monitor each other

Buck

25 Sep 2024 18:58 UTC

81 points

8 comments · 22 min read · LW link

Alignment by default: the simulation hypothesis

gb

25 Sep 2024 16:26 UTC

21 points

39 comments · 1 min read · LW link

A Dialogue on Deceptive Alignment Risks

Rauno Arike

25 Sep 2024 16:10 UTC

11 points

0 comments · 18 min read · LW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

25 Sep 2024 14:52 UTC

30 points

2 comments · 4 min read · LW link

(arxiv.org)

AIS Hungary Operations Officer role, Deadline: 2024 October 6th

gergogaspar

25 Sep 2024 13:54 UTC

1 point

0 comments · 1 min read · LW link