
Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
143 points
27 comments · 45 min read · LW link

“Open Source AI” is a lie, but it doesn’t have to be

jacobhaimes · 30 Apr 2024 23:10 UTC
17 points
1 comment · 6 min read · LW link
(jacob-haimes.github.io)

Increasing IQ is trivial

George3d6 · 1 Mar 2024 22:43 UTC
37 points
50 comments · 6 min read · LW link
(epistemink.substack.com)

Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
47 points
3 comments · 1 min read · LW link

The Rationalists of the 1950s (and before) also called themselves “Rationalists”

Owain_Evans · 28 Nov 2021 20:17 UTC
187 points
31 comments · 3 min read · LW link · 1 review

“AI Safety for Fleshy Humans”, an AI Safety explainer by Nicky Case

habryka · 3 May 2024 18:10 UTC
48 points
10 comments · 4 min read · LW link
(aisafety.dance)

ACX Covid Origins Post convinced readers

ErnestScribbler · 1 May 2024 13:06 UTC
74 points
7 comments · 2 min read · LW link

My hour of memoryless lucidity

Eric Neyman · 4 May 2024 1:40 UTC
77 points
2 comments · 5 min read · LW link
(ericneyman.wordpress.com)

[Question] How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)?

Justausername · 6 Apr 2024 6:31 UTC
15 points
16 comments · 1 min read · LW link

Key takeaways from our EA and alignment research surveys

3 May 2024 18:10 UTC
66 points
3 comments · 21 min read · LW link

Why I’m doing PauseAI

Joseph Miller · 30 Apr 2024 16:21 UTC
93 points
11 comments · 4 min read · LW link

KAN: Kolmogorov-Arnold Networks

Gunnar_Zarncke · 1 May 2024 16:50 UTC
10 points
11 comments · 1 min read · LW link
(arxiv.org)

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
41 points
1 comment · 8 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
353 points
79 comments · 12 min read · LW link

If you weren’t such an idiot...

2 Mar 2024 0:01 UTC
119 points
60 comments · 2 min read · LW link
(markxu.com)

[Question] Which skincare products are evidence-based?

Vanessa Kosoy · 2 May 2024 15:22 UTC
83 points
24 comments · 1 min read · LW link

LLM+Planners hybridisation for friendly AGI

installgentoo · 3 May 2024 8:40 UTC
6 points
2 comments · 1 min read · LW link

[Question] Were there any ancient rationalists?

OliverHayman · 3 May 2024 18:26 UTC
11 points
3 comments · 1 min read · LW link

Why is AGI/ASI Inevitable?

DeathlessAmaranth · 2 May 2024 18:27 UTC
14 points
6 comments · 1 min read · LW link

A list of core AI safety problems and how I hope to solve them

davidad · 26 Aug 2023 15:12 UTC
161 points
26 comments · 5 min read · LW link