beren

Karma: 2,993

Interested in many things. I have a personal blog at https://www.beren.io/

Addendum: basic facts about language models during training

beren

Mar 6, 2023, 7:24 PM

22 points

2 comments · 5 min read · LW link

Basic facts about language models during training

beren

Feb 21, 2023, 11:46 AM

98 points

15 comments · 18 min read · LW link

Validator models: A simple approach to detecting goodharting

beren

Feb 20, 2023, 9:32 PM

14 points

1 comment · 4 min read · LW link

Empathy as a natural consequence of learnt reward models

beren

Feb 4, 2023, 3:35 PM

48 points

27 comments · 13 min read · LW link

AGI will have learnt utility functions

beren

Jan 25, 2023, 7:42 PM

36 points

4 comments · 13 min read · LW link

Gradient hacking is extremely difficult

beren

Jan 24, 2023, 3:45 PM

164 points

22 comments · 5 min read · LW link

Scaling laws vs individual differences

beren

Jan 10, 2023, 1:22 PM

45 points

21 comments · 7 min read · LW link

Basic Facts about Language Model Internals

beren and Eric Winsor

Jan 4, 2023, 1:01 PM

130 points

19 comments · 9 min read · LW link

An ML interpretation of Shard Theory

beren

Jan 3, 2023, 8:30 PM

39 points

5 comments · 4 min read · LW link

The ultimate limits of alignment will determine the shape of the long term future

beren

Jan 2, 2023, 12:47 PM

34 points

2 comments · 6 min read · LW link

Evidence on recursive self-improvement from current ML

beren

Dec 30, 2022, 8:53 PM

31 points

12 comments · 6 min read · LW link

Human sexuality as an interesting case study of alignment

beren

Dec 30, 2022, 1:37 PM

39 points

26 comments · 3 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey, Dan Braun and beren

Dec 13, 2022, 3:41 PM

150 points

23 comments · 22 min read · LW link · 2 reviews

Deconfusing Direct vs Amortised Optimization

beren

Dec 2, 2022, 11:30 AM

134 points

19 comments · 10 min read · LW link

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren and Sid Black

Nov 28, 2022, 12:54 PM

199 points

33 comments · 31 min read · LW link

Current themes in mechanistic interpretability research

Lee Sharkey, Sid Black and beren

Nov 16, 2022, 2:14 PM

89 points

2 comments · 12 min read · LW link

Interpreting Neural Networks through the Polytope Lens

Sid Black, Lee Sharkey, Connor Leahy, beren, CRG, merizian, Eric Winsor and Dan Braun

Sep 23, 2022, 5:58 PM

144 points

29 comments · 33 min read · LW link