MATS Program

TagLast edit: Dec 30, 2024, 9:26 AM by Dakara

ML Alignment & Theory Scholars (MATS) Program is an educational seminar and independent research program that aims to provide talented scholars with talks, workshops, and research mentorship in the field of AI alignment, and connect them with the Berkeley AI safety research community.

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow and mwatkins

Feb 5, 2023, 10:02 PM

676 points

206 comments12 min readLW link 1 review

SERI MATS Program—Winter 2022 Cohort

Ryan Kidd, Victor Warlop and Christian Smith

Oct 8, 2022, 7:09 PM

72 points

12 comments4 min readLW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

Mar 11, 2023, 6:59 PM

334 points

28 comments23 min readLW link

Project proposal: Testing the IBP definition of agent

Jeremy Gillen, Thomas Larsen and JamesH

Aug 9, 2022, 1:09 AM

21 points

4 comments2 min readLW link

How MATS addresses “mass movement building” concerns

Ryan KiddMay 4, 2023, 12:55 AM

63 points

9 comments3 min readLW link

Soft optimization makes the value target bigger

Jeremy GillenJan 2, 2023, 4:06 PM

119 points

20 comments12 min readLW link

SERI ML Alignment Theory Scholars Program 2022

Ryan Kidd, Victor Warlop and ozhang

Apr 27, 2022, 12:43 AM

67 points

6 comments3 min readLW link

SERI MATS—Summer 2023 Cohort

Aris, Ryan Kidd and Christian Smith

Apr 8, 2023, 3:32 PM

71 points

25 comments4 min readLW link

Finite Factored Sets in Pictures

Magdalena WacheDec 11, 2022, 6:49 PM

183 points

35 comments12 min readLW link

Talk: AI safety fieldbuilding at MATS

Ryan KiddJun 23, 2024, 11:06 PM

26 points

2 comments10 min readLW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett BakerAug 26, 2022, 6:26 PM

120 points

48 comments1 min readLW link

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

Mar 1, 2023, 5:16 AM

105 points

10 comments5 min readLW link

MATS Spring 2024 Extension Retrospective

HenningB, Matthew Wearden, Cameron Holmes and Ryan Kidd

Feb 12, 2025, 10:43 PM

26 points

1 comment15 min readLW link

Modulating sycophancy in an RLHF model via activation steering

Nina PanicksseryAug 9, 2023, 7:06 AM

69 points

20 comments12 min readLW link

Neural Tangent Kernel Distillation

Thomas Larsen and Jeremy Gillen

Oct 5, 2022, 6:11 PM

76 points

20 comments8 min readLW link

My MATS Summer 2023 experience

James ChuaMar 20, 2024, 11:26 AM

29 points

0 comments3 min readLW link

(jameschua.net)

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish MudideJul 22, 2024, 6:45 PM

118 points

20 comments12 min readLW link

Normative vs Descriptive Models of Agency

mattmacdermottFeb 2, 2023, 8:28 PM

26 points

5 comments4 min readLW link

Infra-Bayesian haggling

hannagaborMay 20, 2024, 12:23 PM

28 points

0 comments20 min readLW link

I found >800 orthogonal “write code” steering vectors

Jacob G-W and TurnTrout

Jul 15, 2024, 7:06 PM

103 points

19 comments7 min readLW link

(jacobgw.com)

Self-explaining SAE features

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

Aug 5, 2024, 10:20 PM

61 points

13 comments10 min readLW link

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

Jan 2, 2024, 12:47 AM

125 points

29 comments8 min readLW link

(arxiv.org)

Balancing Security Mindset with Collaborative Research: A Proposal

MadHatterNov 1, 2023, 12:46 AM

9 points

3 comments4 min readLW link

Qualities that alignment mentors value in junior researchers

Orpheus16Feb 14, 2023, 11:27 PM

88 points

14 comments3 min readLW link

The Geometry of Feelings and Nonsense in Large Language Models

7vik and Nandi

Sep 27, 2024, 5:49 PM

61 points

10 comments4 min readLW link

MATS is hiring!

Ryan Kidd and VVN

Apr 8, 2025, 8:45 PM

8 points

0 comments6 min readLW link

Talent Needs of Technical AI Safety Teams

yams, Carson Jones, McKennaFitzgerald and Ryan Kidd

May 24, 2024, 12:36 AM

118 points

65 comments14 min readLW link

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

Apr 30, 2024, 6:51 PM

211 points

43 comments45 min readLW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

Feb 7, 2024, 9:28 PM

89 points

14 comments9 min readLW link

(arxiv.org)

Apply for MATS Winter 2023-24!

utilistrutil, Ryan Kidd and LauraVaughan

Oct 21, 2023, 2:27 AM

104 points

6 comments5 min readLW link

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout

Jun 13, 2025, 1:45 PM

227 points

14 comments8 min readLW link

(arxiv.org)

MATS AI Safety Strategy Curriculum

Ronny Fernandez and Ryan Kidd

Mar 7, 2024, 7:59 PM

74 points

2 comments16 min readLW link

MATS Summer 2023 Retrospective

utilistrutil, Juan Gil, Ryan Kidd, Christian Smith, McKennaFitzgerald and LauraVaughan

Dec 1, 2023, 11:29 PM

77 points

34 comments26 min readLW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

Aug 24, 2024, 12:56 AM

68 points

10 comments20 min readLW link

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

Apr 20, 2023, 10:26 PM

46 points

11 comments10 min readLW link

Apply to MATS 7.0!

Ryan Kidd and K Richards

Sep 21, 2024, 12:23 AM

32 points

0 comments5 min readLW link

Clarifying mesa-optimization

Marius Hobbhahn and Pierre Peigné

Mar 21, 2023, 3:53 PM

38 points

6 comments10 min readLW link

Alignment faking CTFs: Apply to my MATS stream

joshcApr 4, 2025, 4:29 PM

60 points

0 comments4 min readLW link

Broad Basins and Data Compression

Jeremy Gillen, Stephen Fowler and Thomas Larsen

Aug 8, 2022, 8:33 PM

33 points

6 comments7 min readLW link

MATS mentor selection

DanielFilan and Ryan Kidd

Jan 10, 2025, 3:12 AM

44 points

12 comments6 min readLW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

Jul 2, 2024, 1:17 PM

86 points

7 comments12 min readLW link

Game Theory without Argmax [Part 2]

Cleo NardoNov 11, 2023, 4:02 PM

31 points

14 comments13 min readLW link

[ASoT] Policy Trajectory Visualization

Ulisse MiniFeb 7, 2023, 12:13 AM

9 points

2 comments1 min readLW link

MATS Applications + Research Directions I’m Currently Excited About

Neel NandaFeb 6, 2025, 11:03 AM

73 points

7 comments8 min readLW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jacob Dunefsky, Philippe Chlenski, Senthooran Rajamanoharan and Neel Nanda

Jan 14, 2024, 2:06 AM

24 points

0 comments42 min readLW link

Auditing games for high-level interpretability

Paul CologneseNov 1, 2022, 10:44 AM

33 points

1 comment7 min readLW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

NickyP, Peter S. Park and Stephen Fowler

Aug 16, 2022, 2:09 AM

21 points

2 comments16 min readLW link

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Feb 3, 2024, 6:50 AM

78 points

4 comments8 min readLW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin FarrellApr 15, 2024, 6:21 PM

29 points

7 comments12 min readLW link

MATS Alumni Impact Analysis

utilistrutil, Juan Gil, yams, LauraVaughan, K Richards and Ryan Kidd

Sep 30, 2024, 2:35 AM

62 points

7 comments11 min readLW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_WJul 18, 2022, 3:38 AM

15 points

1 comment10 min readLW link

Can We Align a Self-Improving AGI?

Peter S. ParkAug 30, 2022, 12:14 AM

8 points

5 comments11 min readLW link

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Patrick Leask, Bart Bussmann and Neel Nanda

Aug 17, 2024, 1:16 AM

53 points

0 comments5 min readLW link

Stitching SAEs of different sizes

Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges and Neel Nanda

Jul 13, 2024, 5:19 PM

39 points

12 comments12 min readLW link

Race Along Rashomon Ridge

Stephen Fowler, Peter S. Park and MichaelEinhorn

Jul 7, 2022, 3:20 AM

50 points

15 comments8 min readLW link

My Advice for Incoming SERI MATS Scholars

Johannes C. MayerJan 3, 2023, 7:25 PM

58 points

6 comments4 min readLW link

MATS AI Safety Strategy Curriculum v2

DanielFilan and Ryan Kidd

Oct 7, 2024, 10:44 PM

43 points

6 comments13 min readLW link

Uncertainty in all its flavours

Cleo NardoJan 9, 2024, 4:21 PM

34 points

6 comments35 min readLW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica RumbelowNov 17, 2022, 11:06 AM

27 points

2 comments2 min readLW link

Intervening in the Residual Stream

MadHatterFeb 22, 2023, 6:29 AM

30 points

1 comment9 min readLW link

Swap and Scale

Stephen FowlerSep 9, 2022, 10:41 PM

17 points

3 comments1 min readLW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett BakerJul 24, 2022, 12:42 AM

7 points

0 comments10 min readLW link

What sorts of systems can be deceptive?

Andrei AlexandruOct 31, 2022, 10:00 PM

16 points

0 comments7 min readLW link

Conditioning Generative Models for Alignment

JozdienJul 18, 2022, 7:11 AM

60 points

8 comments20 min readLW link

Consequentialists: One-Way Pattern Traps

David UdellJan 16, 2023, 8:48 PM

59 points

3 comments14 min readLW link

More findings on maximal data dimension

Marius HobbhahnFeb 2, 2023, 6:33 PM

27 points

1 comment11 min readLW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout

Dec 6, 2024, 10:19 PM

165 points

12 comments11 min readLW link

(arxiv.org)

Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanSDec 24, 2022, 4:17 AM

28 points

3 comments12 min readLW link

Forecasting Frontier Language Model Agent Capabilities

Govind Pimpale, Axel Højmark, Jérémy Scheurer and Marius Hobbhahn

Feb 24, 2025, 4:51 PM

35 points

0 comments5 min readLW link

(www.apolloresearch.ai)

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Jan 16, 2024, 12:26 AM

85 points

9 comments18 min readLW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Evan Anders and Adrià Garriga-alonso

Aug 23, 2024, 10:03 PM

17 points

0 comments25 min readLW link

Apply to MATS 8.0!

Ryan Kidd and K Richards

Mar 20, 2025, 2:17 AM

63 points

5 comments4 min readLW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

Aug 23, 2024, 6:52 PM

42 points

8 comments16 min readLW link

Takeaways From Our Recent Work on SAE Probing

Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan and Neel Nanda

Mar 3, 2025, 7:50 PM

30 points

0 comments5 min readLW link

[Closed] Agent Foundations track in MATS

Vanessa KosoyOct 31, 2023, 8:12 AM

54 points

1 comment1 min readLW link

(www.matsprogram.org)

More findings on Memorization and double descent

Marius HobbhahnFeb 1, 2023, 6:26 PM

53 points

2 comments19 min readLW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

Jun 13, 2024, 10:04 AM

84 points

10 comments2 min readLW link

(arxiv.org)

[ASoT] Reflectivity in Narrow AI

Ulisse MiniNov 21, 2022, 12:51 AM

6 points

1 comment1 min readLW link

Reward hacking behavior can generalize across tasks

Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

May 28, 2024, 4:33 PM

81 points

5 comments21 min readLW link

Automating LLM Auditing with Developmental Interpretability

htlou and evhub

Sep 4, 2024, 3:50 PM

19 points

0 comments3 min readLW link

[Question] How is ARC planning to use ELK?

jacquesthibsDec 15, 2022, 8:11 PM

24 points

5 comments1 min readLW link

Training goals for large language models

Johannes TreutleinJul 18, 2022, 7:09 AM

28 points

5 comments19 min readLW link

Attention Output SAEs Improve Circuit Analysis

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Jun 21, 2024, 12:56 PM

33 points

3 comments19 min readLW link

Performance guarantees in classical learning theory and infra-Bayesianism

David MatolcsiFeb 28, 2023, 6:37 PM

9 points

4 comments31 min readLW link

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

Nov 19, 2022, 9:04 PM

45 points

0 comments3 min readLW link

[Research log] The board of Alphabet would stop DeepMind to save the world

Lucie PhilipponJul 16, 2024, 4:59 AM

6 points

0 comments4 min readLW link

Why are counterfactuals elusive?

Martín SotoMar 3, 2023, 8:13 PM

14 points

6 comments2 min readLW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire ShortJul 29, 2023, 6:10 AM

27 points

0 comments12 min readLW link

Bounded complexity of solving ELK and its implications

Rubi J. HudsonJul 19, 2022, 6:56 AM

11 points

4 comments18 min readLW link

Results from a survey on tool use and workflows in alignment research

jacquesthibs, Jan, janus and Logan Riggs

Dec 19, 2022, 3:19 PM

79 points

2 comments19 min readLW link

Research agenda: Supervising AIs improving AIs

Quintin Pope, Owen D, Roman Engeler and jacquesthibs

Apr 29, 2023, 5:09 PM

76 points

5 comments19 min readLW link

Modelling Deception

Garrett BakerJul 18, 2022, 9:21 PM

15 points

0 comments7 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tameraAug 3, 2022, 12:03 PM

136 points

23 comments6 min readLW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougallDec 14, 2021, 11:14 PM

39 points

9 comments19 min readLW link

Game Theory without Argmax [Part 1]

Cleo NardoNov 11, 2023, 3:59 PM

70 points

18 comments19 min readLW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park, NickyP and Stephen Fowler

Aug 10, 2022, 6:14 PM

28 points

30 comments11 min readLW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

LawrenceC, Erik Jenner and Leon Lang

Mar 16, 2023, 4:38 PM

48 points

0 comments13 min readLW link

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver SourbutMay 9, 2022, 9:38 PM

70 points

19 comments8 min readLW link 1 review

(www.oliversourbut.net)

A mostly critical review of infra-Bayesianism

David MatolcsiFeb 28, 2023, 6:37 PM

108 points

9 comments29 min readLW link

Sources of evidence in Alignment

Martín SotoJul 2, 2023, 8:38 PM

20 points

0 comments11 min readLW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

wesg and Neel Nanda

May 3, 2023, 1:30 PM

33 points

6 comments2 min readLW link 1 review

(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina PanicksseryJul 28, 2023, 2:46 AM

122 points

18 comments9 min readLW link 1 review

Framing AI Childhoods

David UdellSep 6, 2022, 11:40 PM

37 points

8 comments4 min readLW link

How complex are myopic imitators?

Vivek HebbarFeb 8, 2022, 12:00 PM

26 points

1 comment15 min readLW link

Notes on Learning the Prior

carboniferous_umbraculum Jul 15, 2022, 5:28 PM

25 points

2 comments25 min readLW link

A Bunch of Matryoshka SAEs

chanind, TomasD and Adrià Garriga-alonso

Apr 4, 2025, 2:53 PM

25 points

0 comments8 min readLW link

Searching for a model’s concepts by their shape – a theoretical framework

Kaarel, gekaklam, Walter Laurito , Kay Kozaronek, AlexMennen and June Ku

Feb 23, 2023, 8:14 PM

51 points

0 comments19 min readLW link

Large Language Models will be Great for Censorship

Ethan EdwardsAug 21, 2023, 7:03 PM

185 points

14 comments8 min readLW link

(ethanedwards.substack.com)

[Job Ad] MATS is hiring!

Jana, LauraVaughan, yams, Christian Smith and Ryan Kidd

Oct 9, 2024, 2:17 AM

10 points

0 comments5 min readLW link

Polysemantic Attention Head in a 4-Layer Transformer

Jett Janiak, cmathw and StefanHex

Nov 9, 2023, 4:16 PM

51 points

0 comments6 min readLW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB, Owain_Evans and SoerenMind

Sep 28, 2023, 6:53 PM

187 points

39 comments3 min readLW link 1 review

Interview: Applications w/ Alice Rigg

jacobhaimesDec 19, 2023, 7:03 PM

12 points

0 comments1 min readLW link

(into-ai-safety.github.io)

Red-teaming language models via activation engineering

Nina PanicksseryAug 26, 2023, 5:52 AM

69 points

6 comments9 min readLW link

Trying to find the underlying structure of computational systems

Matthias G. MayerSep 13, 2022, 9:16 PM

18 points

9 comments4 min readLW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibsDec 30, 2022, 2:40 AM

105 points

2 comments18 min readLW link

GPT-2 Sometimes Fails at IOI

Ronak_MehtaAug 14, 2024, 11:24 PM

13 points

0 comments2 min readLW link

(ronakrm.github.io)

Identification of Natural Modularity

Stephen FowlerJun 25, 2022, 3:05 PM

15 points

3 comments7 min readLW link

Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk

Lucie PhilipponJul 25, 2024, 1:12 AM

18 points

7 comments2 min readLW link

How Interpretability can be Impactful

Connall GarrodJul 18, 2022, 12:06 AM

18 points

0 comments37 min readLW link

MATS Winter 2023-24 Retrospective

utilistrutil, LauraVaughan, McKennaFitzgerald, Christian Smith, Juan Gil, Henry Sleight, Matthew Wearden and Ryan Kidd

May 11, 2024, 12:09 AM

86 points

28 comments49 min readLW link

Why I’m Working On Model Agnostic Interpretability

Jessica RumbelowNov 11, 2022, 9:24 AM

27 points

9 comments2 min readLW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

HoagyJul 17, 2023, 1:41 AM

57 points

1 comment7 min readLW link

Translating between Latent Spaces

JamesH, Jeremy Gillen and NickyP

Jul 30, 2022, 3:25 AM

27 points

2 comments8 min readLW link

A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

MadHatterFeb 26, 2023, 1:10 AM

61 points

14 comments6 min readLW link

Post-hoc reasoning in chain of thought

Kyle CoxFeb 5, 2025, 6:58 PM

17 points

0 comments11 min readLW link

Intricacies of Feature Geometry in Large Language Models

7vik, Lucius Bushnaq and Nandi

Dec 7, 2024, 6:10 PM

71 points

0 comments12 min readLW link

End-to-end hacking with language models

tchauvinApr 5, 2024, 3:06 PM

29 points

0 comments8 min readLW link

The shallow reality of ‘deep learning theory’

Jesse HooglandFeb 22, 2023, 4:16 AM

34 points

11 comments3 min readLW link

(www.jessehoogland.com)

Inner Alignment via Superpowers

JamesH, Thomas Larsen and Jeremy Gillen

Aug 30, 2022, 8:01 PM

37 points

13 comments4 min readLW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola AyonrindeOct 30, 2024, 10:50 PM

27 points

0 comments12 min readLW link

Some Notes on the mathematics of Toy Autoencoding Problems

carboniferous_umbraculum Dec 22, 2022, 5:21 PM

18 points

1 comment12 min readLW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. HudsonJul 19, 2022, 6:57 AM

19 points

8 comments16 min readLW link

Some Summaries of Agent Foundations Work

mattmacdermottMay 15, 2023, 4:09 PM

62 points

1 comment13 min readLW link

Decomposing independent generalizations in neural networks via Hessian analysis

Dmitry Vaintrob and Nina Panickssery

Aug 14, 2023, 5:04 PM

84 points

4 comments1 min readLW link

Finding Skeletons on Rashomon Ridge

David Udell, Peter S. Park and NickyP

Jul 24, 2022, 10:31 PM

30 points

2 comments7 min readLW link

Guardian AI (Misaligned systems are all around us.)

Jessica RumbelowNov 25, 2022, 3:55 PM

15 points

6 comments2 min readLW link

Natural Abstractions: Key Claims, Theorems, and Critiques

LawrenceC, Leon Lang and Erik Jenner

Mar 16, 2023, 4:37 PM

241 points

26 comments45 min readLW link 3 reviews

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

Dmitry Vaintrob and Nina Panickssery

Aug 23, 2023, 9:12 PM

82 points

1 comment13 min readLW link

MATS Models

johnswentworthJul 9, 2022, 12:14 AM

95 points

5 comments16 min readLW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser and Brendan Murphy

Nov 7, 2024, 3:39 PM

51 points

7 comments11 min readLW link

Feature Hedging: Another way correlated features break SAEs

chanind, TomasD and Adrià Garriga-alonso

Mar 25, 2025, 2:33 PM

22 points

0 comments18 min readLW link

My SERI MATS Application

Daniel PalekaMay 30, 2022, 2:04 AM

16 points

0 comments8 min readLW link

On Interpretability’s Robustness

WCargoOct 18, 2023, 1:18 PM

11 points

0 comments4 min readLW link

Empirical risk minimization is fundamentally confused

Jesse HooglandMar 22, 2023, 4:58 PM

32 points

8 comments1 min readLW link

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum Feb 19, 2023, 10:08 PM

22 points

1 comment19 min readLW link

Is the “Valley of Confused Abstractions” real?

jacquesthibsDec 5, 2022, 1:36 PM

20 points

11 comments2 min readLW link

When fine-tuning fails to elicit GPT-3.5′s chess abilities

Theodore ChapmanJun 14, 2024, 6:50 PM

42 points

3 comments9 min readLW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon LangJan 16, 2023, 10:46 PM

31 points

7 comments17 min readLW link

(docs.google.com)

SAE Probing: What is it good for?

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan and Neel Nanda

Nov 1, 2024, 7:23 PM

33 points

0 comments11 min readLW link

How important is AI hacking as LLMs advance?

Artyom KarpovJan 29, 2024, 6:41 PM

1 point

0 comments6 min readLW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

gekaklam, Walter Laurito , Kaarel and Kay Kozaronek

Jan 25, 2023, 7:03 PM

48 points

6 comments12 min readLW link

Do models know when they are being evaluated?

Govind Pimpale, Giles, Joe Needham and Marius Hobbhahn

Feb 17, 2025, 11:13 PM

59 points

6 comments12 min readLW link

Getting up to Speed on the Speed Prior in 2022

robertzkDec 28, 2022, 7:49 AM

36 points

5 comments65 min readLW link

Revealing alignment faking with a single prompt

Florian_DietzJan 29, 2025, 9:01 PM

9 points

5 comments4 min readLW link

Boomerang—protocol to dissolve some commitment races

Filip SondejMay 30, 2023, 4:21 PM

37 points

10 comments8 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon LangJan 13, 2023, 4:23 AM

86 points

6 comments18 min readLW link

Among Us: A Sandbox for Agentic Deception

7vik and Adrià Garriga-alonso

Apr 5, 2025, 6:24 AM

110 points

7 comments7 min readLW link

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

Jul 20, 2024, 2:20 AM

61 points

0 comments4 min readLW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre PeignéSep 23, 2023, 4:21 PM

30 points

8 comments5 min readLW link

Understanding SAE Features with the Logit Lens

Joseph Bloom and Johnny Lin

Mar 11, 2024, 12:16 AM

68 points

0 comments14 min readLW link

Can We Change the Goals of a Toy RL Agent?

tuphs and Adrià Garriga-alonso

Jun 15, 2025, 8:34 PM

9 points

0 comments9 min readLW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov and Neel Nanda

Aug 29, 2023, 1:04 AM

77 points

4 comments1 min readLW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei AlexandruJul 17, 2022, 8:31 PM

24 points

0 comments10 min readLW link

Team Shard Status Report

David UdellAug 9, 2022, 5:33 AM

38 points

8 comments3 min readLW link

Gradient surfing: the hidden role of regularization

Jesse HooglandFeb 6, 2023, 3:50 AM

37 points

9 comments14 min readLW link

(www.jessehoogland.com)

Finding Goals in the World Model

Jeremy Gillen, JamesH and Thomas Larsen

Aug 22, 2022, 6:06 PM

59 points

8 comments13 min readLW link

My experience applying to MATS 6.0

micJul 18, 2024, 7:02 PM

17 points

3 comments5 min readLW link

Some real examples of gradient hacking

Oliver SourbutNov 22, 2021, 12:11 AM

15 points

8 comments2 min readLW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Diego Caples, Jatin Nainani, CallumMcDougall and rrenaud

Jan 10, 2025, 11:08 AM

86 points

11 comments17 min readLW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

Jul 18, 2024, 6:19 PM

40 points

4 comments11 min readLW link

Ophiology (or, how the Mamba architecture works)

Danielle Ensign, SrGonao and Adrià Garriga-alonso

Apr 9, 2024, 7:31 PM

67 points

8 comments10 min readLW link

SolidGoldMagikarp II: technical details and more recent findings

mwatkins and Jessica Rumbelow

Feb 6, 2023, 7:09 PM

113 points

45 comments13 min readLW link

Using PICT against PastaGPT Jailbreaking

Quentin FEUILLADE--MONTIXIFeb 9, 2023, 4:30 AM

26 points

0 comments9 min readLW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_DietzMar 10, 2025, 4:07 PM

37 points

4 comments9 min readLW link

Implementing activation steering

AnnahFeb 5, 2024, 5:51 PM

76 points

8 comments7 min readLW link

Infra-Bayesian Logic

harfe and Yegreg

Jul 5, 2023, 7:16 PM

15 points

2 comments1 min readLW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofryApr 29, 2024, 8:57 PM

94 points

9 comments11 min readLW link

Fixed points in mortal population games

ViktoriaMalyasovaMar 14, 2023, 7:10 AM

31 points

0 comments12 min readLW link

(www.lesswrong.com)

Tips On Empirical Research Slides

James Chua, John Hughes, Ethan Perez and Owain_Evans

Jan 8, 2025, 5:06 AM

91 points

4 comments6 min readLW link

Proper scoring rules don’t guarantee predicting fixed points

Johannes Treutlein, Rubi J. Hudson and Caspar Oesterheld

Dec 16, 2022, 6:22 PM

79 points

8 comments21 min readLW link

Domain-specific SAEs

jacob_droriOct 7, 2024, 8:15 PM

28 points

2 comments5 min readLW link

Classifying representations of sparse autoencoders (SAEs)

AnnahNov 17, 2023, 1:54 PM

15 points

6 comments2 min readLW link

Shard Theory: An Overview

David UdellAug 11, 2022, 5:44 AM

167 points

34 comments10 min readLW link

Neural networks generalize because of this one weird trick

Jesse HooglandJan 18, 2023, 12:10 AM

183 points

34 comments15 min readLW link 1 review

(www.jessehoogland.com)

The Core of the Alignment Problem is...

Thomas Larsen, Jeremy Gillen and JamesH

Aug 17, 2022, 8:07 PM

76 points

10 comments9 min readLW link

Searching for Modularity in Large Language Models

NickyP and Stephen Fowler

Sep 8, 2022, 2:25 AM

44 points

3 comments14 min readLW link

Understanding and visualizing sycophancy datasets

Nina PanicksseryAug 16, 2023, 5:34 AM

47 points

0 comments6 min readLW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

Mar 6, 2024, 5:03 AM

63 points

0 comments12 min readLW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver SourbutDec 16, 2021, 1:07 AM

16 points

0 comments42 min readLW link

The slingshot helps with learning

Wilson WuOct 31, 2024, 11:18 PM

33 points

0 comments8 min readLW link

How (not) to choose a research project

Garrett Baker, CatGoddess and Johannes C. Mayer

Aug 9, 2022, 12:26 AM

79 points

11 comments7 min readLW link

Behaviour Manifolds and the Hessian of the Total Loss—Notes and Criticism

carboniferous_umbraculum Sep 3, 2022, 12:15 AM

35 points

5 comments6 min readLW link

Approximation is expensive, but the lunch is cheap

Jesse Hoogland and Zach Furman

Apr 19, 2023, 2:19 PM

70 points

3 comments16 min readLW link

Ambiguous out-of-distribution generalization on an algorithmic task

Wilson Wu and Louis Jaburi

Feb 13, 2025, 6:24 PM

83 points

6 comments11 min readLW link

Quantitative cruxes in Alignment

Martín SotoJul 2, 2023, 8:38 PM

19 points

0 comments23 min readLW link

Stop-gradients lead to fixed point predictions

Johannes Treutlein, Caspar Oesterheld, Rubi J. Hudson and Emery Cooper

Jan 28, 2023, 10:47 PM

37 points

2 comments24 min readLW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer

Jul 22, 2024, 4:17 PM

69 points

0 comments16 min readLW link

[Short version] Information Loss --> Basin flatness

Vivek HebbarMay 21, 2022, 12:59 PM

12 points

0 comments1 min readLW link

A brief note on Simplicity Bias

carboniferous_umbraculum Aug 14, 2022, 2:05 AM

20 points

0 comments4 min readLW link

Invulnerable Incomplete Preferences: A Formal Statement

SCPAug 30, 2023, 9:59 PM

136 points

39 comments35 min readLW link

Activation adding experiments with FLAN-T5

Nina PanicksseryJul 13, 2023, 11:32 PM

21 points

5 comments7 min readLW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Can, Yeu-Tong Lau, James Dao and Jett Janiak

Aug 30, 2023, 5:36 PM

17 points

0 comments8 min readLW link

(arxiv.org)

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski and Neel Nanda

Apr 30, 2024, 5:58 PM

74 points

14 comments17 min readLW link

Edge Cases in AI Alignment

Florian_DietzMar 24, 2025, 9:27 AM

19 points

3 comments4 min readLW link

Understanding and controlling auto-induced distributional shift

L Rudolf LDec 13, 2021, 2:59 PM

33 points

4 comments16 min readLW link

Language Models Model Us

eggsyntaxMay 17, 2024, 9:00 PM

159 points

55 comments7 min readLW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah and shash42

Sep 25, 2023, 5:19 PM

25 points

3 comments7 min readLW link

Activation adding experiments with llama-7b

Nina PanicksseryJul 16, 2023, 4:17 AM

51 points

1 comment3 min readLW link

Early Experiments in Human Auditing for AI Control

Joey Yudelson and Buck

Jan 23, 2025, 1:34 AM

27 points

0 comments7 min readLW link

The Alignment Problems

Martín SotoJan 12, 2023, 10:29 PM

20 points

0 comments4 min readLW link

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett Janiak

Feb 20, 2023, 7:35 PM

96 points

8 comments21 min readLW link

Information Loss --> Basin flatness

Vivek HebbarMay 21, 2022, 12:58 PM

62 points

31 comments7 min readLW link

Decoding intermediate activations in llama-2-7b

Nina PanicksseryJul 21, 2023, 5:35 AM

39 points

3 comments4 min readLW link

Working towards AI alignment is better

Johannes C. MayerDec 9, 2022, 3:39 PM

8 points

2 comments2 min readLW link

The Shard Theory Alignment Scheme

David UdellAug 25, 2022, 4:52 AM

47 points

32 comments2 min readLW link

Theoretical Neuroscience For Alignment Theory

Cameron BergDec 7, 2021, 9:50 PM

66 points

18 comments23 min readLW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. HudsonAug 17, 2022, 3:56 AM

6 points

2 comments4 min readLW link

How transparency changed over time

ViktoriaMalyasovaJul 30, 2022, 4:36 AM

21 points

0 comments6 min readLW link

An open letter to SERI MATS program organisers

Roman LeventovApr 20, 2023, 4:34 PM

26 points

26 comments4 min readLW link

Spooky action at a distance in the loss landscape

Jesse Hoogland and Filip Sondej

Jan 28, 2023, 12:22 AM

61 points

4 comments7 min readLW link

(www.jessehoogland.com)

Deception?! I ain’t got time for that!

Paul CologneseJul 18, 2022, 12:06 AM

55 points

5 comments13 min readLW link

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev and Marius Hobbhahn

Oct 15, 2024, 6:25 PM

30 points

0 comments18 min readLW link

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibsDec 5, 2022, 4:09 PM

28 points

6 comments8 min readLW link

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

CallumMcDougall, Arthur Conmy, Cody Rushing, Tom McGrath and Neel Nanda

Oct 13, 2023, 6:32 PM

82 points

4 comments8 min readLW link

Bridging the VLM and mech interp communities for multimodal interpretability

Sonia JosephOct 28, 2024, 2:41 PM

19 points

5 comments15 min readLW link

Non-Unitary Quantum Logic—SERI MATS Research Sprint

YegregFeb 16, 2023, 7:31 PM

27 points

0 comments7 min readLW link

No comments.