Activation Engineering

Tag

Last edit: Aug 29, 2023, 3:05 AM by David Udell

Activation engineering is the direct manipulation of activation vectors inside a trained machine learning model. It is a potential way to steer a model’s behavior at inference time without changing the model’s weights.

Activation engineering contrasts with two other strategies for steering models: fine-tuning, which changes a model’s weights to produce the desired behavior, and prompt crafting, which searches for inputs that elicit a particular response.
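
As a minimal sketch of what manipulating activations can look like, the toy Python below adds a steering vector to the hidden layer of a tiny hand-written network. Everything in it is an illustrative assumption rather than code from any post under this tag: the weights `W1` and `W2`, the helper names `forward` and `contrast_vector`, and the choice to build the steering vector as the difference between hidden activations on two contrasting inputs. The weights are never modified; only the intermediate activations change.

```python
# Toy illustration only: a hand-written two-layer network, not a real language model.

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in vector]

# Hypothetical fixed ("trained") weights. Activation engineering leaves these untouched.
W1 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]        # input (2 dims) -> hidden (3 dims)
W2 = [[1.0, -1.0, 0.5]]  # hidden (3 dims) -> output (1 dim)

def hidden_activations(x):
    """Hidden-layer activations for input x."""
    return relu(matvec(W1, x))

def forward(x, steering_vector=None, coefficient=1.0):
    """Run the network, optionally adding a scaled steering vector to the hidden layer."""
    hidden = hidden_activations(x)
    if steering_vector is not None:
        hidden = [h + coefficient * s for h, s in zip(hidden, steering_vector)]
    return matvec(W2, hidden)[0]

def contrast_vector(x_pos, x_neg):
    """Assumed recipe: difference of hidden activations on two contrasting inputs."""
    h_pos = hidden_activations(x_pos)
    h_neg = hidden_activations(x_neg)
    return [p - n for p, n in zip(h_pos, h_neg)]

x = [1.0, 2.0]
baseline = forward(x)                                # no intervention
steer = contrast_vector([1.0, 0.0], [0.0, 1.0])      # direction to push along
steered = forward(x, steer, coefficient=2.0)         # same input, edited activations

print(baseline, steered)
```

In this toy, the same input yields 0.5 without steering and 4.5 with it, while `W1` and `W2` stay fixed. With real models the same idea is applied to much larger activation vectors, usually at one chosen layer.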

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

May 13, 2023, 6:42 PM

437 points

98 comments · 50 min read · LW link · 1 review

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery

Aug 9, 2023, 7:06 AM

69 points

20 comments · 12 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery

Jul 28, 2023, 2:46 AM

122 points

18 comments · 9 min read · LW link · 1 review

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

Apr 30, 2024, 6:51 PM

208 points

43 comments · 45 min read · LW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

TurnTrout, peligrietzer and lisathiergart

Mar 31, 2023, 7:20 PM

101 points

17 comments · 11 min read · LW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

Fabien Roger and simeon_c

Dec 14, 2022, 2:33 PM

29 points

5 comments · 11 min read · LW link

An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

Jan Wehner

Jul 14, 2024, 10:37 AM

37 points

6 comments · 17 min read · LW link

Programming Refusal with Conditional Activation Steering

Bruce W. Lee

Sep 11, 2024, 8:57 PM

41 points

0 comments · 11 min read · LW link

(brucewlee.com)

Representation Tuning

Christopher Ackerman

Jun 27, 2024, 5:44 PM

35 points

9 comments · 13 min read · LW link

ActAdd: Steering Language Models without Optimization

technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini and Monte M

Sep 6, 2023, 5:21 PM

105 points

3 comments · 2 min read · LW link

(arxiv.org)

I found >800 orthogonal “write code” steering vectors

Jacob G-W and TurnTrout

Jul 15, 2024, 7:06 PM

102 points

19 comments · 7 min read · LW link

(jacobgw.com)

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen

Aug 17, 2023, 1:53 PM

21 points

0 comments · 14 min read · LW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah and shash42

Sep 25, 2023, 5:19 PM

25 points

3 comments · 7 min read · LW link

LLMs Universally Learn a Feature Representing Token Frequency / Rarity

Sean Osier

Jun 30, 2024, 2:48 AM

12 points

5 comments · 6 min read · LW link

(github.com)

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

Jan 2, 2024, 12:47 AM

125 points

29 comments · 8 min read · LW link

(arxiv.org)

Implementing activation steering

Annah

Feb 5, 2024, 5:51 PM

74 points

8 comments · 7 min read · LW link

[Question] What’s the theory of impact for activation vectors?

Chris_Leong

Feb 11, 2024, 7:34 AM

61 points

12 comments · 1 min read · LW link

[Research sprint] Single-model crosscoder feature ablation and steering

Thomas Read

Apr 6, 2025, 2:42 PM

8 points

0 comments · 12 min read · LW link

Jailbreak steering generalization

Sarah Ball and Nina Panickssery

Jun 20, 2024, 5:25 PM

41 points

4 comments · 2 min read · LW link

(arxiv.org)

Validating / finding alignment-relevant concepts using neural data

Bogdan Ionut Cirstea

Sep 20, 2024, 9:12 PM

7 points

0 comments · 1 min read · LW link

(docs.google.com)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

Dec 3, 2024, 9:19 PM

100 points

7 comments · 41 min read · LW link

Steering Gemini with BiDPO

TurnTrout

Jan 31, 2025, 2:37 AM

104 points

5 comments · 1 min read · LW link

(turntrout.com)

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts

Ana Kapros

Feb 12, 2025, 7:12 PM

7 points

0 comments · 5 min read · LW link

Comparing representation vectors between llama 2 base and chat

Nina Panickssery

Oct 28, 2023, 10:54 PM

36 points

5 comments · 2 min read · LW link

Activation additions in a simple MNIST network

Garrett Baker

May 18, 2023, 2:49 AM

26 points

0 comments · 2 min read · LW link

Open problems in activation engineering

TurnTrout, woog, lisathiergart, Monte M and Ulisse Mini

Jul 24, 2023, 7:46 PM

51 points

2 comments · 1 min read · LW link

(coda.io)

Activation additions in a small residual network

Garrett Baker

May 22, 2023, 8:28 PM

22 points

4 comments · 3 min read · LW link

Decoding intermediate activations in llama-2-7b

Nina Panickssery

Jul 21, 2023, 5:35 AM

39 points

3 comments · 4 min read · LW link

[ASoT] GPT2 Steering & The Tuned Lens

Ulisse Mini

Jul 1, 2023, 2:12 PM

23 points

0 comments · 2 min read · LW link

Red-teaming language models via activation engineering

Nina Panickssery

Aug 26, 2023, 5:52 AM

69 points

6 comments · 9 min read · LW link

Understanding and visualizing sycophancy datasets

Nina Panickssery

Aug 16, 2023, 5:34 AM

45 points

0 comments · 6 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell

Sep 23, 2023, 7:16 PM

42 points

7 comments · 34 min read · LW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

Mar 11, 2023, 6:59 PM

333 points

28 comments · 23 min read · LW link

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

likenneth

Jun 11, 2023, 5:38 AM

195 points

4 comments · 1 min read · LW link

(arxiv.org)

Paper: Understanding and Controlling a Maze-Solving Policy Network

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M and lisathiergart

Oct 13, 2023, 1:38 AM

70 points

0 comments · 1 min read · LW link

(arxiv.org)

Sleeper agents appear resilient to activation steering

Lucy Wingard

Feb 3, 2025, 7:31 PM

4 points

0 comments · 7 min read · LW link

How well do truth probes generalise?

mishajw

Feb 24, 2024, 2:12 PM

92 points

11 comments · 9 min read · LW link

Open Challenges in Representation Engineering

Jan Wehner and Daniel Tan

Apr 3, 2025, 7:21 PM

13 points

0 comments · 5 min read · LW link

Features and Adversaries in MemoryDT

Joseph Bloom and Jay Bailey

Oct 20, 2023, 7:32 AM

31 points

6 comments · 25 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah

Nov 17, 2023, 1:54 PM

15 points

6 comments · 2 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai

Jun 16, 2024, 1:01 PM

7 points

0 comments · 7 min read · LW link

(arxiv.org)

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley

Jan 5, 2024, 8:46 AM

37 points

4 comments · 2 min read · LW link

Emergent Misalignment and Emergent Alignment

Alvin Ånestrand

Apr 3, 2025, 8:04 AM

5 points

0 comments · 8 min read · LW link

Control Vectors as Dispositional Traits

Gianluca Calcagni

Jun 23, 2024, 9:34 PM

10 points

0 comments · 11 min read · LW link

Activation Engineering Theories of Impact

kubanetics

Jul 18, 2024, 4:44 PM

6 points

1 comment · 2 min read · LW link

Auto-matching hidden layers in Pytorch LLMs

chanind

Feb 19, 2024, 12:40 PM

2 points

0 comments · 3 min read · LW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman

Sep 27, 2024, 5:49 PM

7 points

2 comments · 9 min read · LW link

One-shot steering vectors cause emergent misalignment, too

Jacob Dunefsky

Apr 14, 2025, 6:40 AM

84 points

6 comments · 11 min read · LW link

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky

Feb 28, 2025, 12:01 PM

18 points

1 comment · 14 min read · LW link

(arxiv.org)

A Sober Look at Steering Vectors for LLMs

Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan and David Scott Krueger (formerly: capybaralet)

Nov 23, 2024, 5:30 PM

38 points

0 comments · 5 min read · LW link

Investigating Bias Representations in LLMs via Activation Steering

DawnLu

Jan 15, 2024, 7:39 PM

29 points

4 comments · 5 min read · LW link

Introducing SARA: a new activation steering technique

Alejandro Tlaie

Jun 9, 2024, 3:33 PM

17 points

7 comments · 6 min read · LW link
