
Activation Engineering

Tag
Last edit: Aug 29, 2023, 3:05 AM by David Udell

Activation engineering is the direct manipulation of activation vectors inside a trained machine learning model. It is a potential way to steer a model’s behavior.

Activation engineering can be contrasted with two other strategies for steering models: fine-tuning a model toward the desired behavior, and crafting prompts that elicit a particular response.
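As a minimal sketch of the idea, assume a toy two-layer network with hand-picked weights. The names `W1`, `W2`, `hidden_activation`, `forward`, and `steering_vector` are hypothetical and are not taken from any post listed here. The example adds a scaled vector to a hidden activation at inference time and leaves the weights untouched. The steering vector is built as the difference between the activations of two contrasting inputs, which is one common recipe rather than the only one.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]

# Hypothetical hand-picked weights standing in for a trained model.
W1: Matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # input (2) -> hidden (3)
W2: Matrix = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5]]   # hidden (3) -> output (2)


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def hidden_activation(x: Vector) -> Vector:
    """Hidden-layer activation (ReLU) for input x."""
    return [max(0.0, h) for h in matvec(W1, x)]


def forward(x: Vector, steering: Optional[Vector] = None, coeff: float = 1.0) -> Vector:
    """Run the model, optionally adding coeff * steering to the hidden activation."""
    h = hidden_activation(x)
    if steering is not None:
        # The intervention: edit the activation, not the weights.
        h = [a + coeff * s for a, s in zip(h, steering)]
    return matvec(W2, h)


def steering_vector(x_pos: Vector, x_neg: Vector) -> Vector:
    """Difference between hidden activations of two contrasting inputs."""
    h_pos = hidden_activation(x_pos)
    h_neg = hidden_activation(x_neg)
    return [p - n for p, n in zip(h_pos, h_neg)]


x = [1.0, 1.0]
v = steering_vector([1.0, 0.0], [0.0, 1.0])
baseline = forward(x)                          # [1.0, 1.0]: no preference between outputs
steered = forward(x, steering=v, coeff=2.0)    # [5.0, -3.0]: pushed toward output 0
print(baseline, steered)
```

In this toy setting the same input yields a tied output without steering and a clear preference for the first output with it. Real language models apply the same kind of edit to much larger activation vectors, usually at a chosen layer.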

Steering GPT-2-XL by adding an activation vector

May 13, 2023, 6:42 PM
437 points
98 comments
50 min read
1 review

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery
Aug 9, 2023, 7:06 AM
69 points
20 comments
12 min read

Extracting and Evaluating Causal Direction in LLMs’ Activations

Dec 14, 2022, 2:33 PM
29 points
5 comments
11 min read

Mechanistically Eliciting Latent Behaviors in Language Models

Apr 30, 2024, 6:51 PM
207 points
43 comments
45 min read

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery
Jul 28, 2023, 2:46 AM
122 points
18 comments
9 min read
1 review

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Mar 31, 2023, 7:20 PM
101 points
17 comments
11 min read

An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

Jan Wehner
Jul 14, 2024, 10:37 AM
36 points
6 comments
17 min read

Programming Refusal with Conditional Activation Steering

Bruce W. Lee
Sep 11, 2024, 8:57 PM
41 points
0 comments
11 min read
(brucewlee.com)

Representation Tuning

Christopher Ackerman
Jun 27, 2024, 5:44 PM
35 points
9 comments
13 min read

I found >800 orthogonal “write code” steering vectors

Jul 15, 2024, 7:06 PM
100 points
19 comments
7 min read
(jacobgw.com)

Evaluating hidden directions on the utility dataset: classification, steering and removal

Sep 25, 2023, 5:19 PM
25 points
3 comments
7 min read

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen
Aug 17, 2023, 1:53 PM
21 points
0 comments
14 min read

ActAdd: Steering Language Models without Optimization

Sep 6, 2023, 5:21 PM
105 points
3 comments
2 min read
(arxiv.org)

LLMs Universally Learn a Feature Representing Token Frequency / Rarity

Sean Osier
Jun 30, 2024, 2:48 AM
12 points
5 comments
6 min read
(github.com)

Validating / finding alignment-relevant concepts using neural data

Bogdan Ionut Cirstea
Sep 20, 2024, 9:12 PM
7 points
0 comments
1 min read
(docs.google.com)

Implementing activation steering

Annah
Feb 5, 2024, 5:51 PM
73 points
8 comments
7 min read

[Question] What’s the theory of impact for activation vectors?

Chris_Leong
Feb 11, 2024, 7:34 AM
59 points
12 comments
1 min read

Jailbreak steering generalization

Jun 20, 2024, 5:25 PM
41 points
4 comments
2 min read
(arxiv.org)

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
125 points
29 comments
8 min read
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Dec 3, 2024, 9:19 PM
100 points
7 comments
41 min read

Steering Gemini with BiDPO

TurnTrout
Jan 31, 2025, 2:37 AM
104 points
5 comments
1 min read
(turntrout.com)

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts

Ana Kapros
Feb 12, 2025, 7:12 PM
7 points
0 comments
5 min read

Comparing representation vectors between llama 2 base and chat

Nina Panickssery
Oct 28, 2023, 10:54 PM
36 points
5 comments
2 min read

Activation additions in a simple MNIST network

Garrett Baker
May 18, 2023, 2:49 AM
26 points
0 comments
2 min read

Open problems in activation engineering

Jul 24, 2023, 7:46 PM
51 points
2 comments
1 min read
(coda.io)

Activation additions in a small residual network

Garrett Baker
May 22, 2023, 8:28 PM
22 points
4 comments
3 min read

Decoding intermediate activations in llama-2-7b

Nina Panickssery
Jul 21, 2023, 5:35 AM
39 points
3 comments
4 min read

[ASoT] GPT2 Steering & The Tuned Lens

Ulisse Mini
Jul 1, 2023, 2:12 PM
23 points
0 comments
2 min read

Red-teaming language models via activation engineering

Nina Panickssery
Aug 26, 2023, 5:52 AM
69 points
6 comments
9 min read

Understanding and visualizing sycophancy datasets

Nina Panickssery
Aug 16, 2023, 5:34 AM
45 points
0 comments
6 min read

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell
Sep 23, 2023, 7:16 PM
42 points
7 comments
34 min read

Understanding and controlling a maze-solving policy network

Mar 11, 2023, 6:59 PM
332 points
28 comments
23 min read

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

likenneth
Jun 11, 2023, 5:38 AM
195 points
4 comments
1 min read
(arxiv.org)

Paper: Understanding and Controlling a Maze-Solving Policy Network

Oct 13, 2023, 1:38 AM
70 points
0 comments
1 min read
(arxiv.org)

Introducing SARA: a new activation steering technique

Alejandro Tlaie
Jun 9, 2024, 3:33 PM
17 points
7 comments
6 min read

Features and Adversaries in MemoryDT

Oct 20, 2023, 7:32 AM
31 points
6 comments
25 min read

Classifying representations of sparse autoencoders (SAEs)

Annah
Nov 17, 2023, 1:54 PM
15 points
6 comments
2 min read

Auto-matching hidden layers in Pytorch LLMs

chanind
Feb 19, 2024, 12:40 PM
2 points
0 comments
3 min read

Activation Engineering Theories of Impact

kubanetics
Jul 18, 2024, 4:44 PM
6 points
1 comment
2 min read

Investigating Bias Representations in LLMs via Activation Steering

DawnLu
Jan 15, 2024, 7:39 PM
29 points
4 comments
5 min read

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman
Sep 27, 2024, 5:49 PM
7 points
2 comments
9 min read

Control Vectors as Dispositional Traits

Gianluca Calcagni
Jun 23, 2024, 9:34 PM
10 points
0 comments
11 min read

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky
Feb 28, 2025, 12:01 PM
15 points
1 comment
14 min read
(arxiv.org)

A Sober Look at Steering Vectors for LLMs

Nov 23, 2024, 5:30 PM
38 points
0 comments
5 min read

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley
Jan 5, 2024, 8:46 AM
37 points
4 comments
2 min read

How well do truth probes generalise?

mishajw
Feb 24, 2024, 2:12 PM
92 points
11 comments
9 min read

Sleeper agents appear resilient to activation steering

Lucy Wingard
Feb 3, 2025, 7:31 PM
4 points
0 comments
7 min read

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai
Jun 16, 2024, 1:01 PM
7 points
0 comments
7 min read
(arxiv.org)