RSS

Ac­ti­va­tion Engineering

TagLast edit: 29 Aug 2023 3:05 UTC by David Udell

Activation Engineering is the direct manipulation of activation vectors inside of a trained machine learning model. Potentially, it is a way to steer a model’s behavior.

Activation engineering can be contrasted with other strategies for steering models: fine-tuning the models for desired behavior and crafting prompts that get a particular response.

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

13 May 2023 18:42 UTC
436 points
97 comments50 min readLW link

Mo­du­lat­ing syco­phancy in an RLHF model via ac­ti­va­tion steering

Nina Panickssery9 Aug 2023 7:06 UTC
69 points
20 comments12 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments11 min readLW link

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

30 Apr 2024 18:51 UTC
204 points
40 comments45 min readLW link

Re­duc­ing syco­phancy and im­prov­ing hon­esty via ac­ti­va­tion steering

Nina Panickssery28 Jul 2023 2:46 UTC
122 points
17 comments9 min readLW link

Maze-solv­ing agents: Add a top-right vec­tor, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments11 min readLW link

An In­tro­duc­tion to Rep­re­sen­ta­tion Eng­ineer­ing—an ac­ti­va­tion-based paradigm for con­trol­ling LLMs

Jan Wehner14 Jul 2024 10:37 UTC
35 points
5 comments17 min readLW link

[Paper] Pro­gram­ming Re­fusal with Con­di­tional Ac­ti­va­tion Steering

Bruce W. Lee11 Sep 2024 20:57 UTC
41 points
0 comments11 min readLW link
(arxiv.org)

Rep­re­sen­ta­tion Tuning

Christopher Ackerman27 Jun 2024 17:44 UTC
35 points
9 comments13 min readLW link

I found >800 or­thog­o­nal “write code” steer­ing vectors

15 Jul 2024 19:06 UTC
95 points
19 comments7 min readLW link
(jacobgw.com)

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

25 Sep 2023 17:19 UTC
25 points
3 comments7 min readLW link

Un­der­stand­ing Coun­ter­bal­anced Sub­trac­tions for Bet­ter Ac­ti­va­tion Additions

ojorgensen17 Aug 2023 13:53 UTC
21 points
0 comments14 min readLW link

Ac­tAdd: Steer­ing Lan­guage Models with­out Optimization

6 Sep 2023 17:21 UTC
105 points
3 comments2 min readLW link
(arxiv.org)

LLMs Univer­sally Learn a Fea­ture Rep­re­sent­ing To­ken Fre­quency /​ Rarity

Sean Osier30 Jun 2024 2:48 UTC
12 points
5 comments6 min readLW link
(github.com)

Steer­ing Llama-2 with con­trastive ac­ti­va­tion additions

2 Jan 2024 0:47 UTC
123 points
29 comments8 min readLW link
(arxiv.org)

Im­ple­ment­ing ac­ti­va­tion steering

Annah5 Feb 2024 17:51 UTC
66 points
7 comments7 min readLW link

[Question] What’s the the­ory of im­pact for ac­ti­va­tion vec­tors?

Chris_Leong11 Feb 2024 7:34 UTC
57 points
12 comments1 min readLW link

Jailbreak steer­ing generalization

20 Jun 2024 17:25 UTC
41 points
4 comments2 min readLW link
(arxiv.org)

Val­i­dat­ing /​ find­ing al­ign­ment-rele­vant con­cepts us­ing neu­ral data

Bogdan Ionut Cirstea20 Sep 2024 21:12 UTC
7 points
0 comments1 min readLW link
(docs.google.com)

Com­par­ing rep­re­sen­ta­tion vec­tors be­tween llama 2 base and chat

Nina Panickssery28 Oct 2023 22:54 UTC
36 points
5 comments2 min readLW link

Ac­ti­va­tion ad­di­tions in a sim­ple MNIST network

Garrett Baker18 May 2023 2:49 UTC
26 points
0 comments2 min readLW link

Open prob­lems in ac­ti­va­tion engineering

24 Jul 2023 19:46 UTC
51 points
2 comments1 min readLW link
(coda.io)

Ac­ti­va­tion ad­di­tions in a small resi­d­ual network

Garrett Baker22 May 2023 20:28 UTC
22 points
4 comments3 min readLW link

De­cod­ing in­ter­me­di­ate ac­ti­va­tions in llama-2-7b

Nina Panickssery21 Jul 2023 5:35 UTC
37 points
3 comments4 min readLW link

[ASoT] GPT2 Steer­ing & The Tuned Lens

Ulisse Mini1 Jul 2023 14:12 UTC
23 points
0 comments2 min readLW link

Red-team­ing lan­guage mod­els via ac­ti­va­tion engineering

Nina Panickssery26 Aug 2023 5:52 UTC
69 points
6 comments9 min readLW link

Un­der­stand­ing and vi­su­al­iz­ing syco­phancy datasets

Nina Panickssery16 Aug 2023 5:34 UTC
45 points
0 comments6 min readLW link

Sparse Cod­ing, for Mechanis­tic In­ter­pretabil­ity and Ac­ti­va­tion Engineering

David Udell23 Sep 2023 19:16 UTC
42 points
7 comments34 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
328 points
27 comments23 min readLW link

In­fer­ence-Time In­ter­ven­tion: Elic­it­ing Truth­ful An­swers from a Lan­guage Model

likenneth11 Jun 2023 5:38 UTC
195 points
4 comments1 min readLW link
(arxiv.org)

Paper: Un­der­stand­ing and Con­trol­ling a Maze-Solv­ing Policy Network

13 Oct 2023 1:38 UTC
70 points
0 comments1 min readLW link
(arxiv.org)

Auto-match­ing hid­den lay­ers in Py­torch LLMs

chanind19 Feb 2024 12:40 UTC
2 points
0 comments3 min readLW link

In­ves­ti­gat­ing Bias Rep­re­sen­ta­tions in LLMs via Ac­ti­va­tion Steering

DawnLu15 Jan 2024 19:39 UTC
29 points
4 comments5 min readLW link

Strik­ing Im­pli­ca­tions for Learn­ing The­ory, In­ter­pretabil­ity — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

How well do truth probes gen­er­al­ise?

mishajw24 Feb 2024 14:12 UTC
87 points
11 comments9 min readLW link

Ac­ti­va­tion Eng­ineer­ing The­o­ries of Impact

kubanetics18 Jul 2024 16:44 UTC
6 points
1 comment2 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry Cai16 Jun 2024 13:01 UTC
7 points
0 comments7 min readLW link
(arxiv.org)

Avoid­ing jailbreaks by dis­cour­ag­ing their rep­re­sen­ta­tion in ac­ti­va­tion space

Guido Bergman27 Sep 2024 17:49 UTC
6 points
2 comments9 min readLW link

Con­trol Vec­tors as Dis­po­si­tional Traits

Gianluca Calcagni23 Jun 2024 21:34 UTC
9 points
0 comments11 min readLW link

In­tro­duc­ing SARA: a new ac­ti­va­tion steer­ing technique

Alejandro Tlaie9 Jun 2024 15:33 UTC
17 points
7 comments6 min readLW link

Fea­tures and Ad­ver­saries in MemoryDT

20 Oct 2023 7:32 UTC
31 points
6 comments25 min readLW link

Clas­sify­ing rep­re­sen­ta­tions of sparse au­toen­coders (SAEs)

Annah17 Nov 2023 13:54 UTC
15 points
6 comments2 min readLW link
No comments.