
MATS Program

Last edit: Dec 30, 2024, 9:26 AM by Dakara

The ML Alignment & Theory Scholars (MATS) Program is an educational seminar and independent research program that aims to provide talented scholars with talks, workshops, and research mentorship in the field of AI alignment, and to connect them with the Berkeley AI safety research community.

SERI MATS Program—Winter 2022 Cohort

Oct 8, 2022, 7:09 PM
72 points
12 comments · 4 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

Feb 5, 2023, 10:02 PM
680 points
206 comments · 12 min read · LW link · 1 review

Understanding and controlling a maze-solving policy network

Mar 11, 2023, 6:59 PM
332 points
28 comments · 23 min read · LW link

Project proposal: Testing the IBP definition of agent

Aug 9, 2022, 1:09 AM
21 points
4 comments · 2 min read · LW link

SERI MATS—Summer 2023 Cohort

Apr 8, 2023, 3:32 PM
71 points
25 comments · 4 min read · LW link

Soft optimization makes the value target bigger

Jeremy Gillen · Jan 2, 2023, 4:06 PM
117 points
20 comments · 12 min read · LW link

How MATS addresses “mass movement building” concerns

Ryan Kidd · May 4, 2023, 12:55 AM
62 points
9 comments · 3 min read · LW link

SERI ML Alignment Theory Scholars Program 2022

Apr 27, 2022, 12:43 AM
67 points
6 comments · 3 min read · LW link

Talk: AI safety fieldbuilding at MATS

Ryan Kidd · Jun 23, 2024, 11:06 PM
26 points
2 comments · 10 min read · LW link

Finite Factored Sets in Pictures

Magdalena Wache · Dec 11, 2022, 6:49 PM
174 points
35 comments · 12 min read · LW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker · Aug 26, 2022, 6:26 PM
120 points
48 comments · 1 min read · LW link

I found >800 orthogonal “write code” steering vectors

Jul 15, 2024, 7:06 PM
99 points
19 comments · 7 min read · LW link
(jacobgw.com)

My MATS Summer 2023 experience

James Chua · Mar 20, 2024, 11:26 AM
29 points
0 comments · 3 min read · LW link
(jameschua.net)

Infra-Bayesian haggling

hannagabor · May 20, 2024, 12:23 PM
28 points
0 comments · 20 min read · LW link

Predictions for shard theory mechanistic interpretability results

Mar 1, 2023, 5:16 AM
105 points
10 comments · 5 min read · LW link

Neural Tangent Kernel Distillation

Oct 5, 2022, 6:11 PM
76 points
20 comments · 8 min read · LW link

Normative vs Descriptive Models of Agency

mattmacdermott · Feb 2, 2023, 8:28 PM
26 points
5 comments · 4 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · Jul 22, 2024, 6:45 PM
118 points
19 comments · 12 min read · LW link

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery · Aug 9, 2023, 7:06 AM
69 points
20 comments · 12 min read · LW link

[ASoT] Reflectivity in Narrow AI

Ulisse Mini · Nov 21, 2022, 12:51 AM
6 points
1 comment · 1 min read · LW link

Clarifying mesa-optimization

Mar 21, 2023, 3:53 PM
38 points
6 comments · 10 min read · LW link

Talent Needs of Technical AI Safety Teams

May 24, 2024, 12:36 AM
115 points
65 comments · 14 min read · LW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_W · Jul 18, 2022, 3:38 AM
15 points
1 comment · 10 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · Jan 16, 2023, 8:48 PM
59 points
3 comments · 14 min read · LW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett Baker · Jul 24, 2022, 12:42 AM
7 points
0 comments · 10 min read · LW link

Race Along Rashomon Ridge

Jul 7, 2022, 3:20 AM
50 points
15 comments · 8 min read · LW link

MATS Summer 2023 Retrospective

Dec 1, 2023, 11:29 PM
77 points
34 comments · 26 min read · LW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments · 20 min read · LW link

Broad Basins and Data Compression

Aug 8, 2022, 8:33 PM
33 points
6 comments · 7 min read · LW link

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Aug 17, 2024, 1:16 AM
53 points
0 comments · 5 min read · LW link

Behavioural statistics for a maze-solving agent

Apr 20, 2023, 10:26 PM
46 points
11 comments · 10 min read · LW link

[Closed] Agent Foundations track in MATS

Vanessa Kosoy · Oct 31, 2023, 8:12 AM
54 points
1 comment · 1 min read · LW link
(www.matsprogram.org)

Balancing Security Mindset with Collaborative Research: A Proposal

MadHatter · Nov 1, 2023, 12:46 AM
9 points
3 comments · 4 min read · LW link

Apply for MATS Winter 2023-24!

Oct 21, 2023, 2:27 AM
104 points
6 comments · 5 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
161 points
12 comments · 11 min read · LW link
(arxiv.org)

Apply to MATS 7.0!

Sep 21, 2024, 12:23 AM
31 points
0 comments · 5 min read · LW link

Game Theory without Argmax [Part 2]

Cleo Nardo · Nov 11, 2023, 4:02 PM
31 points
14 comments · 13 min read · LW link

More findings on Memorization and double descent

Marius Hobbhahn · Feb 1, 2023, 6:26 PM
53 points
2 comments · 19 min read · LW link

More findings on maximal data dimension

Marius Hobbhahn · Feb 2, 2023, 6:33 PM
27 points
1 comment · 11 min read · LW link

Self-explaining SAE features

Aug 5, 2024, 10:20 PM
60 points
13 comments · 10 min read · LW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell · Apr 15, 2024, 6:21 PM
29 points
7 comments · 12 min read · LW link

[ASoT] Policy Trajectory Visualization

Ulisse Mini · Feb 7, 2023, 12:13 AM
9 points
2 comments · 1 min read · LW link

The Geometry of Feelings and Nonsense in Large Language Models

Sep 27, 2024, 5:49 PM
59 points
10 comments · 4 min read · LW link

MATS Alumni Impact Analysis

Sep 30, 2024, 2:35 AM
61 points
7 comments · 11 min read · LW link

Qualities that alignment mentors value in junior researchers

Akash · Feb 14, 2023, 11:27 PM
88 points
14 comments · 3 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

Jan 16, 2024, 12:26 AM
83 points
9 comments · 18 min read · LW link

Uncertainty in all its flavours

Cleo Nardo · Jan 9, 2024, 4:21 PM
27 points
6 comments · 35 min read · LW link

Intervening in the Residual Stream

MadHatter · Feb 22, 2023, 6:29 AM
30 points
1 comment · 9 min read · LW link

MATS AI Safety Strategy Curriculum

Mar 7, 2024, 7:59 PM
74 points
2 comments · 16 min read · LW link

MATS AI Safety Strategy Curriculum v2

Oct 7, 2024, 10:44 PM
42 points
6 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · Jul 18, 2022, 7:11 AM
59 points
8 comments · 20 min read · LW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

Aug 16, 2022, 2:09 AM
21 points
2 comments · 16 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Aug 23, 2024, 6:52 PM
41 points
5 comments · 16 min read · LW link

My Advice for Incoming SERI MATS Scholars

Johannes C. Mayer · Jan 3, 2023, 7:25 PM
58 points
6 comments · 4 min read · LW link

Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanS · Dec 24, 2022, 4:17 AM
28 points
3 comments · 12 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

Apr 30, 2024, 6:51 PM
206 points
43 comments · 45 min read · LW link

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Aug 23, 2024, 10:03 PM
10 points
0 comments · 25 min read · LW link

Automating LLM Auditing with Developmental Interpretability

Sep 4, 2024, 3:50 PM
19 points
0 comments · 3 min read · LW link

Can We Align a Self-Improving AGI?

Peter S. Park · Aug 30, 2022, 12:14 AM
8 points
5 comments · 11 min read · LW link

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
124 points
29 comments · 8 min read · LW link
(arxiv.org)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 7, 2024, 9:28 PM
88 points
14 comments · 9 min read · LW link
(arxiv.org)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments · 2 min read · LW link
(arxiv.org)

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jan 14, 2024, 2:06 AM
23 points
0 comments · 42 min read · LW link

Swap and Scale

Stephen Fowler · Sep 9, 2022, 10:41 PM
17 points
3 comments · 1 min read · LW link

Attention SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points
4 comments · 8 min read · LW link

MATS mentor selection

Jan 10, 2025, 3:12 AM
42 points
11 comments · 6 min read · LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

Jul 2, 2024, 1:17 PM
86 points
7 comments · 12 min read · LW link

What sorts of systems can be deceptive?

Andrei Alexandru · Oct 31, 2022, 10:00 PM
16 points
0 comments · 7 min read · LW link

Auditing games for high-level interpretability

Paul Colognese · Nov 1, 2022, 10:44 AM
33 points
1 comment · 7 min read · LW link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · Feb 6, 2025, 11:03 AM
62 points
2 comments · 8 min read · LW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Rumbelow · Nov 17, 2022, 11:06 AM
27 points
2 comments · 2 min read · LW link

Stitching SAEs of different sizes

Jul 13, 2024, 5:19 PM
39 points
12 comments · 12 min read · LW link

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

Oct 13, 2023, 6:32 PM
82 points
4 comments · 8 min read · LW link

On Interpretability’s Robustness

WCargo · Oct 18, 2023, 1:18 PM
11 points
0 comments · 4 min read · LW link

Modelling Deception

Garrett Baker · Jul 18, 2022, 9:21 PM
15 points
0 comments · 7 min read · LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson · Jul 19, 2022, 6:57 AM
19 points
8 comments · 16 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · Jul 19, 2022, 6:56 AM
11 points
4 comments · 18 min read · LW link

How complex are myopic imitators?

Vivek Hebbar · Feb 8, 2022, 12:00 PM
26 points
1 comment · 15 min read · LW link

My SERI MATS Application

Daniel Paleka · May 30, 2022, 2:04 AM
16 points
0 comments · 8 min read · LW link

How (not) to choose a research project

Aug 9, 2022, 12:26 AM
79 points
11 comments · 7 min read · LW link

Team Shard Status Report

David Udell · Aug 9, 2022, 5:33 AM
38 points
8 comments · 3 min read · LW link

Finding Skeletons on Rashomon Ridge

Jul 24, 2022, 10:31 PM
30 points
2 comments · 7 min read · LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera · Aug 3, 2022, 12:03 PM
134 points
23 comments · 6 min read · LW link

Translating between Latent Spaces

Jul 30, 2022, 3:25 AM
27 points
2 comments · 8 min read · LW link

Shard Theory: An Overview

David Udell · Aug 11, 2022, 5:44 AM
166 points
34 comments · 10 min read · LW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Aug 10, 2022, 6:14 PM
28 points
30 comments · 11 min read · LW link

Identification of Natural Modularity

Stephen Fowler · Jun 25, 2022, 3:05 PM
15 points
3 comments · 7 min read · LW link

How transparency changed over time

ViktoriaMalyasova · Jul 30, 2022, 4:36 AM
21 points
0 comments · 6 min read · LW link

How Interpretability can be Impactful

Connall Garrod · Jul 18, 2022, 12:06 AM
18 points
0 comments · 37 min read · LW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei Alexandru · Jul 17, 2022, 8:31 PM
24 points
0 comments · 10 min read · LW link

Interview: Applications w/ Alice Rigg

jacobhaimes · Dec 19, 2023, 7:03 PM
12 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

Notes on Learning the Prior

carboniferous_umbraculum · Jul 15, 2022, 5:28 PM
25 points
2 comments · 25 min read · LW link

Deception?! I ain’t got time for that!

Paul Colognese · Jul 18, 2022, 12:06 AM
55 points
5 comments · 13 min read · LW link

Information Loss --> Basin flatness

Vivek Hebbar · May 21, 2022, 12:58 PM
62 points
31 comments · 7 min read · LW link

[Short version] Information Loss --> Basin flatness

Vivek Hebbar · May 21, 2022, 12:59 PM
12 points
0 comments · 1 min read · LW link

Finding Goals in the World Model

Aug 22, 2022, 6:06 PM
59 points
8 comments · 13 min read · LW link

The Shard Theory Alignment Scheme

David Udell · Aug 25, 2022, 4:52 AM
47 points
32 comments · 2 min read · LW link

The Core of the Alignment Problem is...

Aug 17, 2022, 8:07 PM
76 points
10 comments · 9 min read · LW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. Hudson · Aug 17, 2022, 3:56 AM
6 points
2 comments · 4 min read · LW link

A brief note on Simplicity Bias

carboniferous_umbraculum · Aug 14, 2022, 2:05 AM
20 points
0 comments · 4 min read · LW link

Inner Alignment via Superpowers

Aug 30, 2022, 8:01 PM
37 points
13 comments · 4 min read · LW link

Behaviour Manifolds and the Hessian of the Total Loss—Notes and Criticism

carboniferous_umbraculum · Sep 3, 2022, 12:15 AM
35 points
5 comments · 6 min read · LW link

Framing AI Childhoods

David Udell · Sep 6, 2022, 11:40 PM
37 points
8 comments · 4 min read · LW link

Searching for Modularity in Large Language Models

Sep 8, 2022, 2:25 AM
44 points
3 comments · 14 min read · LW link

Trying to find the underlying structure of computational systems

Matthias G. Mayer · Sep 13, 2022, 9:16 PM
17 points
9 comments · 4 min read · LW link

Theoretical Neuroscience For Alignment Theory

Cameron Berg · Dec 7, 2021, 9:50 PM
65 points
18 comments · 23 min read · LW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougall · Dec 14, 2021, 11:14 PM
39 points
9 comments · 19 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut · Dec 16, 2021, 1:07 AM
16 points
0 comments · 42 min read · LW link

Understanding and controlling auto-induced distributional shift

L Rudolf L · Dec 13, 2021, 2:59 PM
33 points
4 comments · 16 min read · LW link

Why I’m Working On Model Agnostic Interpretability

Jessica Rumbelow · Nov 11, 2022, 9:24 AM
27 points
9 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

Nov 19, 2022, 9:04 PM
45 points
0 comments · 3 min read · LW link

Guardian AI (Misaligned systems are all around us.)

Jessica Rumbelow · Nov 25, 2022, 3:55 PM
15 points
6 comments · 2 min read · LW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs · Dec 5, 2022, 1:36 PM
19 points
11 comments · 2 min read · LW link

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibs · Dec 5, 2022, 4:09 PM
28 points
6 comments · 8 min read · LW link

Working towards AI alignment is better

Johannes C. Mayer · Dec 9, 2022, 3:39 PM
8 points
2 comments · 2 min read · LW link

Proper scoring rules don’t guarantee predicting fixed points

Dec 16, 2022, 6:22 PM
79 points
8 comments · 21 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk · Dec 28, 2022, 7:49 AM
36 points
5 comments · 65 min read · LW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs · Dec 30, 2022, 2:40 AM
104 points
2 comments · 18 min read · LW link

Results from a survey on tool use and workflows in alignment research

Dec 19, 2022, 3:19 PM
79 points
2 comments · 19 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · Dec 15, 2022, 8:11 PM
24 points
5 comments · 1 min read · LW link

Some Notes on the mathematics of Toy Autoencoding Problems

carboniferous_umbraculum · Dec 22, 2022, 5:21 PM
18 points
1 comment · 12 min read · LW link

The Alignment Problems

Martín Soto · Jan 12, 2023, 10:29 PM
20 points
0 comments · 4 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang · Jan 13, 2023, 4:23 AM
86 points
6 comments · 18 min read · LW link

Neural networks generalize because of this one weird trick

Jesse Hoogland · Jan 18, 2023, 12:10 AM
179 points
29 comments · 53 min read · LW link · 1 review
(www.jessehoogland.com)

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · Jan 16, 2023, 10:46 PM
31 points
7 comments · 17 min read · LW link
(docs.google.com)

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

Jan 25, 2023, 7:03 PM
48 points
6 comments · 12 min read · LW link

Stop-gradients lead to fixed point predictions

Jan 28, 2023, 10:47 PM
37 points
2 comments · 24 min read · LW link

Spooky action at a distance in the loss landscape

Jan 28, 2023, 12:22 AM
61 points
4 comments · 7 min read · LW link
(www.jessehoogland.com)

Using PICT against PastaGPT Jailbreaking

Quentin FEUILLADE--MONTIXI · Feb 9, 2023, 4:30 AM
26 points
0 comments · 9 min read · LW link

Gradient surfing: the hidden role of regularization

Jesse Hoogland · Feb 6, 2023, 3:50 AM
37 points
9 comments · 14 min read · LW link
(www.jessehoogland.com)

SolidGoldMagikarp II: technical details and more recent findings

Feb 6, 2023, 7:09 PM
113 points
45 comments · 13 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

Feb 20, 2023, 7:35 PM
96 points
8 comments · 21 min read · LW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland · Feb 22, 2023, 4:16 AM
34 points
11 comments · 3 min read · LW link
(www.jessehoogland.com)

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum · Feb 19, 2023, 10:08 PM
22 points
1 comment · 19 min read · LW link

Searching for a model’s concepts by their shape – a theoretical framework

Feb 23, 2023, 8:14 PM
51 points
0 comments · 19 min read · LW link

Why are counterfactuals elusive?

Martín Soto · Mar 3, 2023, 8:13 PM
14 points
6 comments · 2 min read · LW link

A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

MadHatter · Feb 26, 2023, 1:10 AM
61 points
14 comments · 6 min read · LW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

Mar 16, 2023, 4:38 PM
48 points
0 comments · 13 min read · LW link

Natural Abstractions: Key claims, Theorems, and Critiques

Mar 16, 2023, 4:37 PM
239 points
23 comments · 45 min read · LW link · 3 reviews

Training goals for large language models

Johannes Treutlein · Jul 18, 2022, 7:09 AM
28 points
5 comments · 19 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points
0 comments · 12 min read · LW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sep 28, 2023, 6:53 PM
187 points
39 comments · 3 min read · LW link · 1 review

How important is AI hacking as LLMs advance?

Artyom Karpov · Jan 29, 2024, 6:41 PM
1 point
0 comments · 6 min read · LW link

Understanding SAE Features with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments · 14 min read · LW link

Implementing activation steering

Annah · Feb 5, 2024, 5:51 PM
71 points
8 comments · 7 min read · LW link

Ophiology (or, how the Mamba architecture works)

Apr 9, 2024, 7:31 PM
67 points
8 comments · 10 min read · LW link

End-to-end hacking with language models

tchauvin · Apr 5, 2024, 3:06 PM
29 points
0 comments · 8 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

Apr 30, 2024, 5:58 PM
72 points
14 comments · 17 min read · LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry · Apr 29, 2024, 8:57 PM
92 points
8 comments · 11 min read · LW link

MATS Winter 2023-24 Retrospective

May 11, 2024, 12:09 AM
86 points
28 comments · 49 min read · LW link

Language Models Model Us

eggsyntax · May 17, 2024, 9:00 PM
158 points
55 comments · 7 min read · LW link

When fine-tuning fails to elicit GPT-3.5’s chess abilities

Theodore Chapman · Jun 14, 2024, 6:50 PM
42 points
3 comments · 9 min read · LW link

Attention Output SAEs Improve Circuit Analysis

Jun 21, 2024, 12:56 PM
33 points
3 comments · 19 min read · LW link

[Research log] The board of Alphabet would stop DeepMind to save the world

Lucie Philippon · Jul 16, 2024, 4:59 AM
6 points
0 comments · 4 min read · LW link

My experience applying to MATS 6.0

mic · Jul 18, 2024, 7:02 PM
16 points
3 comments · 5 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

Jul 20, 2024, 2:20 AM
53 points
0 comments · 4 min read · LW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Jul 22, 2024, 4:17 PM
69 points
0 comments · 16 min read · LW link

Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk

Lucie Philippon · Jul 25, 2024, 1:12 AM
18 points
7 comments · 2 min read · LW link

GPT-2 Sometimes Fails at IOI

Ronak_Mehta · Aug 14, 2024, 11:24 PM
13 points
0 comments · 2 min read · LW link
(ronakrm.github.io)

[Interim research report] Evaluating the Goal-Directedness of Language Models

Jul 18, 2024, 6:19 PM
39 points
4 comments · 11 min read · LW link

Domain-specific SAEs

jacob_drori · Oct 7, 2024, 8:15 PM
27 points
2 comments · 5 min read · LW link

[Job Ad] MATS is hiring!

Oct 9, 2024, 2:17 AM
10 points
0 comments · 5 min read · LW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde · Oct 30, 2024, 10:50 PM
27 points
0 comments · 12 min read · LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Nov 7, 2024, 3:39 PM
50 points
6 comments · 11 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points
0 comments · 18 min read · LW link

The slingshot helps with learning

Wilson Wu · Oct 31, 2024, 11:18 PM
33 points
0 comments · 8 min read · LW link

Bridging the VLM and mech interp communities for multimodal interpretability

Sonia Joseph · Oct 28, 2024, 2:41 PM
19 points
5 comments · 15 min read · LW link

SAE Probing: What is it good for? Absolutely something!

Nov 1, 2024, 7:23 PM
32 points
0 comments · 11 min read · LW link

Intricacies of Feature Geometry in Large Language Models

Dec 7, 2024, 6:10 PM
68 points
0 comments · 12 min read · LW link

Tips On Empirical Research Slides

Jan 8, 2025, 5:06 AM
89 points
4 comments · 6 min read · LW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Jan 10, 2025, 11:08 AM
86 points
10 comments · 17 min read · LW link

Early Experiments in Human Auditing for AI Control

Jan 23, 2025, 1:34 AM
27 points
0 comments · 7 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · Jan 29, 2025, 9:01 PM
9 points
5 comments · 4 min read · LW link

Post-hoc reasoning in chain of thought

Kyle Cox · Feb 5, 2025, 6:58 PM
6 points
0 comments · 11 min read · LW link

Empirical risk minimization is fundamentally confused

Jesse Hoogland · Mar 22, 2023, 4:58 PM
32 points
5 comments · 1 min read · LW link

Approximation is expensive, but the lunch is cheap

Apr 19, 2023, 2:19 PM
70 points
3 comments · 16 min read · LW link

Fixed points in mortal population games

ViktoriaMalyasova · Mar 14, 2023, 7:10 AM
31 points
0 comments · 12 min read · LW link
(www.lesswrong.com)

A mostly critical review of infra-Bayesianism

David Matolcsi · Feb 28, 2023, 6:37 PM
104 points
9 comments · 29 min read · LW link

Performance guarantees in classical learning theory and infra-Bayesianism

David Matolcsi · Feb 28, 2023, 6:37 PM
9 points
4 comments · 31 min read · LW link

Non-Unitary Quantum Logic—SERI MATS Research Sprint

Yegreg · Feb 16, 2023, 7:31 PM
27 points
0 comments · 7 min read · LW link

An open letter to SERI MATS program organisers

Roman Leventov · Apr 20, 2023, 4:34 PM
26 points
26 comments · 4 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points
0 comments · 6 min read · LW link

Game Theory without Argmax [Part 1]

Cleo Nardo · Nov 11, 2023, 3:59 PM
70 points
18 comments · 19 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · Nov 17, 2023, 1:54 PM
15 points
6 comments · 2 min read · LW link

Research agenda: Supervising AIs improving AIs

Apr 29, 2023, 5:09 PM
76 points
5 comments · 19 min read · LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

May 3, 2023, 1:30 PM
33 points
6 comments · 2 min read · LW link · 1 review
(arxiv.org)

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver Sourbut · May 9, 2022, 9:38 PM
70 points
19 comments · 8 min read · LW link · 1 review
(www.oliversourbut.net)

Some real examples of gradient hacking

Oliver Sourbut · Nov 22, 2021, 12:11 AM
15 points
8 comments · 2 min read · LW link

Some Summaries of Agent Foundations Work

mattmacdermott · May 15, 2023, 4:09 PM
62 points
1 comment · 13 min read · LW link

Boomerang—protocol to dissolve some commitment races

Filip Sondej · May 30, 2023, 4:21 PM
37 points
10 comments · 8 min read · LW link

Infra-Bayesian Logic

Jul 5, 2023, 7:16 PM
15 points
2 comments · 1 min read · LW link

Quantitative cruxes in Alignment

Martín Soto · Jul 2, 2023, 8:38 PM
19 points
0 comments · 23 min read · LW link

Sources of evidence in Alignment

Martín Soto · Jul 2, 2023, 8:38 PM
20 points
0 comments · 11 min read · LW link

Activation adding experiments with llama-7b

Nina Panickssery · Jul 16, 2023, 4:17 AM
51 points
1 comment · 3 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · Jul 17, 2023, 1:41 AM
57 points
1 comment · 7 min read · LW link

Activation adding experiments with FLAN-T5

Nina Panickssery · Jul 13, 2023, 11:32 PM
21 points
5 comments · 7 min read · LW link

Decoding intermediate activations in llama-2-7b

Nina Panickssery · Jul 21, 2023, 5:35 AM
39 points
3 comments · 4 min read · LW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire Short · Jul 29, 2023, 6:10 AM
26 points
0 comments · 12 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery · Jul 28, 2023, 2:46 AM
122 points
18 comments · 9 min read · LW link · 1 review

Decomposing independent generalizations in neural networks via Hessian analysis

Aug 14, 2023, 5:04 PM
83 points
4 comments · 1 min read · LW link

Understanding and visualizing sycophancy datasets

Nina Panickssery · Aug 16, 2023, 5:34 AM
45 points
0 comments · 6 min read · LW link

Large Language Models will be Great for Censorship

Ethan Edwards · Aug 21, 2023, 7:03 PM
183 points
14 comments · 8 min read · LW link
(ethanedwards.substack.com)

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

Aug 23, 2023, 9:12 PM
82 points
1 comment · 13 min read · LW link

Invulnerable Incomplete Preferences: A Formal Statement

SCP · Aug 30, 2023, 9:59 PM
134 points
39 comments · 35 min read · LW link

Red-teaming language models via activation engineering

Nina Panickssery · Aug 26, 2023, 5:52 AM
69 points
6 comments · 9 min read · LW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Aug 30, 2023, 5:36 PM
17 points
0 comments · 8 min read · LW link
(arxiv.org)

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Aug 29, 2023, 1:04 AM
77 points
4 comments · 1 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · Sep 23, 2023, 4:21 PM
30 points
8 comments · 5 min read · LW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

Sep 25, 2023, 5:19 PM
25 points
3 comments · 7 min read · LW link