Pierre Peigné

Karma: 119

Investing in Robust Safety Mechanisms is critical for reducing Systemic Risks

Tom DAVID, Pierre Peigné, Quentin FEUILLADE--MONTIXI, Kay Kozaronek and Miailhe Nicolas

Dec 11, 2024, 1:37 PM

8 points

3 comments · 2 min read · LW link

Workshop Report: Why current benchmarks approaches are not sufficient for safety?

Tom DAVID and Pierre Peigné

Nov 26, 2024, 5:20 PM

3 points

1 comment · 3 min read · LW link

The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs

Quentin FEUILLADE--MONTIXI and Pierre Peigné

Nov 7, 2023, 4:12 PM

52 points

21 comments · 6 min read · LW link

Pierre Peigné Sep 24, 2023, 9:57 AM
1 point
0
in reply to: Logan Riggs’s comment on: Taking features out of superposition with sparse autoencoders more quickly with informed initialization
Thanks Logan,
1) About re-initialization:
I think your idea of re-initializing dead features of the sparse dictionary with the input data the model struggles to reconstruct could work. It seems like a great idea!
This probably implies extracting rare feature vectors from such datapoints before using them for initialization.
I intuitively suspect that the datapoints the model is bad at reconstructing contain rare features, and potentially rare features that several of those datapoints share. Therefore I would bet on performing some rare feature extraction on batches of poorly reconstructed input data, instead of directly using the single datapoint with the worst reconstruction loss. (But maybe this is what you already had in mind?)
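
The re-initialization idea in point 1 can be sketched roughly as below. This is only an illustration under stated assumptions, not the method from the post: the function name `reinit_dead_features`, its arguments, and the `n_worst` parameter are hypothetical, and the "rare feature extraction" step is stood in for by taking the top singular directions of the residuals of the worst-reconstructed batch, which is just one possible choice.

```python
import numpy as np

def reinit_dead_features(W_dec, X, X_hat, activations, n_worst=256):
    """Re-initialize dead dictionary rows from poorly reconstructed inputs.

    W_dec:       (n_features, d) dictionary / decoder matrix
    X, X_hat:    (n, d) inputs and their reconstructions
    activations: (n, n_features) feature activations on X
    n_worst:     hypothetical size of the "poorly reconstructed" batch
    """
    # A feature is treated as dead if it never activates on this data.
    dead = np.flatnonzero(np.abs(activations).sum(axis=0) == 0)
    if dead.size == 0:
        return W_dec.copy(), dead

    # Select the batch of datapoints with the highest reconstruction loss.
    residual = X - X_hat
    loss = (residual ** 2).sum(axis=1)
    worst = np.argsort(loss)[-min(n_worst, len(X)):]

    # Assumption: use the dominant directions of the residuals as a stand-in
    # for "rare feature extraction" (rows of Vt are unit-norm directions).
    _, _, Vt = np.linalg.svd(residual[worst], full_matrices=False)

    k = min(dead.size, Vt.shape[0])
    W_new = W_dec.copy()
    W_new[dead[:k]] = Vt[:k]
    return W_new, dead[:k]
```

Using a whole batch of residuals rather than the single worst datapoint is meant to reflect the bet above: directions shared by many poorly reconstructed inputs are more likely to be real features than noise from one outlier.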

2) About not being compute bottlenecked:
I am a bit cautious about how well sparse autoencoder methods would scale to very high dimensionality. If the “scaling factor” estimated (with very low confidence) in the original work is correct, then compute could become a bottleneck.

“Here we found very weak, tentative evidence that, for a model of size $d_{\text{model}} = 256$, the number of features in superposition was over $100{,}000$. This is a large scaling factor, and it’s only a lower bound. If the estimated scaling factor is approximately correct (and, we emphasize, we’re not at all confident in that result yet) or if it gets larger, then this method of feature extraction is going to be very costly to scale to the largest models – possibly more costly than training the models themselves.”

However:
- we need more evidence of this (or maybe I have missed an important update about this!)
- maybe I’m asking too much of it: my concerns about scaling relate to being able to recover most of the superposed features, but improving our understanding, even if it is not complete, is already a victory.
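
To make the scaling concern in point 2 concrete, here is a back-of-the-envelope sketch. It assumes the quoted figure (over 100,000 features at a model dimension of 256) extrapolates linearly with model dimension, which the quote itself does not claim, and it counts only an untied encoder and decoder matrix without biases. The function name and the example dimension of 4096 are hypothetical.

```python
def estimated_dictionary_cost(d_model, features_per_dim=100_000 / 256):
    """Rough dictionary size under an assumed linear scaling factor.

    features_per_dim: features in superposition per model dimension,
    taken from the single (low-confidence, lower-bound) data point quoted above.
    """
    n_features = int(features_per_dim * d_model)
    # Assumption: untied encoder and decoder, each of shape (d_model, n_features).
    n_params = 2 * d_model * n_features
    return n_features, n_params

# The quoted data point, then a hypothetical larger model dimension.
print(estimated_dictionary_cost(256))   # (100000, 51200000)
print(estimated_dictionary_cost(4096))  # (1600000, 13107200000)
```

Under these assumptions the parameter count grows quadratically with the model dimension, which is why compute could start to matter even if it is not the bottleneck today.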

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné

Sep 23, 2023, 4:21 PM

30 points

8 comments · 5 min read · LW link

Clarifying mesa-optimization

Marius Hobbhahn and Pierre Peigné

Mar 21, 2023, 3:53 PM

38 points

6 comments · 10 min read · LW link

Pierre Peigné's Shortform

Pierre Peigné

Feb 4, 2023, 3:22 AM

1 point

1 comment LW link

Pierre Peigné Feb 4, 2023, 1:41 AM
6 points
3
on: Pierre Peigné's Shortform
Polysemanticity is bad
