neverix

Karma: 146

Evolutionary prompt optimization for SAE feature visualization

neverix, Daniel Tan, Dmitrii Kharlapenko, Neel Nanda and Arthur Conmy

Nov 14, 2024, 1:06 PM

21 points

0 comments · 9 min read · LW link

neverix Oct 16, 2024, 5:48 PM
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: SAE features for refusal and sycophancy steering vectors
1. It doesn’t really make sense to interpret feature activation values as log probabilities. If we did, we’d have to worry about scaling. It’s also not guaranteed the score wouldn’t just decrease because of decreased accuracy on correct answers.
2. Phi seems specialized for MMLU-like problems and has an outsized score for a model of its size, so I would be surprised if it were biased by the format of the question. However, using answer text instead of letters might improve raw accuracy in this case: the feature we used (45142) seems to max-activate on plain text rather than on multiple-choice answers, so it is somewhat surprising that it does this well on multiple-choice questions. For this reason, I think using text answers would boost the feature’s score but would not change the relative ranking of features by accuracy. I don’t know how to adapt your proposed experiment to the task of measuring how accurately a feature elicits the model’s knowledge of correct answers.
3. This is a technical detail. We use jax.lax.scan over the layer index and store all layers in a single structure with a special named axis. This mainly improves compilation time. Penzai has since implemented the technique in the main library.
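A minimal sketch of the scan-over-layers pattern from point 3. This is not the code used for the post and not Penzai's implementation: the toy residual block, the parameter names (`w`, `b`) and the helper names are hypothetical, and the "named axis" is approximated here by the leading axis of each stacked array.

```python
import jax
import jax.numpy as jnp


def init_stacked_layers(key, n_layers, d_model):
    # All layers live in one pytree; the leading axis of every array
    # is the layer index (a stand-in for a named "layer" axis).
    return {
        "w": jax.random.normal(key, (n_layers, d_model, d_model)) / jnp.sqrt(d_model),
        "b": jnp.zeros((n_layers, d_model)),
    }


def layer_fn(x, layer):
    # One toy residual block; `layer` is a single slice of the stacked params.
    return x + jnp.tanh(x @ layer["w"] + layer["b"])


def forward_scan(params, x):
    # jax.lax.scan slices the leading axis of `params` and threads `x`
    # through as the carry, so the layer body is traced only once.
    def step(carry, layer):
        return layer_fn(carry, layer), None

    out, _ = jax.lax.scan(step, x, params)
    return out


def forward_loop(params, x):
    # Reference version: a Python loop that unrolls every layer when traced.
    n_layers = params["w"].shape[0]
    for i in range(n_layers):
        layer = jax.tree_util.tree_map(lambda p: p[i], params)
        x = layer_fn(x, layer)
    return x
```

Both functions compute the same output. The difference is that under `jax.jit` the loop version emits one copy of the layer per iteration, while the scan version emits the layer body once, which is where the compilation-time saving would come from.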

SAE features for refusal and sycophancy steering vectors

neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda

Oct 12, 2024, 2:54 PM

29 points

4 comments · 7 min read · LW link

neverix Sep 8, 2024, 5:00 PM
8 points
2
in reply to: eggsyntax’s comment on: Self-explaining SAE features
- We use our own judgement as a (potentially very inaccurate) proxy for how accurate an explanation is, and let readers look at the feature dashboard interface themselves. We judge using a random sample of examples at different levels of activation. We had an automatic interpretation scoring pipeline that used Llama 3 70B, but we did not use it because (IIRC) it was too slow to run with multiple explanations per feature. Perhaps it is now practical to use a method like this.
- That is a pattern that happens frequently, but we’re not confident enough to propose any particular form. It is sometimes thrown off by random spikes, self-similarity gradually rising at larger scales, or entropy peaking at the beginning. Because of this, there is still a lot of room for improvement in cases where a human (or maybe a peak-finding algorithm) could do better than our linear metric.

Extracting SAE task features for in-context learning

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

Aug 12, 2024, 8:34 PM

31 points

1 comment · 9 min read · LW link

neverix Aug 11, 2024, 7:25 PM
5 points
2
on: You should go to ML conferences
- Genie: Generative Interactive Environments Bruce et al.
How is that paper alignment-relevant?

Self-explaining SAE features

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

Aug 5, 2024, 10:20 PM

60 points

13 comments · 10 min read · LW link

neverix Jun 15, 2024, 3:12 AM
3 points
0
on: Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.
Freshman’s dream sparsity loss
A similar regularizer is known as Hoyer-Square.
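For reference, and as far as I recall from the DeepHoyer paper (Yang et al., 2020), the Hoyer-Square measure of a vector $x$ with $n$ entries is the squared ratio of its $\ell_1$ and $\ell_2$ norms:

$$H_S(x) = \frac{\left(\sum_{i=1}^{n} |x_i|\right)^2}{\sum_{i=1}^{n} x_i^2}$$

It is scale-invariant and ranges from $1$ (a single nonzero entry) to $n$ (all entries equal in magnitude), so minimizing it pushes toward sparse vectors. How closely this matches the loss in the post is my reading, not something I have checked term by term.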
Pick a value for $k$ and a small $ϵ \geq 0$. Then define the activation function $T_{k, ϵ}$ in the following way. Given a vector $x$, let $b$ be the value of the $k$th-largest entry in $x$. Then define the vector $T_{k, ϵ}(x)$ by
Is $a$ in the following formula a typo?

neverix Nov 23, 2023, 10:34 PM
1 point
0
in reply to: Neel Nanda’s comment on: 200 COP in MI: Exploring Polysemanticity and Superposition
To clarify, I thought it was about superposition happening inside the projection afterwards.

neverix Nov 21, 2023, 10:30 PM
1 point
0
on: 200 COP in MI: Exploring Polysemanticity and Superposition
This happens in transformer MLP layers. Note that the hidden dimen
Is the point that transformer MLPs blow up the hidden dimension in the middle?

neverix Aug 31, 2023, 7:41 PM
1 point
0
on: Steering GPT-2-XL by adding an activation vector
Activation additions in generative models
Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion to edit images using CLIP.

neverix Aug 24, 2023, 6:18 PM
3 points
0
on: The Low-Hanging Fruit Prior and sloped valleys in the loss landscape
From a conversation on Discord:
Do you have in mind a way to weigh sequential learning into the actual prior?
Dmitry:
good question! We haven’t thought about an explicit complexity measure that would give this prior, but a very loose approximation that we’ve been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the “BIMT” weight penalty from this paper https://arxiv.org/abs/2305.08746 (which they show encourages modularity at least in toy models)
Response:
Hmm, BIMT seems to only be about intra-layer locality. It would certainly encourage learning an ensemble of features, but I’m not sure if it would capture the interesting bit, which I think is the fact that features are built up sequentially from earlier to later layers and changes are only accepted if they improve local loss.
I’m thinking about something like the existence of a relatively smooth scaling law (?) as the criterion.
So, just some smoothness constraint that would basically integrate over paths SGD could take.

neverix Aug 2, 2023, 1:26 PM
1 point
0
on: Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program
We can idealize the outer alignment solution as a logical inductor.
Why outer?

neverix Aug 1, 2023, 8:20 AM
1 point
0
on: The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited
You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.
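A toy sketch of what that could look like, assuming a linear model with squared error as a stand-in for the LLM and its training loss. Every name here (`loss_fn`, `rank_by_gradient_similarity`, the data arrays) is hypothetical, and a real run would need per-sample gradients of a full model, which is far more expensive.

```python
import jax
import jax.numpy as jnp


def loss_fn(w, x, y):
    # Toy linear model with squared error; stands in for the LLM's loss.
    return (jnp.dot(x, w) - y) ** 2


def sample_grad(w, x, y):
    # Gradient of the loss on a single example with respect to the weights.
    return jax.grad(loss_fn)(w, x, y)


def rank_by_gradient_similarity(w, corpus_x, corpus_y, task_x, task_y):
    # Per-example gradients, batched over the leading axis with vmap.
    per_example = jax.vmap(sample_grad, in_axes=(None, 0, 0))
    task_grad = per_example(w, task_x, task_y).mean(axis=0)
    corpus_grads = per_example(w, corpus_x, corpus_y)

    # Cosine similarity between each corpus gradient and the task gradient.
    eps = 1e-8
    norms = jnp.linalg.norm(corpus_grads, axis=1) * jnp.linalg.norm(task_grad)
    sims = corpus_grads @ task_grad / (norms + eps)

    # Indices of corpus samples, most task-like gradient first.
    return sims, jnp.argsort(-sims)
```

Corpus samples at the top of the ranking are the ones whose gradient update points in roughly the same direction as training on the task, which is the sense of "similar" assumed here.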

neverix Jun 6, 2023, 3:56 PM
1 point
0
on: Hessian and Basin volume
There are also somewhat principled reasons for using a “fuzzy ellipsoid”, which I won’t explain here.
If you view $T$ as 2x learning rate, the ellipsoid contains parameters which will jump straight into the basin under the quadratic approximation, and we assume for points outside the basin the approximation breaks entirely. If you account for gradient noise ~~in the form of a Gaussian with sigma equal to gradient, the PDF of the resulting point at the basin is equal to the probability a Gaussian parametrized by the ellipsoid at the preceding point.~~ This is wrong, but there is an interpretation of the noise as a Gaussian with variance increasing away from the basin origin.

neverix Apr 20, 2023, 3:50 PM
4 points
0
in reply to: Vika’s comment on: Power-seeking can be probable and predictive for trained agents
Seems like quoting doesn’t work for LaTeX; it was definitions 2/3. Reading again, I saw D2 was indeed applicable to sets.

neverix Apr 19, 2023, 8:42 PM
2 points
0
AF
on: Power-seeking can be probable and predictive for trained agents
A0>A1
How is orbit comparison for sets defined?

neverix Feb 23, 2023, 1:39 PM
1 point
on: The Credit Assignment Problem
course
coarse?

neverix Feb 22, 2023, 5:04 PM
2 points
1
on: Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities?
This is the whole point of goal misgeneralization. They have experiments (albeit on toy environments that can be explained by the network finding the wrong algorithm), so I’d say quite plausible.

neverix Feb 20, 2023, 10:26 AM
1 point
0
on: PSA about differential technological development
sidereal
typo?
