Adam Newgas
An Introduction to SAEs and their Variants for Mech Interp
Yes, I don’t think the exact distribution of weights (Gaussian/uniform/binary) makes that much difference; you can see the difference in loss in some of the charts above. The extra efficiency probably comes from the fact that every neuron contributes to everything fully, whereas with Gaussian weights some of them will be close to zero (there’s a rough numerical sketch after the list below).
Some other advantages:
* They are also somewhat easier to analyse than Gaussian weights.
* They can be skewed, which seems advantageous for reasons that aren’t clear; possibly it makes AND circuits better at the expense of other possible truth tables.
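For a rough sense of the binary-vs-Gaussian difference, here is a sketch I made up for illustration (the feature count, dimension and near-zero threshold are arbitrary choices of mine): binary ±1 directions all have exactly unit norm and no near-zero weights, whereas Gaussian directions waste some of their weight budget, while the typical interference between random feature directions comes out about the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 200, 50

# Gaussian feature directions: some individual weights land near zero,
# so those neurons barely contribute to those features.
W_gauss = rng.normal(size=(n_features, d)) / np.sqrt(d)

# Binary +/-1 directions: every neuron contributes at full magnitude.
W_bin = rng.choice([-1.0, 1.0], size=(n_features, d)) / np.sqrt(d)

for name, W in [("gaussian", W_gauss), ("binary", W_bin)]:
    norms = np.linalg.norm(W, axis=1)              # binary norms are exactly 1
    overlaps = W @ W.T
    np.fill_diagonal(overlaps, 0.0)                # ignore self-overlap
    near_zero = (np.abs(W) < 0.1 / np.sqrt(d)).mean()
    print(f"{name:8s}  norm std={norms.std():.3f}  "
          f"near-zero weight frac={near_zero:.2f}  "
          f"mean |interference|={np.abs(overlaps).mean():.3f}")
```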
I’ve come up with my own explanation for why this happens: https://www.lesswrong.com/posts/QpbdkECXAdLFThhGg/computational-superposition-in-a-toy-model-of-the-u-and#XOR_Circuits
In short, XOR representations are naturally learnt when a model is targeting some other boolean operation, as the same circuitry makes all boolean operations linearly representable. But XOR requires different weights from identity operations, so linear probes will still tend to learn generalizable solutions.
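To make the first point concrete, here is a minimal toy example of my own (not code from the post): hand-wire a single ReLU neuron to compute AND, and XOR becomes exactly linearly readable from the resulting activations, since XOR(a, b) = a + b - 2*AND(a, b).

```python
import numpy as np

# All four boolean input combinations.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
a, b = X[:, 0], X[:, 1]

# A single hand-wired ReLU neuron computing AND(a, b).
h = np.maximum(a + b - 1.0, 0.0)

# Activations the model exposes: the raw inputs plus the AND neuron.
acts = np.column_stack([a, b, h])

# XOR is now linearly readable: XOR(a, b) = a + b - 2*AND(a, b).
xor = np.logical_xor(a, b).astype(float)
w, *_ = np.linalg.lstsq(acts, xor, rcond=None)
print("probe weights:", w.round(3))           # ~[ 1.  1. -2.]
print("probe output :", (acts @ w).round(3))  # matches XOR exactly
```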
Computational Superposition in a Toy Model of the U-AND Problem
> Set , ie.
It looks like some parentheses are missing here?
London rationalish meetup
Do No Harm? Navigating and Nudging AI Moral Choices
My Mental Model of AI Creativity – Creativity Kiki
An Uncanny Moat
London Rationalish Meetup
London Rationalish Meetup
Interested, but I can’t make the proposed dates this week.
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
Yes, the combination “grey + big + mammal + …” is special in some sense. If the model needed to recall that elephants are afraid of mice, the circuit would appear to check “grey and big and mammal”, and that’s an annoying mouthful that would be repeated all over the model. But it’s a faithful representation of what’s going on.
Let me be precise about what I mean by “has no fundamental notion of an elephant”. Suppose I tried to fine-tune the model to represent some new fact about animals, say, whether they are worth a lot of points in Scrabble. One way the model could do this is by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …”, as the fine-tuning dataset included no facts about animal fears, while some newer circuits formed during fine-tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
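As a toy picture of what I mean (everything here, the attribute directions, dimensions and readouts, is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# Hypothetical attribute directions the model already represents.
attrs = {name: rng.normal(size=d) / np.sqrt(d)
         for name in ["grey", "big", "mammal"]}

# Attribute-sum representation: "elephant" is nothing but the sum of its attributes.
elephant = sum(attrs.values())

# "Afraid of mice" circuit: a readout wired to grey + big + mammal,
# because that bundle is all the representation offers.
afraid_of_mice = attrs["grey"] + attrs["big"] + attrs["mammal"]

# Fine-tuning squeezes one more attribute direction into the space;
# the existing directions (and hence the existing circuit) barely move.
attrs["high_scrabble_score"] = rng.normal(size=d) / np.sqrt(d)
elephant_after = sum(attrs.values())

# A newer circuit formed during fine-tuning checks for all four attributes.
new_circuit = sum(attrs.values())

print("old circuit, before fine-tuning:", round(float(afraid_of_mice @ elephant), 2))
print("old circuit, after fine-tuning :", round(float(afraid_of_mice @ elephant_after), 2))
print("new circuit, after fine-tuning :", round(float(new_circuit @ elephant_after), 2))
# The old readout is essentially unchanged, but old and new circuits now check
# different attribute bundles, and neither contains an "elephant" direction an
# interpretability tool could point at.
```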
Meanwhile, consider a “normal” model that has a residual notion of an elephant after you take away all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk), and one could meaningfully say both circuits have something in common.
Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature in thought is forming concepts of things that are independent of the properties we learnt about them.