AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
How much money would you guess was lost on this?
Yes.
Technically you didn’t specify that $c_i(x)$ can’t be an arbitrary function, so you’d be able to reconstruct activations combining different bases, but it’d be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn’t that we can’t make a dictionary that includes all the feature directions as dictionary elements. We can do that. For example, while we can’t write $\vec{a}(x)=\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}+\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$, because those sums each already equal $\vec{a}(x)$ on their own, we can write $\vec{a}(x)=\frac{1}{2}\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}+\frac{1}{2}\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$.
The problem is instead that we can’t make a dictionary that has the feature activations as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model’s own circuits actually care about. They cannot equal the ‘features of the model’ in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the elephant feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. The same goes the other way around: any circuit reading in even a single attribute feature will have edges connecting to all of the animal features[1], making up 50% of the total contribution. It’s the worst of both worlds. Every circuit looks like a mess now.
Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point. Whichever ones happen to be active at the time.
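If it helps to see the 50/50 split concretely, here’s a quick numerical sketch under the toy animals-and-attributes setup from the original post. The NumPy setup, seed, and numbers are purely illustrative assumptions, not anything from a real model:

```python
import numpy as np

rng = np.random.default_rng(4)
n_attr = 50
attr_dirs = np.eye(n_attr)                    # fifty orthonormal attribute directions
eleph_coeffs = rng.normal(size=n_attr)
eleph_coeffs /= np.linalg.norm(eleph_coeffs)
elephant = eleph_coeffs @ attr_dirs           # 'elephant' is a particular setting of the attributes

a = elephant                                  # data point: just an elephant, activation 1.0
size_dir = attr_dirs[0]                       # a circuit reads off the 'size' attribute
total = size_dir @ a                          # what the circuit actually reads in

# Half-size dictionary: coefficient 0.5 on 'elephant', 0.5 * (attribute values) on the attributes.
via_elephant = 0.5 * 1.0 * (size_dir @ elephant)
via_attributes = 0.5 * (attr_dirs @ a) @ (attr_dirs @ size_dir)
print(via_elephant / total, via_attributes / total)   # 0.5 0.5 -- the read-off edge gets split
```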
E.g. it’s not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.
You can’t represent elephants along with arbitrary combinations of attributes. You can’t do that in a scheme where feature directions are fully random with no geometry either, though. There, only a small number of features can have non-zero values at the same time, so you still only get about fifty non-zero attribute features at once, maximum.[1]
We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true “base units” that can vary freely.
You can call them the “base units” if you like. But that won’t change the fact that some directions in the space spanned by those “base units” are special, with associated circuits that care about those directions in particular, and understanding or even recognising those circuits in a causal graph made of the “base units” will be pretty darned hard. For the same reason trying to understand the network in the neuron basis is hard.
Put another way, there’s no way to represent an “elephant” in this scheme without also attaching attributes to it.
Yes.
Likewise, it’s not possible to differentiate between an elephant with the set of attributes x y and z and a rabbit with identical attributes x y and z, since the sum of attributes are what you’re calling an elephant or rabbit.
Not quite. You cannot specify a rabbit and simultaneously specify the rabbit having arbitrary numerical attribute values for attributes differing from normal rabbits. You can have a rabbit, and some attributes treated as sparse boolean-ish features at the same time. E.g. $\vec{a}(x)=\vec{f}_{\text{rabbit}}+\vec{f}_{\text{cute}}$ works. Circuits downstream that store facts about rabbits will still be triggered by this $\vec{a}(x)$. Circuits downstream that do something with the ‘cute’ attribute will be reading in a ‘cute’-attribute value of $1$ plus the ‘cute’-coefficient of rabbits.
A consequence of this is that ‘cute rabbit’ is a bit cuter than either ‘cute’ or ‘rabbit’ on their own. But that doesn’t seem particularly strange to me. Associations in my own mind sure seem to work like that.
Less, if you want to be able to perform computation in superposition.
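A minimal numerical sketch of the cute-rabbit point, under the same toy assumptions as above (the names, seed, and numbers are mine, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_attr = 50
attr_dirs = np.eye(n_attr)
cute = attr_dirs[0]                              # the 'cute' attribute direction

# 'rabbit' is a unit-norm combination of the fifty attributes.
rabbit_coeffs = rng.normal(size=n_attr)
rabbit_coeffs /= np.linalg.norm(rabbit_coeffs)
rabbit = rabbit_coeffs @ attr_dirs

# Activation: a rabbit, plus the 'cute' attribute switched on as a boolean-ish extra.
a = rabbit + cute

print(rabbit @ a)   # = 1 + rabbit's cute-coefficient: rabbit-fact circuits still fire
print(cute @ a)     # = 1 + rabbit's cute-coefficient: a cute rabbit reads as a bit cuter than either alone
```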
Similarly, for people wanting to argue from the other direction, who might think a low current valuation is case-closed evidence against their success chances
To be clear: I think the investors would be wrong to think that AGI/ASI soon-ish isn’t pretty likely.
OpenAI’s valuation is very much reliant on being on a path to AGI in the not-too-distant future.
Really? I’m mostly ignorant on such matters, but I’d thought that their valuation seemed comically low compared to what I’d expect if their investors thought that OpenAI was likely to create anything close to a general superhuman AI system in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.
Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plummet because giving people money to kill you with is dumb. But there’s this space in between total obliviousness and alarm, occupied by a few actually earnest AI optimists. And, it seems to me, not occupied by the big OpenAI investors.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features are sparse. The attribute features are not sparse.[1]
In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features as seen by linear probes and the network’s own circuits.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components?
No, that is not the idea.
Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
‘elephant’ would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of $\frac{1}{\sqrt{50}}$, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, ‘elephant’ and ‘tiny’ would be expected to have read-off interference on the order of $\frac{1}{\sqrt{50}}$. Alternatively, you could instead encode a new animal ‘tiny elephant’ as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for ‘tiny elephant’ is ‘exampledon’, and exampledons just happen to look like tiny elephants.
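A quick numeric check of the $\frac{1}{\sqrt{50}}$ estimate, under the toy setup assumed here (random unit-norm animal vectors over fifty orthonormal attributes; everything in the snippet is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n_attr, n_animal = 50, 1000
coeffs = rng.normal(size=(n_animal, n_attr))
coeffs /= np.linalg.norm(coeffs, axis=1, keepdims=True)      # animals as unit vectors over the attributes

tiny = np.eye(n_attr)[0]                                     # the 'tiny' attribute direction
interference = coeffs @ tiny                                 # <animal, tiny> for each animal
print(np.sqrt((interference ** 2).mean()), 1 / np.sqrt(50))  # both come out around 0.14
```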
E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme
It’s representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes
In what sense? If you represent the network computations in terms of the attribute features, you will get a very complicated computational graph with lots of interaction lines going all over the place. So clearly, the attributes on their own are not a very good basis for understanding the network.
Similarly, you can always represent any neural network in the standard basis of the network architecture. Trivially, all features can be seen as mere combinations of these architectural ‘base units’. But if you try to understand what the network is doing in terms of interactions in the standard basis, you won’t get very far.
For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
The ‘elephant’ feature in this setting is mostly-orthogonal to every other feature in the ontology, including the features that are attributes. So it can be read out with a linear projection. ‘elephant’ and ‘pink’ shouldn’t have substantially higher cosine similarity than ‘elephant’ and ‘parrot’.
you mean it does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Yes.
I don’t think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes’ sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very well on the loss, because the system realises that doing things that score well on the loss is instrumentally useful. Thus, while many aspects of how the system thinks are maybe nailed down quite definitively and robustly by the environment, what it wants does not seem nailed down in this same robust way. Desires thus seem like they can be very chaotically dependent on dynamics in early training, what the system reflected on when, which heuristics it learned in what order, and other low level details like this that are very hard to precisely control.
I feel like there is something here about our imaginations, or at least mine, privileging the hypothesis. When I imagine an AI trained to say things a human observer would rate as ‘nice’, and to not say things a human observer rates as ‘not nice’, my imagination finds it natural to suppose that this AI will generalise to wanting to be a nice person. But when I imagine an AI trained to respond in English, rather than French or some other language, I do not jump to supposing that this AI will generalise to terminally valuing the English language.
Every training signal we expose the AI to reinforces very many behaviours at the same time. The human raters that may think they are training the AI to be nice are also training it to respond in English (because the raters speak English), to respond to queries at all instead of ignoring them, to respond in English that is grammatically correct enough to be understandable, and a bunch of other things. The AI is learning things related to ‘niceness’, ‘English grammar’ and ‘responsiveness’ all at the same time. Why would it generalise in a way that entangles its values with one of these concepts, but not the others?
What makes us single out the circuits responsible for giving nice answers to queries as special, as likely to be part of the circuit ensemble that will cohere into the AI’s desires when it is smarter? Why not circuits for grammar or circuits for writing in the style of 1840s poets or circuits for research taste in geology?
We may instinctively think of our constitution that specifies niceness as equivalent to some sort of monosemantic niceness-reinforcing training signal. But it really isn’t. The concept of niceness sticks out to us when we look at the text of the constitution, because the presence of that concept is a thing that makes this text different from a generic text. But the constitution, and even more so any training signal based on the constitution, will by necessity be entangled with many concepts besides just niceness, and the training will reinforce those concepts as well. Why then suppose that the AI’s nascent shards of value are latching on to niceness, but are not in the same way latching on to all the other stuff its many training signals are entangled with?
It seems to me that there is no good reason to suppose this. Niceness is part of my values, so when I see it in the training signal I find it natural to imagine that the AI’s values would latch on to it. But I do not as readily register all the other concepts in the training signal the AI’s values might latch on to, because to my brain that does not value these things, they do not seem value-related.
There is something here about phase changes under reflection. If the AI gets to the point of thinking about itself and its own desires, the many shards of value it may have accumulated up to this point are going to amalgamate into something that may be related to each of the shards, but not necessarily in a straightforwardly human-intuitive way. For example, sometimes humans that have value shards related to empathy reflect on themselves, and emerge being negative utilitarians that want to kill everyone. For another example, sometimes humans reflect on themselves and seem to decide that they don’t like the goals they have been working towards, and they’d rather work towards different goals and be different people. There, the relationship between values pre-reflection and post-reflection can be so complicated that it can seem to an outside observer and the person themselves like they just switched values non-deterministically, by a magical act of free will. So it’s not enough to get some value shards that are kind of vaguely related to human values into the AI early in training. You may need to get many or all of the shards to be more than just vaguely right, and you need the reflection process to proceed in just the right way.
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can’t get the dictionary activations to equal the feature activations $c_{a_i}(x)$, $c_{b_j}(x)$.
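For anyone who wants to try it out concretely, here’s a minimal NumPy sketch assuming the toy setup from the original post (orthonormal attributes, animals as unit-norm combinations of them; all specifics are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_attr, n_animal = 50, 50, 1000

attr_dirs = np.eye(d)                                    # fifty orthonormal attribute directions
animal_coeffs = rng.normal(size=(n_animal, n_attr))
animal_coeffs /= np.linalg.norm(animal_coeffs, axis=1, keepdims=True)
animal_dirs = animal_coeffs @ attr_dirs                  # each animal = a setting of the attributes

# Data point: only 'elephant' (index 0) is active, with feature activation 1.0.
c_animal = np.zeros(n_animal); c_animal[0] = 1.0
a = c_animal @ animal_dirs                               # the model's activation vector

# The attribute features are genuinely active too: their values are elephant's coefficients.
c_attr = attr_dirs @ a                                   # exact read-off, since the attributes are orthonormal

# A 1050-element dictionary whose coefficients equal the true feature activations
# reconstructs 2*a(x), not a(x):
recon = c_animal @ animal_dirs + c_attr @ attr_dirs
print(np.allclose(recon, 2 * a))                         # True

# The fix discussed above: halve every coefficient. The reconstruction is now right,
# but the dictionary activations no longer equal the feature activations.
recon_half = 0.5 * c_animal @ animal_dirs + 0.5 * c_attr @ attr_dirs
print(np.allclose(recon_half, a))                        # True
```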
I did not know about this already.
For the same reasons training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x.
If you think that doing this does produce an agent that cares about x even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
Features are one-dimensional variables.
Meaning, the value of feature $i$ on data point $x$ can be represented by some scalar number $c_i(x)$.
Features are ‘linearly represented’. Meaning, the value of feature $i$ can be recovered from the model’s activation vector $\vec{a}(x)$ with a linear projection[1]: $c_i(x)\approx\vec{f}_i\cdot\vec{a}(x)$, where $\vec{f}_i$ is the feature’s associated direction[2] in activation space.
Features form a ‘basis’ for activation space.[3]
Meaning, the model’s activations at a given layer can be decomposed into a sum over all the features of the model represented in that layer[4]: $\vec{a}(x)=\sum_i c_i(x)\,\vec{f}_i$.
It seems to me that a lot of people are not tracking that 3) is an extra assumption they are making. I think they think that assumption 3) is a natural consequence of assumptions 1) and 2), or even just of assumption 2) alone. It’s not.
Suppose we have a language model that has a thousand sparsely activating, scalar, linearly represented features for different animals. So, “elephant”, “giraffe”, “parrot”, and so on, all with their own associated feature directions $\vec{f}_{a_1},\dots,\vec{f}_{a_{1000}}$. The model embeds those one thousand animal features in a fifty-dimensional sub-space of the activations. This subspace has a meaningful geometry: It is spanned by a set of fifty directions $\vec{f}_{b_1},\dots,\vec{f}_{b_{50}}$, corresponding to different attributes animals have. Things like “furriness”, “size”, “length of tail” and such. So, each animal feature can equivalently be seen as either one of a thousand sparsely activating scalar features, or just as a particular setting of those fifty not-so-sparse scalar attributes.
Some circuits in the model act on the animal directions $\vec{f}_{a_i}$. E.g. they have query-key lookups for various facts about elephants and parrots. Other circuits in the model act on the attribute directions $\vec{f}_{b_j}$. They’re involved in implementing logic like ‘if there’s a furry animal in the room, people with allergies might have problems’. Sometimes they’re involved in circuits that have nothing to do with animals whatsoever. The model’s “size” attribute is the same one used for houses and economies for example, so that direction might be read-in to a circuit storing some fact about economic growth.
So, both the one thousand animal features and the fifty attribute features are elements of the model’s ontology, variables along which small parts of its cognition are structured. But we can’t make a basis for the model activations out of those one thousand and fifty features of the model. We can write either $\vec{a}(x)=\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}$, or $\vec{a}(x)=\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$. But $\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}+\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$ does not equal the model activation vector $\vec{a}(x)$, it’s too large.
Say we choose the one thousand animal features $\{\vec{f}_{a_i}\}$ as our basis for this subspace of the example model’s activations, and then go on to make a causal graph of the model’s computation, with each basis element being a node in the graph, and lines between nodes representing connections. Then the circuits dealing with query-key lookups for animal facts will look neat and understandable at a glance, with few connections and clear logic. But the circuits involving the attributes will look like a mess. A circuit reading in the size direction will have a thousand small but collectively significant connections to all of the animals.
If we choose the fifty attribute features $\{\vec{f}_{b_j}\}$ as our basis for the graph instead, circuits that act on some of the fifty attributes will look simple and sensible, but now the circuits storing animal facts will look like a mess. A circuit implementing “space” AND “cat” ⇒ [increase association with rainbows] is going to have fifty connections to features like “size” and “furriness”.
The model’s ontology does not correspond to either the animal basis $\{\vec{f}_{a_i}\}$ or the attribute basis $\{\vec{f}_{b_j}\}$. It just does not correspond to any basis of activation space at all, not even in a loose sense. Different circuits in the model can just process the activations in different bases, and they are under no obligation to agree with each other. Not even if they are situated right next to each other, in the same model layer.
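To make the “mess” claim a bit more concrete, here is a small sketch under the same toy assumptions, treating the potential edge weight from a dictionary element to a circuit as the dot product between that element and the circuit’s read-off direction (all names, thresholds, and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_attr, n_animal = 50, 1000
attr_dirs = np.eye(n_attr)
animal_coeffs = rng.normal(size=(n_animal, n_attr))
animal_coeffs /= np.linalg.norm(animal_coeffs, axis=1, keepdims=True)
animal_dirs = animal_coeffs @ attr_dirs

size_dir = attr_dirs[0]            # a circuit reading the 'size' attribute
elephant_dir = animal_dirs[0]      # a circuit doing a lookup on the 'elephant' feature

# Count how many dictionary elements have a non-negligible edge to each circuit.
print(np.sum(np.abs(animal_dirs @ size_dir) > 0.01))    # most of the 1000 animals: 'size' is a mess in the animal basis
print(np.sum(np.abs(attr_dirs @ size_dir) > 0.01))      # exactly 1: a single clean edge in the attribute basis
print(np.sum(np.abs(attr_dirs @ elephant_dir) > 0.01))  # most of the 50 attributes: the elephant lookup is a mess there
```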
Note that for all of this, we have not broken assumption 1) or assumption 2). The features this model makes use of are all linearly represented and scalar. We also haven’t broken the secret assumption 0) I left out at the start, that the model can be meaningfully said to have an ontology comprised of elementary features at all.
I’ve seen people call out assumptions 1) and 2), and at least think about how we can test whether they hold, and how we might need to adjust our interpretability techniques if and when they don’t hold. I have not seen people do this for assumption 3). Though I might just have missed it, of course.
My current dumb guess is that assumption 2) is mostly correct, but assumptions 1) and 3) are both incorrect.
The reason I think assumption 3) is incorrect is that the counterexample I sketched here seems to me like it’d be very common. LLMs seem to be made of lots of circuits. Why would these circuits all share a basis? They don’t seem to me to have much reason to.
I think a way we might find the model’s features without assumption 3) is to focus on the circuits and computations first. Try to directly decompose the model weights or layer transitions into separate, simple circuits, then infer the model’s features from looking at the directions those circuits read and write to. In the counterexample above, this would have shown us both the animal features and the attribute features.
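As a very rough sketch of what ‘circuits first’ could look like in the simplest imaginable case: if a layer transition were literally a sum of a few rank-1 circuits, then decomposing the weights (here just an SVD, purely for illustration; real networks would need something much better) recovers the read and write directions, i.e. candidate features, without ever assuming those directions form a dictionary for the activations. Everything in this snippet is an assumption of the toy, not a claim about any existing method:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50

# Suppose one layer transition contains two simple circuits, each a rank-1 map:
# it reads one direction and writes to another.
read_1, write_1 = rng.normal(size=d), rng.normal(size=d)
read_2, write_2 = rng.normal(size=d), rng.normal(size=d)
W = 2.0 * np.outer(write_1, read_1) + 1.0 * np.outer(write_2, read_2)

# Decomposing the weights recovers the directions the circuits read and write to.
U, S, Vt = np.linalg.svd(W)
print(S[:3])                                            # two large singular values, the rest ~0
print(np.abs(Vt[0] @ read_1) / np.linalg.norm(read_1))  # ~1: top right-singular vector aligns with the first read direction
```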
Potentially up to some small noise. For a nice operationalisation, see definition 2 on page 3 of this paper.
It’s a vector because we’ve already assumed that features are all scalar. If a feature was two-dimensional instead, this would be a projection into an associated two-dimensional subspace.
I’m using the term basis loosely here, this also includes sparse overcomplete ‘bases’ like those in SAEs. The more accurate term would probably be ‘dictionary’, or ‘frame’.
Or if the computation isn’t layer aligned, the activations along some other causal cut through the network can be written as a sum of all the features represented on that cut.
I think the value proposition of AI 2027-style work lies largely in communication. Concreteness helps people understand things better. The details are mostly there to provide that concreteness, not to actually be correct.
If you imagine the set of possible futures that people like Daniel, you or I think plausible as big distributions, with high entropy and lots of unknown latent variables, the point is that the best way to start explaining those distributions to people outside the community is to draw a sample from them and write it up. This is a lot of work, but it really does seem to help. My experience matches habryka’s here. Most people really want to hear concrete end-to-end scenarios, not abstract discussion of the latent variables in my model and their relationships.
The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
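For concreteness, the bound I have in mind is of roughly this form (stated from memory, so exact constants and conditions may differ; $\mu$ is the true computable environment and $M$ the inductor’s predictive distribution):

$$\sum_{t=1}^{\infty}\;\mathbb{E}_{x_{<t}\sim\mu}\Big[D_{\mathrm{KL}}\big(\mu(\cdot\mid x_{<t})\,\big\|\,M(\cdot\mid x_{<t})\big)\Big]\;\le\;K(\mu)\ln 2$$

i.e. the total expected excess CE loss over the true environment, summed over all datapoints, is at most the description length of the shortest program for $\mu$ within the program class the inductor runs over (the cut-off class, in the cut-off case).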
Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.
Can also discuss on a call if you like.
Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.
You also want one that generalises well, and doesn’t do performative predictions, and doesn’t have goals of its own. If your hypotheses aren’t even intended to be reflections of reality, how do we know these properties hold?
Because we have the prediction error bounds.
When we compare theories, we don’t consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn’t quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn’t involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
Yes.
That’s fine. I just want a computable predictor that works well. This one does.
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.
Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. A general inductor program running program $p$ is pretty much never going to be the shortest implementation of $p$.
The kind of ‘alignment technique’ that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of ‘alignment technique’ that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.
For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don’t get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.