AI notkilleveryoneism researcher at Apollo, focused on interpretability.
Lucius Bushnaq
Thank you, I’ve been hoping someone would write this disclaimer post.
I’d add on another possible explanation for polysemanticity, which is that the model might be thinking in a limited number of linearly represented concepts, but those concepts need not match onto concepts humans are already familiar with. At least not all of them.
Just because the simple meaning of a direction doesn’t jump out at an interp researcher when they look at a couple of activating dataset examples doesn’t mean it doesn’t have one. Humans probably wouldn’t even always recognise the concepts other humans think in on sight.
Imagine a researcher who hasn’t studied thermodynamics much looking at a direction in a model that tracks the estimated entropy of a thermodynamic system it’s monitoring: ‘It seems to sort of activate more when the system is warmer. But that’s not all it’s doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.’
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
I don’t think these conditions are particularly weak at all. Any prior that fulfils it is a prior that would not be normalised right if the parameter-function map were one-to-one.
It’s a kind of prior people like to use a lot, but that doesn’t make it a sane choice.
A well-normalised prior for a regular model probably doesn’t look very continuous or differentiable in this setting, I’d guess.
To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The generic symmetries are not what I’m talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.
The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics. It is dealt with in Watanabe’s second ‘green’ book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
I know it ‘deals with’ unrealizability in this sense, that’s not what I meant.
I’m not talking about the problem of characterising the posterior right when the true model is unrealizable. I’m talking about the problem where the actual logical statement we defined our prior and thus our free energy relative to is an insane statement to make and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class.
But looking at the green book, I see it’s actually making very different, stat-mech style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I’m going to have to translate more of this into Bayes to know what I think of it.
The RLCT = first-order term for in-distribution generalization error
Clarification: The ‘derivation’ for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the one the derivation of the vanilla Bayesian Information Criterion uses. I don’t like this derivation very much. See e.g. this one on Wikipedia.
So what it’s actually showing is just that:
If you’ve got a class of different hypotheses M, containing many individual hypotheses .
And you’ve got a prior ahead of time that says the chance any one of the hypotheses in M is true is some number ., let’s say it’s as an example.
And you distribute this total probability around the different hypotheses in an even-ish way, so , roughly.
And then you encounter a bunch of data (the training data) and find that only one or a tiny handful of hypotheses in M fit that data, so for basically only one hypothesis …
Then your posterior probability that the hypothesis is correct will probably be tiny, scaling with . If you spread your prior over lots of hypotheses, there isn’t a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in M except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless is really small, i.e. no hypothesis outside the set M can explain the data either.
So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we’d have hypotheses if our function fits used -bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as goes to infinity.
So the Wikipedia derivation for the original vanilla posterior of model selection is telling us that having lots of parameters is bad, because it means we’re spreading our prior around exponentially many hypotheses… if we have the sort of prior that says all the hypotheses are about equally likely.
But that’s an insane prior to have! We only have worth of probability to go around, and there’s an infinite number of different hypotheses. Which is why you’re supposed to assign prior based on K-complexity, or at least something that doesn’t go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don’t do that.
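The counting argument above can be sketched numerically. This is a toy illustration rather than the derivation itself, and every name in it is an assumption of mine: `p_class` is the prior mass on the whole hypothesis class, `n_params` and `bits_per_param` set how many hypotheses the class contains, exactly one hypothesis in the class is assumed to fit the data perfectly, and `lik_outside` is the likelihood that anything outside the class assigns to the data.

```python
from fractions import Fraction

def posterior_of_survivor(p_class, n_params, bits_per_param, lik_outside):
    """Posterior on the single surviving hypothesis under an even prior.

    Illustrative assumptions: the class holds 2**(bits_per_param * n_params)
    hypotheses, its prior mass p_class is spread evenly over them, exactly one
    of them fits the data (likelihood 1), and everything outside the class
    fits the data with likelihood lik_outside.
    """
    n_hypotheses = 2 ** (bits_per_param * n_params)
    prior_each = Fraction(p_class) / n_hypotheses
    evidence_inside = prior_each  # only one hypothesis in the class survives
    evidence_outside = (1 - Fraction(p_class)) * Fraction(lik_outside)
    return evidence_inside / (evidence_inside + evidence_outside)

# More parameters means more hypotheses sharing the same prior mass.
for n in (1, 4, 16):
    post = posterior_of_survivor(Fraction(1, 2), n, 8, Fraction(1, 1000))
    print(n, float(post))
```

The posterior collapses as `n_params` grows, unless `lik_outside` is zero, in which case the lone survivor takes all of the posterior no matter how thinly the prior was spread.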
In summary: badly normalised priors behave badly
SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don’t line up one-to-one with hypotheses.
It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
So our prior over hypotheses in that case is actually somewhat well-behaved in that it can end up normalised properly when we take . That is a basic requirement a sane prior needs to have, so we’re at least not completely shooting ourselves in the foot anymore. But that still doesn’t show why this prior, that neural networks sort of[1] implicitly have, is actually good. Just that it’s no longer obviously wrong in this specific way.
Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?
I dunno. SLT doesn’t say. It just tells us how the parameter prior to hypothesis prior conversion ratio works, and in the process shows us that neural networks priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought at least.
That’s all though. It doesn’t tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.
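The parameter-prior-to-hypothesis-prior conversion described above can be illustrated with a tiny enumeration. The model here is a hypothetical one I picked for convenience, not anything from SLT itself: a chain of scalar linear layers with integer weights, so the implemented function x -> c*x depends only on the product c of the weights, and a uniform prior over weight settings induces a very non-uniform prior over functions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import prod

def hypothesis_prior(depth, weight_values):
    """Prior over functions x -> c*x induced by a uniform prior over weights.

    Toy assumption: a chain of `depth` scalar linear layers, so two weight
    settings are the same hypothesis exactly when their products agree.
    """
    weight_values = list(weight_values)
    counts = Counter(prod(ws) for ws in product(weight_values, repeat=depth))
    total = len(weight_values) ** depth
    return {c: Fraction(n, total) for c, n in counts.items()}

prior = hypothesis_prior(depth=2, weight_values=range(-2, 3))
for slope, mass in sorted(prior.items()):
    print(slope, mass)
```

In this toy the zero function is implemented by far more weight settings than any other slope, so it keeps a sizeable share of the prior even as `depth` (and with it the number of parameter configurations) grows.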
How to make this story tighter?
If people aim to make further headway on the question of why some function fits generalise somewhat and others don’t, beyond: ‘Well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn’t actively bad’, then I’d suggest a starting point might be to make a different derivation for the posterior on the fits that isn’t trying to reason about defined as the probability that one of the function fits is ‘true’ in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a billion parameter transformer to internet data, we don’t expect going in that any of these parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of , which the SLT derivation of the posterior and most other derivations of this sort I’ve seen seem to implicitly make, we basically have going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like .
SLT in three sentences
‘You thought your choice of prior was broken because it’s not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also here’s a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part’s not really finished’.
SLT in one sentence
‘Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.’
- ^
Sorta, kind of, arguably. There’s some stuff left to work out here. For example vanilla SLT doesn’t even actually tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid having to establish equivalence over all possible inputs by instead checking which parameter settings give the same hidden representations over the training data, not just the same outputs.
It’s measuring the volume of points in parameter space with loss < ε when ε is infinitesimal.
This is slightly tricky because it doesn’t restrict itself to bounded parameter spaces,[1] but you can fix it with a technicality by considering how the volume scales with ε instead.
In real networks trained with finite amounts of data, you care about the case where ε is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss < ε, if you can manage that.
I still think SLT has some neat insights that helped me deconfuse myself about networks.
For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss < ε using just the eigenvalues of the Hessian. You can’t. At least not in general.
- ^
Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space
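A minimal sketch of the volume-scaling point, under assumptions of my own: two toy losses on the box [-1, 1]^2, a regular one, w1^2 + w2^2, and a singular one, (w1*w2)^2, with the tolerance called `eps`. The area of the region with loss below `eps` is computed by integrating slice widths, and the scaling exponent is read off from two values of `eps`.

```python
import math

def sublevel_area(slice_width, eps, n=100_000):
    """Area of {(w1, w2) in [-1, 1]^2 : loss < eps}, midpoint rule over w1."""
    h = 2.0 / n
    return sum(slice_width(-1.0 + (i + 0.5) * h, eps) for i in range(n)) * h

def width_regular(w1, eps):
    # loss = w1^2 + w2^2, so the slice is |w2| < sqrt(eps - w1^2)
    rem = eps - w1 * w1
    return min(2.0, 2.0 * math.sqrt(rem)) if rem > 0 else 0.0

def width_singular(w1, eps):
    # loss = (w1 * w2)^2, so the slice is |w2| < sqrt(eps) / |w1|
    return min(2.0, 2.0 * math.sqrt(eps) / abs(w1))

def scaling_exponent(slice_width, eps_hi=1e-3, eps_lo=1e-4):
    """Slope of log(area) against log(eps) between two tolerances."""
    v_hi = sublevel_area(slice_width, eps_hi)
    v_lo = sublevel_area(slice_width, eps_lo)
    return math.log(v_hi / v_lo) / math.log(eps_hi / eps_lo)

exp_regular = scaling_exponent(width_regular)
exp_singular = scaling_exponent(width_singular)
print("regular:", exp_regular)
print("singular:", exp_singular)
```

For the regular loss the Hessian at the minimum is diag(2, 2) and the area scales like `eps` to the power d/2 = 1, as a Hessian-based estimate would predict. For the singular loss the Hessian at the origin is the zero matrix, so its eigenvalues say nothing useful, yet the area still shrinks, roughly like `eps` to the power 1/2 with a logarithmic correction.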
Right. If I have fully independent latent variables that suffice to describe the state of the system, each of which can be in one of different states, then even tracking the probability of every state for every latent with a bit precision float will only take me about bits. That’s actually not that bad compared to for just tracking some max likelihood guess.
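The bit-counting comparison can be made concrete with a short sketch. The parameter names `n_latents`, `n_states` and `float_bits` are my own labels for the number of independent latents, the number of states per latent and the float precision; the joint-distribution count is added only as a contrast for the case where the latents are not independent.

```python
import math

def bits_factored_belief(n_latents, n_states, float_bits):
    """One probability per state per latent, valid if latents are independent."""
    return n_latents * n_states * float_bits

def bits_point_estimate(n_latents, n_states):
    """Just the most likely state of each latent."""
    return n_latents * math.ceil(math.log2(n_states))

def bits_joint_belief(n_latents, n_states, float_bits):
    """One probability per joint state, needed if latents are not independent."""
    return n_states ** n_latents * float_bits

# Hypothetical numbers, chosen only for illustration.
print(bits_factored_belief(100, 16, 8))
print(bits_point_estimate(100, 16))
print(bits_joint_belief(3, 16, 8))
```

The factored belief state costs only a constant factor more than the point estimate, while the joint belief state grows exponentially in the number of latents.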
With that in mind, the real hot possibility is the inverse of what Shai and his coresearchers did. Rather than start with a toy model with some known nice latents, start with a net trained on real-world data, and go look for self-similar sets of activations in order to figure out what latent variables the net models its environment as containing. The symmetries of the set would tell us something about how the net updates its distributions over latents in response to inputs and time passing, which in turn would inform how the net models the latents as relating to its inputs, which in turn would inform which real-world structures those latents represent.
Thank you, this was very much the paragraph I was missing to understand why comp mech might be useful for interpretability.
How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don’t have enough space and compute to actually track a distribution over latent states?
Approximating those distributions by something like ‘peak position plus spread’ seems like the kind of thing a model might do to save space.
Typo fixed, thanks.
Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure
Your example has it be an important bit though. What database to use. Not a random bit. If I’m getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup.
One bit of adversarial optimisation doesn’t mean the oracle gets to select one bit of its choice in the string to flip; it means it gets to select one of two strings[1].
- ^
Plus the empty string for not answering.
First thought: The Oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.
Maybe we can stop that by scrapping every Oracle that doesn’t answer and training a new one with presumably new goals? Or would the newly trained Oracles just cooperate with the former dead ones in one long-term plan to break out, take control, and reward all the dead Oracles created on the way with utility?
Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they’ve been satisficed? Are they the kind of problems where solving them can save the world?
It feels to me like the answer is ‘yes’. A lot of core research that would allow e.g. for brain augmentation seems like it’d be in that category. But my inner John Wentworth sim is looking kind of sceptical.
- ^
It also gets to choose the timing of its answer, but I assume we are not being idiots about that and are setting the output channel to always deliver results after a set time, no more and no less.
I think the may be in there because JL is putting an upper bound on the interference, rather than describing the typical interference of two features. As you increase (more features), it becomes more difficult to choose feature embeddings such that no features have high interference with any other features.
So it’s not really the ‘typical’ noise between any two given features, but it might be the relevant bound for the noise anyway? Not sure right now which one matters more for practical purposes.
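The difference between typical and worst-case interference can be checked with a quick simulation. This is a sketch under my own assumptions: features are embedded as random unit vectors, `m` is the number of features and `d` the embedding dimension, and "interference" means the absolute inner product between two embeddings.

```python
import math
import random

def random_unit_vectors(m, d, rng):
    """m random directions in d dimensions, normalised to unit length."""
    vecs = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])
    return vecs

def interference_stats(m, d, seed=0):
    """Return (RMS overlap, largest overlap) over all pairs of features."""
    rng = random.Random(seed)
    vecs = random_unit_vectors(m, d, rng)
    overlaps = [abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
                for i in range(m) for j in range(i + 1, m)]
    rms = math.sqrt(sum(o * o for o in overlaps) / len(overlaps))
    return rms, max(overlaps)

rms_100, worst_100 = interference_stats(100, 64)
rms_20, worst_20 = interference_stats(20, 64)
print("typical (RMS):", rms_100, "worst pair:", worst_100)
print("worst pair with only 20 features:", worst_20)
```

The RMS overlap stays near 1/sqrt(d) however many features there are, while the worst pair can only get worse as features are added, which is the quantity an upper bound has to control.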
How does that make you feel about the chances of the rebels destroying the Death Star? Do you think that the competent planning being displayed is a good sign? According to movie logic, it’s a really bad sign.
Even in the realm of movie logic, I always thought the lack of backup plans was supposed to signal how unlikely the operation is to work, so as to create at least some instinctive tension in the viewer when they know perfectly well that this isn’t the kind of movie that realistically ends with the Death Star blowing everyone up. In fact, these scenes usually have characters directly stating how nigh-impossible the mission is.
To the extent that the presence of backup plans makes me worried, it’s because so many movies have pulled this cheap trick that my brain now associates the presence of backup plans with the more uncommon kind of story that attempts to work a little like real life, so things won’t just magically work out and the Death Star really might blow everyone up.
I feel like ‘LeastWrong’ implies a focus on posts judged highly accurate or predictive in hindsight, when in reality I feel like the curation process tends to weigh originality, depth and general importance a lot as well, with posts regarded by the community as ‘big if true’ often being held in high regard.
I figured the probability adjustments the pump was making were modifying Everett branch amplitude ratios. Not probabilities as in reasoning tools to deal with incomplete knowledge of the world and logical uncertainty that tiny human brains use to predict how this situation might go based on looking at past ‘base rates’. It’s unclear to me how you could make the latter concept of an outcome pump a coherent thing at all. The former, on the other hand, seems like the natural outcome of the time machine setup described. If you turn back time when the branch doesn’t have the outcome you like, only branches with the outcome you like will remain.
I can even make up a physically realisable model of an outcome pump that acts roughly like the one described in the story without using time travel at all. You just need a bunch of high quality sensors to take in data, an AI that judges from the observed data whether the condition set is satisfied, a tiny quantum random noise generator to respect the probability orderings desired, and a false vacuum bomb, which triggers immediately if the AI decides that the condition does not seem to be satisfied. The bomb works by causing a local decay of the metastable[1] electroweak vacuum. This is a highly energetic, self-sustaining process once it gets going, and spreads at the speed of light. Effectively destroying the entire future light-cone, probably not even leaving the possibility for atoms and molecules to ever form again in that volume of space.[2]
So when the AI triggers the bomb or turns back time, the amplitude of earth in that branch basically disappears. Leaving the users of the device to experience only the branches in which the improbable thing they want to have happen happens.
And causing a burning building with a gas supply in it to blow up strikes me as something you can maybe do with a lot less random quantum noise than making your mother phase through the building. Firefighter brains are maybe comparatively easy to steer with quantum noise as well, but that only works if there are any physically nearby enough to reach the building in time to save your mother at the moment the pump is activated.
This is also why the pump has a limit on how improbable an event it can make happen. If the event has an amplitude of roughly the same size as the amplitude for the pump’s sensors reporting bad data or otherwise causing the AI to make the wrong call, the pump will start being unreliable. If the event’s amplitude is much lower than the amplitude for the pump malfunctioning, it basically can’t do the job at all.
- ^
In real life, it was an open question whether our local electroweak vacuum is in a metastable state last I checked, with the latest experimental evidence I’m aware of from a couple of years ago tentatively (ca. 3 sigma I think?) pointing to yes, though that calculation is probably assuming Standard Model physics, the applicability of which people can argue to hell and back. But it sure seems like a pretty self-consistent way for the world to be, so we can just declare that the fictional universe works like that. Substitute strangelets or any other conjectured instant-earth-annihilation-method of your choice if you like.
- ^
Because the mass terms for the elementary quantum fields would look all different now. Unclear to me that the bound structures of hadronic matter we are familiar with would still be a thing.
Thinking the example through a bit further: In a ReLU layer, features are all confined to the positive quadrant. So superposed features computed in a ReLU layer all have positive inner product. So if I send the output of one ReLU layer implementing AND gates in superposition directly to another ReLU layer implementing further ANDs on a subset of the outputs of that previous layer[1], the assumption that input directions are equally likely to have positive and negative inner products is not satisfied.
Maybe you can fix this with bias setoffs somehow? Not sure at the moment. But as currently written, it doesn’t seem like I can use the outputs of one layer performing a subset of ANDs as the inputs of another layer performing another subset of ANDs.
EDIT: Talked it through with Jake. Bias setoff can help, but it currently looks to us like you still end up with AND gates that share a variable systematically having positive sign in their inner product. Which might make it difficult to implement a valid general recipe for multi-step computation if you try to work out the details.
- ^
A very central use case for a superposed boolean general computer. Otherwise you don’t actually get to implement any serial computation.
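The sign constraint is easy to check directly. In this sketch, which is my own illustration and not the construction from the post, random Gaussian vectors stand in for feature directions before the nonlinearity and their ReLU images stand in for directions computed by a ReLU layer.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def fraction_negative_overlaps(vectors):
    """Share of distinct pairs whose inner product is strictly negative."""
    pairs = [(i, j) for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    negative = sum(
        1 for i, j in pairs
        if sum(a * b for a, b in zip(vectors[i], vectors[j])) < 0
    )
    return negative / len(pairs)

rng = random.Random(0)
pre = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(60)]
post = [relu(v) for v in pre]

frac_pre = fraction_negative_overlaps(pre)
frac_post = fraction_negative_overlaps(post)
print("negative overlaps before ReLU:", frac_pre)
print("negative overlaps after ReLU:", frac_post)
```

Roughly half of the random pairs have negative inner product, but none of the post-ReLU pairs do, since every coordinate is non-negative. An argument that needs overlaps to be symmetric around zero cannot be applied unchanged to the second layer's inputs.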
Noting out loud that I’m starting to feel a bit worried about the culture-war-like tribal conflict dynamic between AIS/LW/EA and e/acc circles that I feel is slowly beginning to set in on our end as well, centered on Twitter but also present to an extent on other sites and in real life. The potential sanity damage to our own community and possibly future AI policy from this should it intensify is what concerns me most here.
People have tried to suck the rationalist diaspora into culture-war-like debates before, and I think the diaspora has done a reasonable enough job of surviving intact by not taking the bait much. But on this topic, many of us actually really care about both the content of the debate itself and what people outside the community think of it, and I fear it is making us more vulnerable to the algorithms’ attempts to infect us than we have been in the past.
I think us going out of our way to keep standards high in memetic public spaces might possibly help some in keeping our own sanity from deteriorating. If we engage on Twitter, maybe we don’t just refrain from lowering the level of debate and using arguments as soldiers, but also try to have a policy of actively commenting to correct the record when people of any affiliation make locally-invalid arguments against our opposition, if we would counterfactually also correct the record were such a locally-invalid argument directed against us or our in-group. I think the behavior of high-status community members who are highly visible on Twitter/YouTube might end up having a particularly high impact on the eventual outcome here.
Having digested this a bit more, I’ve got a question regarding the noise terms, particularly for section 1.3 that deals with constructing general programs over sparse superposed variables.
Unfortunately, since the are random vectors, their inner product will have a typical size of . So, on an input which has no features connected to neuron , the preactivation for that neuron will not be zero: it will be a sum of these interference terms, one for each feature that is connected to the neuron. Since the interference terms are uncorrelated and mean zero, they start to cause neurons to fire incorrectly when neurons are connected to each neuron. Since each feature is connected to each neuron with probability this means neurons start to misfire when [13].
It seems to me that the assumption of uncorrelated errors here is rather load-bearing. If you don’t get uncorrelated errors over the inputs you actually care about, you are forced to scale back to connecting only features to every neuron, correct? And the same holds for the construction right after this one, and probably most of the other constructions shown here?
And if you only get connected features per neuron, you scale back to only being able to compute arbitrary AND gates per layer, correct?
Now, the reason these errors are ‘uncorrelated’ is that the features were embedded as random vectors in our layer space. In other words, the distribution over which they are uncorrelated is the distribution of feature embeddings and sets of neurons chosen to connect to particular features. So for any given network, we draw from this distribution only once, when the weights of the network are set, and then we are locked into it.
So this noise will affect particular sets of inputs strongly, systematically, in the same direction every time. If I divide the set of features into two sets, where features in each half are embedded along directions that have a positive inner product with each other[1], I can’t connect more than from the same half to the same neuron without making it misfire, right? So if I want to implement a layer that performs ANDs on exactly those features that happen to be embedded within the same set, I can’t really do that. Now, for any given embedding, that’s maybe only some particular sets of features which might not have much significance to each other. But then the embedding directions of features in later layers depend on what was computed and how in the earlier layers, and the limitations on what I can wire together apply every time.
I am a bit worried that this and similar assumptions about stochasticity here might turn out to prevent you from wiring together the features you need to construct arbitrary programs in superposition, with ‘noise’ from multiple layers turning out to systematically interact in exactly such a way as to prevent you from computing too much general stuff. Not because I see a gears-level way this could happen right now, but because I think rounding off things to ‘noise’ that are actually systematic is one of these ways an exciting new theory can often go wrong and see a structure that isn’t there, because you are not tracking the parts of the system that you have labeled noise and seeing how the systematics of their interactions constrain the rest of the system.
Like making what seems like a blueprint for a perpetual motion machine because you’re neglecting to model some small interactions with the environment that seem like they ought not to affect the energy balance on average, missing how the energy losses/gains in these interactions are correlated with each other such that a gain at one step immediately implies a loss in another.
Aside from looking at error propagation more, maybe a way to resolve this might be to switch over to thinking about one particular set of weights instead of reasoning about the distribution the weights are drawn from?
- ^
E.g. pick some hyperplanes and declare everything on one side of all of them to be the first set.
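The worry about systematic interference can be illustrated for one fixed draw of embeddings. This is a sketch under assumptions of mine: features are random unit vectors in `d` dimensions, and the "same set" is picked out by two coordinate hyperplanes, in the spirit of the hyperplane construction above.

```python
import math
import random

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pairwise_overlap(vecs):
    """Mean inner product over distinct pairs of unit vectors.

    Uses |sum_i v_i|^2 = m + 2 * sum_{i<j} <v_i, v_j>.
    """
    m, d = len(vecs), len(vecs[0])
    total = [sum(v[k] for v in vecs) for k in range(d)]
    return (sum(x * x for x in total) - m) / (m * (m - 1))

rng = random.Random(0)
d, m = 16, 2000
features = [unit([rng.gauss(0.0, 1.0) for _ in range(d)]) for _ in range(m)]

# Features lying on the same side of two hyperplanes (first two coordinates > 0).
same_side = [v for v in features if v[0] > 0 and v[1] > 0]

overlap_all = mean_pairwise_overlap(features)
overlap_same = mean_pairwise_overlap(same_side)
print("features in subset:", len(same_side))
print("mean overlap, all features:", overlap_all)
print("mean overlap, same-side subset:", overlap_same)
```

Across all features the overlaps average out to about zero, so a sum of k interference terms grows like sqrt(k). Within the subset the mean overlap is positive, so the summed interference on a neuron wired to k of those features grows like k instead.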
Update February 2024: I left Ireland over a year ago, and the group is probably dead now, unfortunately. There’s still an EA group around, which as of this writing seems quite active.
If the SAEs are not full-distribution competitive, I don’t really trust that the features they’re seeing are actually the variables being computed on in the sense of reflecting the true mechanistic structure of the learned network algorithm and that the explanations they offer are correct[1]. If I pick a small enough sub-distribution, I can pretty much always get perfect reconstruction no matter what kind of probe I use, because e.g. measured over a single token the network layers will have representation rank 1, and the entire network can be written as a rank-1 linear transform. So I can declare the activation vector at a given layer to be the active “feature”, use the single-entry linear maps between SAEs to “explain” how features between layers map to each other, and be done. Those explanations will of course be nonsense and not at all extrapolate out of distribution. I can’t use them to make a causal model that accurately reproduces the network’s behavior or some aspect of it when dealing with a new prompt.
We don’t train SAEs on literally single tokens, but I would be worried about the qualitative problem persisting. The network itself doesn’t have a million different algorithms to perform a million different narrow subtasks. It has a finite description length. It’s got to be using a smaller set of general algorithms that handle all of these different subtasks, at least to some extent. Likely more so for more powerful and general networks. If our “explanations” of the network then model it in terms of different sets of features and circuits for different narrow subtasks that don’t fit together coherently to give a single good reconstruction loss over the whole distribution, that seems like a sign that our SAE layer activations didn’t actually capture the general algorithms in the network. Thus, predictions about network behaviour made on the basis of inspecting causal relationships between these SAE activations might not be at all reliable, especially predictions about behaviours like instrumental deception which might be very mechanistically related to how the network does well on cross-domain generalisation.
- ^
As in, that seems like a minimum requirement for the SAEs to fulfil. Not that this would be enough to make me trust predictions about generalisation based on stories about SAE activations.
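The single-token degenerate case can be written out in a few lines. This is a toy sketch: `true_layer` is a made-up nonlinear map standing in for a network layer, and the "explanation" is the rank-1 linear map fitted to one input only.

```python
def true_layer(x):
    # Hypothetical stand-in for a nonlinear network layer.
    return [max(0.0, x[0] - x[1]), max(0.0, x[1] + 0.5 * x[0])]

def rank_one_fit(x, y):
    """Rank-1 linear map W = y x^T / (x . x), which sends x exactly to y."""
    xx = sum(a * a for a in x)
    return [[yi * xj / xx for xj in x] for yi in y]

def apply_map(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

x_train = [1.0, 2.0]
W = rank_one_fit(x_train, true_layer(x_train))

x_new = [2.0, -1.0]
print("train input:", apply_map(W, x_train), "vs", true_layer(x_train))
print("new input:  ", apply_map(W, x_new), "vs", true_layer(x_new))
```

Reconstruction on the one training input is perfect, and the same map is wrong on the new input, which is the sense in which perfect reconstruction on a narrow sub-distribution says little about whether the mechanism has been captured.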
Our reconstruction scores were pretty good. We found GPT2 small achieves a cross entropy loss of about 3.3, and with reconstructed activations in place of the original activation, the CE Log Loss stays below 3.6.
Unless my memory is screwing up the scale here, 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B. And do I see it right that this is the CE increase maximum for adding in one SAE, rather than all of them at the same time? So unless there is some very kind correlation in these errors where every SAE is failing to reconstruct roughly the same variance, and that variance at early layers is not used to compute the variance SAEs at later layers are capturing, the errors would add up? Possibly even worse than linearly? What CE loss do you get then?
Have you tried talking to the patched models a bit and compared to what the original model sounds like? Any discernible systematic differences in where that CE increase is changing the answers?
Nice! We were originally planning to train sparse MLPs like this this week.
Do you have any plans of doing something similar for attention layers? Replacing them with wider attention layers with a sparsity penalty, on the hypothesis that they’d then become more monosemantic?
Also, do you have any plans to train sparse MLPs at multiple layers in parallel, and try to penalise them to have sparsely activating connections between each other in addition to having sparse activations?