Like at many public companies, Google has anti-insider-trading policies that prohibit employees from trading in options and other derivatives on the company's stock, or shorting it.
Lewis Smith
yeah that makes sense I think
with later small networks taking the outputs of earlier small networks as their inputs.
what’s the distinction between two small networks connected in series, with the second taking the output of the first as its input, and one big network? what defines the boundaries of the networks here?
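To make the question concrete, here’s a toy PyTorch sketch (layer sizes made up): composing two small networks in series is, architecturally, just one bigger network.

```python
import torch.nn as nn

small_a = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
small_b = nn.Sequential(nn.Linear(32, 8), nn.ReLU())

# The second small network takes the first one's output as its input...
composed = nn.Sequential(small_a, small_b)

# ...which is the same computation as one 'big' network with the same layers;
# the boundary between the two is a label we impose, not part of the computation.
big = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8), nn.ReLU())
```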
I kind of agree that Dennett is right about this, but I think it’s important to notice that the idea he’s attacking (that all representation is explicit representation) is an old and popular one in the philosophy of mind, one that was at one point seen as natural and inevitable by many people working in the field, and one which I think still seems somewhat natural and obvious to many people who maybe haven’t thought about the counterarguments much (e.g. I think you can see echoes of this view in a post like this one, or in the idea that there will be some ‘intelligence algorithm’ which will be a relatively short Python program). The idea that a thought is always or mostly something like a sentence in ‘mentalese’ is, I think, still an attractive one to many people of a logical bent, as is the idea that formalised reasoning captures the ‘core’ of cognition.
I guess you are thinking about holes in a p-type semiconductor?
I don’t think I agree (perhaps obviously) that it’s better to think about the issues in the post in terms of physics analogies than in terms of the philosophy of mind and language. If you are thinking about how a mental representation represents some linguistic concept, then Dennett and Wittgenstein (and others!) are addressing the same problem as you, in a way that virtual particles really are not.
I think ‘atom’ would be a pretty good choice as well, but I think the actual choice of terminology is less important than making the distinction. I used ‘latent’ here because that’s what we used in our paper.
‘Feature’ is overloaded terminology
In the interpretability literature, it’s common to overload ‘feature’ to mean three separate things:
1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.
2. A general activity pattern across units which a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.
3. The elements of some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to simply refer to the elements of the SAE dictionary as ‘features’ as well. For example, we might say something like ‘the network represents features of the input with linear features in its representation space, which are recovered well by feature 3242 in our SAE’.
This seems bad: at best it’s a bit sloppy and confusing, and at worst it begs the question about the interpretability or usefulness of SAE features. It seems important to carefully distinguish between these senses in case they don’t coincide, so we think it’s worth giving them different names. The terminology we prefer is to reserve ‘feature’ for the conceptual senses of the word (1 and 2) and to use alternative terminology for case 3, such as ‘SAE latent’ instead of ‘SAE feature’. So we might say, for example, that a model uses a linear representation for a Golden Gate Bridge feature, which is recovered well by an SAE latent. We have tried to follow this terminology in our recent Gemma Scope report, and we think the added precision helps in thinking about feature representations, and about SAEs, more clearly.

To illustrate what we mean with an example, we might ask whether a network has a feature for numbers, i.e. whether it has any kind of localised representation of numbers at all. We can then ask what the format of this feature representation is: for example, how many dimensions does it use, where is it located in the model, does it have any kind of geometric structure, and so on. We could then separately ask whether it is discovered by a feature discovery algorithm: is there a latent in a particular SAE that describes it well? We think it’s important to recognise that these are all distinct questions, and to use terminology that can distinguish between them.
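To make the terminology concrete, here is a minimal sketch of a plain ReLU SAE (illustrative only, not the actual Gemma Scope architecture): the entries of `latent_acts` are what we would call ‘SAE latents’, while ‘feature’ is reserved for whatever the underlying network actually represents.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE sketch: dictionary elements are 'latents', not 'features'."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, resid_acts: torch.Tensor):
        # Latent activations: coefficients on the learned dictionary, which we
        # *hope* line up with the features (sense 2) that the model actually uses.
        latent_acts = torch.relu(self.encoder(resid_acts))
        reconstruction = self.decoder(latent_acts)
        return reconstruction, latent_acts
```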
We (the DeepMind language model interpretability team) started using this terminology in the Gemma Scope report, but we didn’t really justify the decision much there, and I thought it was worth making the argument separately.
-
I’m not that confident about the statistical query dimension (I assume that’s what you mean by SQ dimension?), but I don’t think it’s applicable; SQ dimension is about the difficulty of a task (e.g. binary parity), whereas explicit vs tacit representations are properties of an implementation, so it’s kind of apples to oranges.
To take the chess example again, one way to rank moves is to explicitly compute some kind of rule or heuristic from the board state, another is to do some kind of parallel search, and yet another is to use a neural network or something similar. The first one is explicit, the second is (maybe?) more tacit, and the last is unclear. I think stronger variations of the LRH kind of assume that the neural network must be ‘secretly’ explicit, but I’m not really sure this is necessary.
But I don’t think any of this is really affected by the SQ dimension because it’s the same task in all three cases (and we could possibly come up with examples which had identical performance?)
but maybe i’m not quite understanding what you mean
yeah, I think this paper is great!
i’m glad you liked it.
I definitely agree that the LRH and the interpretability of the linear features are separate hypotheses; that was what I was trying to get at by having monosemanticity as a separate assumption to the LRH. I think that these are logically independent; there could be some explicit representation such that everything corresponds to an interpretable feature, but in a format more complicated than linear (i.e. monosemanticity is true but the LRH is false), or, as you say, the network could in some sense be mostly manipulating features but these features could be very hard to understand (LRH true, monosemanticity false), or they could both just be the wrong frame. I definitely think it would be good if we spent a bit more effort clarifying these distinctions; I hope this essay made some progress in that direction, but I don’t think it’s the last word on the subject.
I agree that coming up with experiments which would test the LRH in isolation is difficult. But maybe this should be more of a research priority; we ought to be able to formulate a version of the strong LRH which makes strong empirical predictions. I think something along the lines of https://arxiv.org/abs/2403.19647 is maybe going in the right direction here. As a shameless self-plug, I hope that LMI’s recent work on open-sourcing a massive SAE suite (Gemma Scope) will let people test out this sort of thing.
Having said that, one reason I’m a bit pessimistic is that stronger versions of the LRH do seem to predict there is some set of ‘ground truth’ features that a wide-enough or well-tuned-enough SAE ought to converge to (perhaps there should be some ‘phase change’ in the scaling graphs as you sweep the hyperparameters), but AFAIK we have been unable to find any evidence for this, even in toy models.
I don’t want to overstate this point though; I think part of the reason for the excitement around SAEs is that this was genuinely quite good science: the Toy Models paper proposed some theoretical reasons to expect linear representations in superposition, which implied that something like SAEs should recover interesting representations, and that prediction then turned out to be quite successful! (This is why I say in the post that I think there’s a reasonable amount of evidence for at least the weak LRH.)
I’m not entirely sure I follow here; I am thinking of compositionality as a feature of the format of a representation (Chris Olah has a good note on this here: https://transformer-circuits.pub/2023/superposition-composition/index.html). I think whether we should expect one kind of representation or another is an interesting question, but ultimately an empirical one: there are some theoretical arguments for linear representations (basically that it should be easy for NNs to make decisions based on them), but the biggest reason to believe in them is just that people genuinely have found lots of examples of linear mediators that seem quite robust (e.g. Golden Gate Claude, Neel’s work on refusal directions).
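(To be concrete about the sort of thing I mean by a ‘linear mediator’, here is a toy numpy sketch with made-up activations, along the lines of the difference-in-means directions used in the refusal work: compute a candidate direction from two sets of activations and project it out.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 'residual stream' activations on two sets of prompts (purely illustrative).
acts_refusal = rng.normal(loc=0.5, size=(100, 64))
acts_harmless = rng.normal(loc=0.0, size=(100, 64))

# Candidate linear mediator: the difference-in-means direction.
direction = acts_refusal.mean(axis=0) - acts_harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate_direction(acts: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project the candidate direction out of the activations."""
    return acts - np.outer(acts @ d, d)

# If behaviour changes when this direction is removed (or added back in),
# that's evidence it really is a linear mediator.
```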
we have now done this https://github.com/google-deepmind/mishax
Maybe this is on us for not including enough detail in the post, but I’m pretty confident that you would lose your bet no matter how you operationalised it. We did compare ITO to using the encoder to pick features (using the top k) and then optimising the weights on those features at inference time, and to learning a post-hoc scale to address the ‘shrinkage’ problem where the encoder systematically underweights features, and gradient pursuit consistently outperformed both of them. So I think that gradient pursuit doesn’t just fiddle around with low-weight features; it also chooses features ‘better’.
With respect to your threshold suggestion: the structure of the specific algorithm we used (gradient pursuit) means that if GP has selected a feature, it tends to assign it quite a high weight, so I don’t think that would do much. SAE encoders tend to have many more features close to zero, because it’s structurally hard for them to avoid this. I would almost turn your argument around: I think that low-activating features in a normal SAE are likely not to be particularly interesting or interpretable either, as the structure of an SAE makes it difficult to avoid features that activate spuriously due to interference.
One quirk of gradient pursuit that is a bit weird is that it will almost always choose a new feature which is orthogonal to the span of features selected so far, which does seem a little artificial.
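(For concreteness, here is a rough numpy sketch of plain gradient pursuit: greedily add the atom most correlated with the residual, then take an exact line-search gradient step on the selected coefficients. This is simplified relative to what we actually ran, e.g. it ignores the non-negativity of SAE activations.)

```python
import numpy as np

def gradient_pursuit(x, dictionary, n_steps):
    """Sketch of gradient pursuit for a single vector x, with unit-norm dictionary columns."""
    n_atoms = dictionary.shape[1]
    coeffs = np.zeros(n_atoms)
    support = []
    residual = x.copy()
    for _ in range(n_steps):
        # Select the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(dictionary.T @ residual)))
        if idx not in support:
            support.append(idx)
        d_s = dictionary[:, support]
        grad = d_s.T @ residual                 # descent direction on the selected coefficients
        step_dir = d_s @ grad
        alpha = (residual @ step_dir) / (step_dir @ step_dir + 1e-12)  # exact line search
        coeffs[support] += alpha * grad
        residual = x - dictionary @ coeffs
    return coeffs
```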
Whether the way that it chooses features ‘better’ is actually better for interpretability is difficult to say. As we say in the post, we did manually inspect some examples and we couldn’t spot any obvious problems with the ITO decomposition, but we haven’t done a properly systematic, double-blind comparison of ITO to encoder ‘explanations’ in terms of interpretability, because it’s quite expensive for us in terms of time.
I think it’s too early to say whether ITO is ‘really’ helping or not, but I am pretty confident it’s worth more exploration, which is why we are spreading the word about this specific algorithm in this snippet (even though we didn’t invent it). I think training models using GP at training time, getting rid of the SAE framework altogether, is also worth exploring, to be honest. But at the moment it’s still quite hard to give sparse decompositions an ‘interpretability score’ that is objective and not too expensive to compute, so it’s a bit difficult to see how we would evaluate something like this. (I think auto-interp could be a reasonable way of screening ideas like this once we are running it more easily.)
I think there is a fairly reasonable theoretical argument that non-SAE decompositions won’t work well for superposition (because the NN can’t actually be using an iterative algorithm to read features), but to be honest I haven’t really seen any empirical evidence that this is either true or false, and I don’t think we should rule out that non-SAE methods would just work loads better; they do in almost every other sparse optimisation setting, afaik.
Yeah, I agree with everything you say; it’s just that I was trying to remind myself of enough SLT to give a ‘five-minute pitch’ for it to other people, and I didn’t like the idea of hanging it off the ReLU.
I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.
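(A toy numpy check of the permutation symmetry I have in mind, biases omitted: permuting the hidden units gives a different point in parameter space but exactly the same function.)

```python
import numpy as np

rng = np.random.default_rng(0)
w_in = rng.normal(size=(4, 3))    # input -> hidden
w_out = rng.normal(size=(1, 4))   # hidden -> output
x = rng.normal(size=3)

perm = rng.permutation(4)          # relabel the hidden units
w_in_p, w_out_p = w_in[perm], w_out[:, perm]

relu = lambda z: np.maximum(z, 0.0)
# Different parameters, identical function.
assert np.allclose(w_out @ relu(w_in @ x), w_out_p @ relu(w_in_p @ x))
```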
I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I’m interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about ‘oh well, the posterior is probably-sorta-Gaussian’ played a big role in its longevity as an idea.
yeah, it’s not totally clear what this ‘nearly singular’ thing would mean. Intuitively, it might be that there’s a kind of ‘hidden singularity’ in the space of this model that might affect the behaviour, like the singularity in a dynamical model with a phase transition, but I’m just guessing.
I’m trying to read through this more carefully this time: how load-bearing is the use of ReLU nonlinearities in the proof? This doesn’t intuitively seem like it should be that important (e.g. a sigmoid/GELU/tanh network feels like it is probably singular too, and it certainly has to be if SLT is going to tell us something important about NN behaviour, because changing the nonlinearity doesn’t change how NNs behave that much imo), but it does seem to be an important part of the construction you use.
maybe this is really naive (I just randomly thought of it), and you mention you do some obvious stuff like looking at the singular vectors of the activations which might rule it out, but could the low-frequency cluster be linked to something simple, like the fact that the use of ReLUs, GELUs, etc. means the neuron activations are going to be biased towards the positive quadrant of activation space in terms of magnitude (because negative components of any vector in the activation basis get cut off)? I wonder if the singular vectors would catch this.
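(Something like this toy check is what I had in mind, with random Gaussian ‘pre-activations’ rather than real model activations, just to illustrate how the positive-quadrant bias shows up in the top singular vector of uncentred activations.)

```python
import numpy as np

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(10_000, 512))   # toy pre-activations, not real model data
acts = np.maximum(pre_acts, 0.0)            # ReLU pushes everything into the positive quadrant

mean_dir = acts.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)

# Top right singular vector of the *uncentred* activation matrix is dominated by this mean offset.
_, _, vt = np.linalg.svd(acts, full_matrices=False)
print(abs(vt[0] @ mean_dir))   # close to 1 here; much smaller if you centre the activations first
```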
I think it’s important to be careful with elaborately modelled reasoning about this kind of thing, because the second-order political effects are very hard to predict but also likely to be extremely important, possibly even more important than the direct effect on timelines in some scenarios. For instance, you mention leading labs slowing down as bad (because the leading labs are ‘safety conscious’ and slowing down dilutes their lead). In my opinion, this is a very simplistic model of the likely effects of this intervention. There are a few reasons for this:
- Taking drastic unilateral action creates new political possibilities. A good example is Hinton and Bengio ‘defecting’ to advocating strongly for AI safety in public; I think this has had a huge effect on ML researchers and governments taking things seriously, even though the direct effect on AI research is probably negligible. For instance, Hinton in particular made me personally take a much more serious look at AI safety arguments, and this has influenced me to try to re-orient my career in a more safety-focused direction. I find it implausible that a leading AI lab shutting itself down for safety reasons would have no second-order political effects along these lines, even if the direct impact was small: if there’s one lesson I would draw from covid and the last year or so of AI discourse, it’s that the Overton window is much more mobile than people often think. A dramatic intervention like this would obviously have uncertain outcomes, but could trigger unforeseen possibilities. Unilateral action that disadvantages the actor also makes a political message much more powerful. There’s a lot of skepticism when labs like Anthropic talk loudly about AI risk because of the objection ‘if it’s so bad, why are you making it?’. While there are technical arguments that there are good reasons to work on safety and AI development simultaneously, this makes the message much harder to communicate, and people will understandably have doubts about your motives.
- ‘We can’t slow down because someone else will do it anyway’: I actually think this is probably wrong. In a counterfactual world where OpenAI didn’t throw lots of resources and effort into language models, I’m not actually sure someone else would have bothered to continue scaling them, at least not for many years. Research is not a linear process, and a field being unfashionable can delay progress by a considerable amount; just look at the history of neural network research! I remember many people in academia being extremely skeptical of scaling laws around the time they were being published; if OpenAI hadn’t pushed on it, and it had become unfashionable for whatever reason, it could have taken years or even decades for another lab to throw enough resources at that hypothesis.
- I’m not sure it’s always true that other labs catch up if the leading ones stop: progress also isn’t a simple function of time; without people trying to scale massive GPU clusters you don’t get practical experience with the kinds of problems such systems have, production lines don’t re-orient themselves towards the needs of such systems, etc. There are important feedback loops in this kind of process that the big labs shutting down could disrupt, such as attracting more talent and enthusiasm into the field. It’s also not true that all ML research is a monolithic line towards ‘more AGI’; from my experience of academia, many researchers would have quite happily worked on small specialised systems in a variety of domains for the rest of time.
I think many of these arguments also apply to arguments against ‘US moratorium now’ - for instance, it’s much easier to get other countries to listen to you if you take unilateral actions, as doing so is a costly signal that you are serious.
This isn’t necessarily to say that I think a US moratorium or a leading lab shutting down would actually be a useful thing, just that I don’t think it’s cut and dried that it wouldn’t. Consider what would happen if a leading lab actually did shut itself down: would there really be no political consequences that would have a serious effect on the development of AI? I think that your argument makes a lot of sense if we are considering ‘spherical AI labs in a vacuum’, but I’m not sure that’s how it plays out in reality.
-
Any post along the lines of yours needs a ‘political compass’ diagram lol.
I mean, it’s hard to say what Altman would think in your hypothetical debate: assuming he has reasonable freedom of action at OpenAI, his revealed preference seems to be to devote ≤ 20% of the resources available to his org to ‘the alignment problem’. If he wanted to assign more resources to ‘solving alignment’ he could probably do so. I think Altman thinks he’s basically doing the right thing in terms of risk levels. Maybe that’s a naive analysis, but I think it’s probably reasonable to take him more or less at face value.
I also think that it’s worth saying that easily the most confusing argument for the general public is exactly the Anthropic/OpenAI argument that ‘AI is really risky but also we should build it really fast’. I think you can steelman this argument more than I’ve done here, and many smart people do, but there’s no denying it sounds pretty weird, and I think it’s why many people struggle to take it at face value when people like Altman talk about x-risk—it just sounds really insane!
In contrast, while people often think it’s really difficult and technical, I think Yudkowsky’s basic argument (building stuff smarter than you seems dangerous) is pretty easy for normal people to get, and many people agree with the general ‘big tech bad’ takes that the ‘realists’ like to make.
I think a lot of boosters who are skeptical of AI risk basically think ‘AI risk is a load of horseshit’ for various not-always-very-consistent reasons. It’s hard to overstate how much ‘don’t anthropomorphise’ and ‘thinking about AGI is distracting silliness by people who just want to sit around and talk all day’ are baked deep into the souls of ML veterans like LeCun. But I think people who would argue ‘no’ in your proposed alignment debate would, for example, probably strongly disagree that ‘the alignment problem’ is even a coherent thing to be solved.
Maybe I shouldn’t have used EY as an example; I don’t have any special insight into how he thinks about AI and power imbalances. Generally I get the vibe from his public statements that he’s pretty libertarian and thinks the pros outweigh the cons for most technology that he doesn’t consider x-risky. I think I’m moderately confident that he’s more relaxed about, say, misinformation or big tech platforms’ dominance than Melanie Mitchell, but maybe I’m wrong about that.
Your example agreement with a friend is obviously a derivative, which is just a contract whose value depends on the value of an underlying asset (Google stock in this case). If it’s not a formal derivative contract, you might be less likely to get in trouble for it compared to doing it on Robinhood or whatever (not legal advice!), but it doesn’t seem like a very good idea.