Thanks for writing the post, and it’s great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.
Some comments and questions:
I think “science of deep learning” would be a better term than “deep learning theory” for what you’re describing, given that I think all the phenomena you list aren’t yet theoretically grounded or explained in a mathematical way, and are rather robust empirical observations. Deep learning theory could be useful, especially if it had results concerning the internals of the network, but I think that’s a different genre of work to the science of DL work.
In your description of the relevance of the lottery ticket hypothesis (LTH), it feels like a bit of a non-sequitur to immediately discuss removing dangerous circuits at initialisation. I guess you think this is relevant because lottery tickets are in some way about removing circuits at the beginning of training (although currently we only know how to find which circuits by getting to the end of training)? I think the LTH potentially has broader relevance for MI, e.g.: if lottery tickets do exist and are of equal performance, then it's possible they'd be easier to interpret (due to increased sparsity); or just understanding what the existence of lottery tickets means for which circuits are more likely to emerge during neural network training.
When you say “Automating Mechanistic Interpretability research”, do you mean automating (1) the task of interpreting a given network (automating MI), or automating (2) the research of building methods/understanding/etc. that enable us to better-interpret neural networks (automating MI Research)? I realise that a lot of current MI research, even if the ultimate goal is (2), is mostly currently doing (1) as a first step.
Most of the text in that section implies automating (1) to me, but "Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety" seems to lean more towards automating (2), which comes under the general approach of automating alignment research. Obviously it would be great to be able to do both, but automating (1) seems both much more tractable and probably necessary to enable scalable interpretability of large models, whereas (2) is potentially less necessary for MI research to be useful for AI safety.
I’ve now had a conversation with Evan where he explained his position here, and I now agree with it. Specifically, this is in low path-dependency land, considering just the simplicity bias. In this case it’s likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at runtime, in activations, by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour maximising the training objective is an arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would always reliably produce maximisation of the training objective (e.g. internal or corrigible alignment), you’d need to fully encode the training objective.
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
The main argument of the post isn’t “ASI/AGI may be causally confused, what are the consequences of that” but rather “Scaling up static pretraining may result in causally confused models, which hence probably wouldn’t be considered ASI/AGI”. I think in practice if we get AGI/ASI, then almost by definition I’d think it’s not causally confused.
OOD misgeneralisation is absolutely inevitable, due to Gödel’s incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity
In a theoretical sense this may be true (I’m not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We’re arguing here that static training, even when scaled up, plausibly doesn’t lead to a model that isn’t causally confused about a lot of how the world works.
Did you use the term “objective misgeneralisation” rather than “goal misgeneralisation” on purpose? “Objective” and “goal” are synonyms, but “objective misgeneralisation” is hardly used, “goal misgeneralisation” is the standard term.
No reason, I’ll edit the post to use goal misgeneralisation. Goal misgeneralisation is the standard term but hasn’t been so for very long (see e.g. this tweet: https://twitter.com/DavidSKrueger/status/1540303276800983041).
Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks
Given that the model is trained statically, while it could hypothesise about additional variables of the kinds you listed, it can never know which variables, or which values for those variables, are correct without domain labels or interventional data. Specifically, while “Discovering such hidden confounders doesn’t give interventional capacity” is true, to discover these confounders the model would need interventional capacity.
I don’t understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?
We’re not saying that P(shorts, ice cream) is good for decision-making, but P(shorts, do(ice cream)) is useful insofar as the goal is to make someone wear shorts and providing ice cream is one of the possible actions (as the causal model will demonstrate that providing ice cream isn’t useful for making someone wear shorts).
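To make the observational/interventional distinction concrete, here’s a minimal toy simulation (my own construction, purely illustrative, writing the quantities as conditionals): hot weather causes both shorts and ice cream, so the two are correlated observationally, but intervening on ice cream does nothing to shorts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_icecream=None):
    # Confounder: hot weather raises the probability of both shorts and ice cream.
    hot = rng.random(n) < 0.5
    shorts = rng.random(n) < np.where(hot, 0.9, 0.1)
    if do_icecream is None:
        icecream = rng.random(n) < np.where(hot, 0.8, 0.2)   # observational regime
    else:
        icecream = np.full(n, do_icecream)                   # intervention: do(icecream)
    return shorts, icecream

# Observational: high (~0.74 here), because eating ice cream is evidence of hot weather.
shorts, icecream = sample()
print("P(shorts | icecream=1) =", shorts[icecream].mean())

# Interventional: just the base rate (~0.5), because handing out ice cream doesn't change the weather.
shorts, _ = sample(do_icecream=True)
print("P(shorts | do(icecream=1)) =", shorts.mean())
```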
What do these symbols in parens before the claims mean?
They are meant to be referring to the previous parts of the argument, but I’ve just realised that this hasn’t worked as the labels aren’t correct. I’ll fix that.
When you talk about whether we’re in a high or low path-dependence “world”, do you think there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it’s more likely that some training processes are highly path-dependent and some aren’t. We definitely have evidence that some are path-dependent, e.g. Ethan’s comment and other examples like https://arxiv.org/abs/2002.06305, and almost any RL paper, where different random seeds of the training process often result in quite different results. Arguably we don’t have conclusive evidence of any particular existing training process being low path-dependence, because the burden of proof is heavy for showing that two models are basically equivalent on basically all inputs (given that they’re very unlikely to literally have identical weights, so the equivalence would have to be at a high level of abstraction).
Reasoning about the path dependence of a training process specifically, rather than whether all of the ML/AGI development world is path dependent, seems more precise, and also allows us to reason about whether we want a high or low path-dependence training process, and considering that as an intervention, rather than a state of the world we can’t change.
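For concreteness, when I say the burden of proof for low path-dependence is heavy, the kind of behavioural-equivalence check I have in mind is something like the sketch below (the names are mine, and agreement on any finite input set is still only weak evidence of equivalence “on basically all inputs”):

```python
import torch

@torch.no_grad()
def behavioural_agreement(model_a, model_b, inputs):
    """Compare two checkpoints trained from different seeds: fraction of inputs with
    the same argmax prediction, and the worst-case gap between their raw outputs."""
    model_a.eval(); model_b.eval()
    out_a, out_b = model_a(inputs), model_b(inputs)
    same_pred = (out_a.argmax(-1) == out_b.argmax(-1)).float().mean().item()
    max_gap = (out_a - out_b).abs().max().item()
    return same_pred, max_gap
```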
When you say “the knowledge of what our goals are should be present in all models”, by “knowledge of what our goals are” do you mean that a pointer to our goals (given that there are probably multiple goals which are combined in some way) is in the world model? If so, this seems to contradict you earlier saying:
The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn’t need to have it hardcoded, whereas the corrigible model needs it to be hardcoded
I guess I don’t understand what it would mean for the deceptive AI to have the knowledge of what our goals are (in the world model), without that meaning it has a hard-coded pointer to what our goals are. I’d imagine that what it means for the world model to capture what our goals are is exactly having such a pointer to them.
(I realise I’ve been failing to do this, but it might make sense to use “AI” when we mean the outer system and “model” when we mean the world model. I don’t think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)
To try and run my (probably inaccurate) simulation of you: I imagine you don’t think the above is a contradiction. So you’d think that “knowledge of what our goals are” doesn’t mean a pointer to our goals in every AI’s world model, but something simpler, which the deceptive AI can use to figure out what our goals are (e.g. in its optimisation process), but which wouldn’t let the aligned AI use a simpler pointer as its objective; instead the aligned AI would have to hard-code the full pointer to our goals (where the pointer points into its world model, probably making use of this simpler information about our goals in some way). I’m struggling to imagine what that would look like.
Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren’t aligned with ours), to be deceptive successfully it then needs to rederive what our goals are, so that it can pursue them instrumentally. I’m arguing that the ability to do this would require additional complexity in the AI compared to an AI that doesn’t need to rederive the content of this goal (that is, our goal) at every decision.
Alternatively, the aligned model could use the same derivation process to be aligned: the deceptive model has some long-term goal, and in pursuing it rederives the content of the instrumental goal “do ‘what the training process incentivises’”, while the aligned model has the long-term goal “do ‘what the training process incentivises’” (as a pointer/de dicto), and also rederives it with the same level of complexity. I think “do ‘what the training process incentivises’” (as a pointer/de dicto) isn’t a very complex long-term goal, and it feels likely to be as complex as the deceptive AI’s arbitrary crystallised internal goal, assuming both models have full situational awareness of the training process and hence such a pointer is possible, which we’re assuming they do.
(ETA/Meta point: I do think deception is a big issue that we definitely need more understanding of, and I definitely put weight on it being a failure of alignment that occurs in practice, but I think I’m less sure it’ll emerge (or less sure that your analysis demonstrates that). I’m trying to understand where we disagree, and whether you’ve considered the doubts I have and you possess good arguments against them or not, rather than convince you that deception isn’t going to happen.)
It seems a lot more computationally difficult to derive/build/construct such a pointer at every forward pass/decision. If the deceptive model is going to be doing this every time, it seems like it would be more efficient to have a dedicated part of the network that calculates it (i.e. have it in the weights).
Separately, for more complex goals this procedure is also going to be more complex, and the network probably needs to be more complex to support constructing it in the activations at every forward pass, compared to the corrigible model that doesn’t need to do such a construction (because it has it hard-coded, as you say). I guess I’m arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has our goals hard-coded.
whereas the corrigible model needs it to be hardcoded
The corrigible model needs to be able to robustly point to our goals, in a way that doesn’t change. One way of doing this is having the goals hardcoded. Another is to have a pointer to the output of a procedure, executed at runtime, that always constructs our goals in the activations. If the deceptive model can reliably construct in its activations something that actually points towards our goals, then the corrigible model could also have such a procedure, and make its goal a pointer to the output of that procedure. Then the only difference in model complexity is that the deceptive model points to some arbitrary attribute of the world model (or whatever), and the aligned model points to the output of this computation, which both models possess.
I think at a high level I’m trying to say that for any way in which the deceptive model can robustly point at our goals so as to pursue them instrumentally, the aligned model can robustly point at them in the same way to pursue them terminally. SGD+DL+whatever may favour one way or another of robustly pointing at such goals (either in the weights, or through a procedure that robustly outputs them in the activations), but both the deceptive and aligned models could make use of that.
Now, one thing I will say is that, since the difficulty of changing the proxies into long-term goals seems to be essentially constant regardless of the complexity of the training objective—whereas the difficulty of creating a pointer to the training objective scales with that complexity—I think that, if we try to train models on increasingly complex goals, eventually deception will win almost regardless of the baseline “stickiness” level. But understanding that baseline could still be quite important, and it’s something that I think we can quite readily experiment with right now.
But the deceptively aligned model also needs “a pointer to the training objective” in order to optimise it instrumentally/deceptively, so there doesn’t seem to be a complexity penalty to training on complex goals.
This is similar to my comment on the original post about the likelihood of deceptive alignment, but reading that made it slightly clearer exactly what I disagreed with, hence writing the comment here.
I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with “there’d have been no hope whatsoever of identifying all the key problems in advance just based on theory”). Obviously this is a spectrum, but I think the chip fab analogy is further towards believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we’ll face).
However, they probably don’t believe you can work on solutions to those problems without being able to empirically demonstrate those problems and hence iterate on them (and again one could probably appeal to a track record here of most proposed solutions not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it’s going to be much better to try to actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems so that they can then work on those solutions, but this is still all empirical.)
Otherwise your ITT does seem reasonable to me, although I don’t think I’d put myself in the class of people you’re trying to ITT, so that’s not much evidence.
Thanks for writing this post, it’s great to see explicit (high-level) stories for how and why deceptive alignment would arise! Some comments/disagreements:
(Note I’m using “AI” instead of “model” to avoid confusing myself between “model” and “world model”, e.g. “the deceptively aligned AI’s world model” instead of “the deceptively-aligned model’s world model”).
Making goals long-term might not be easy
You say
Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals
However, this doesn’t necessarily seem all that simple. The world model and internal optimisation process need to be able to plan into the “long term”, or even have a conception of the “long term”, for the proxy goals to be long-term; this seems to depend heavily on whether the world model and internal optimisation process capture this.
Conditioning on the world model and internal optimisation process capturing this concept, it’s still not necessarily easy to convert proxies into long-term goals if the proxies are time-dependent in some way, as they might be: if tasks or episodes are of similar lengths, then a proxy like “start wrapping up my attempt at this task to present it to the human” is only useful if it’s conditioned on a time near the end of the episode. My argument here seems much sketchier, but I think this might be because I can’t come up with a good example. It seems like it’s not necessarily the case that “making goals long-term” is easy; that seems to be mostly taken on intuition that I don’t think I share.
Relatedly, it seems that conditioning on the capabilities of the world model and internal optimisation process changes the path somewhat, in a way that isn’t captured by your analysis. That is, it might be easier to achieve corrigible or internal alignment with a less capable world model/internal optimisation process (i.e. earlier in training), as it doesn’t require the world model/internal optimisation process to plan over the longer time horizons and with the greater situational awareness required to still perform well in the deceptive alignment case. Do you think that is the case?
On the overhang from throwing out proxies
In the high path-dependency world, you mention an overhang several times. If I understand correctly, what you’re referring to is that, as the world model increases in capability, it will start modelling things that are useful as internal optimisation targets for maximising the training objective, and at some point SGD could just throw away the AI’s internal goals (which we see as proxies) and instead point to these parts of the world model as the target, which would result in a large increase in the training objective, as these are much better targets. (This is the description of what would happen in the internally aligned case, but the same mechanism seems present in the other cases, as you mention.)
However, it seems like the main reason the world model would capture these parts of the world is if they were useful (for maximising the training objective) as internal optimisation targets, and so if they’re emerging and improving, it’s likely because there’s pressure for them to improve as they are being used as targets. This would mean there wasn’t an overhang of the sort described above.
Another way of phrasing this might be that the internal goals (proxies) the AIs have will be part of the world model/in the same ontology/using the same representations, they won’t be separate (as your story seems to imply?), and hence there won’t be something to switch them to inside the world model that provides a bump in the training objective; or if there is, this will happen smoothly as the things to switch to are better-modelled such that they become useful targets.
I think how this affects the analysis is that, as the AI learns more about its training process, this involves learning more about the training objective, and if it’s doing this, it would be very easy for the internal goals to shift to pointing at this understanding of the training objective (if it’s already there). This would result in a higher likelihood of corrigible alignment. Specifically, in the case where the AI has a full understanding of the training process, including a full understanding of the training objective (such that it models all parts of it, and there’s a single pointer that points to all these parts and is hence easily referenced), it seems (a priori) equally easy to shift the internal objective proxies to point at this training objective pointer as it is to shift them to long-term goals which then require optimising for the training objective pointer instrumentally. In fact it might even be simpler, as you may need a less complex/long-time-horizon-capable world model/internal optimiser in the first case than in the second.
Simplicity Bias Argument
(Building on the ideas above about how the world model could have a pointer to the training objective, and would need this for all model classes considered.) In the “slightly more formal analysis” of the simplicity bias in the low path-dependence world, you decompose the complexity of the model into a sum of its world model complexity, optimisation process complexity and mesa-objective complexity, and argue that the mesa-objective complexity is the key thing that’s going to vary between the model classes being considered. However, I don’t think this argument holds.
Specifically, for the deceptively aligned AI to perform well during training, its world model needs to contain a full specification of the training objective, and some pointer P to it such that the optimisation process can say “we want {mesa_objective}, but we need to do well in training instrumentally, so plan to make the value of pointer P go higher”. Either this makes the deceptively aligned AI’s world model more complex than the world model of the internally aligned AI (which instead has the pointer to the full specification of the training objective outside its world model, in the mesa-objective component), or it’s the same complexity as the internally aligned model’s, in which case the internally aligned AI’s mesa-objective can just reference that pointer, and hence is likely the same complexity as the simplest mesa-objective for the deceptively aligned AI (which likely also just references a pointer to some attribute of the world which the world model captures).
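To spell out the bookkeeping I have in mind (a rough sketch using the post's decomposition, under the assumption that both world models end up containing a specification of the training objective):

$$
\begin{aligned}
C(\text{deceptive}) &\approx C(\text{WM incl. training objective}) + C(\text{opt}) + C(\text{pointer to some simple proxy}),\\
C(\text{aligned}) &\approx C(\text{WM incl. training objective}) + C(\text{opt}) + C(\text{pointer to the training-objective part of the WM}),
\end{aligned}
$$

and the claim is that the final terms are of comparable (small) size.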
Phrased a different way, I’m imagining that the world model, optimisation process and mesa objective are all in a shared (internal to the model) ontology, and hence the mesa objective specification isn’t going to be a copy of the parts of the world model that are the objective (which would entail also copying all the machinery necessary to actually specify this from observations), but instead just a (weighted combination of) concept(s) in the internal ontology, which will be very simple to specify.
Overall, these considerations suggest that deceptive alignment is less likely than the analysis in this post implies. It does still seem very possible that deceptive alignment occurs, and I still agree that we need transparency tools to fix these problems, but perhaps I think we’re less underwater than Evan does (to use the terminology from the Conclusion).
I think perhaps a lot of work is being done by “if your optimiser worked”. This might also be where there’s a disanalogy between humans<->evolution and AIs<->SGD+PPO (or whatever RL algorithm you’re using to optimise the policy). Maybe evolution is actually a very weak optimiser that doesn’t really “work”, compared to SGD+RL.
Me, modelling skeptical ML researchers who may read this document:
It felt to me that “Large-scale goals are likely to incentivize misaligned power-seeking” and “AGIs’ behavior will eventually be mainly guided by goals they generalize to large scales” were the least well-argued sections (in that while reading them I felt less convinced, and the arguments were more hand-wavy than before).
In particular, the argument that we won’t be able to use other AGIs to help with supervision because of collusion is entirely contained in footnote 22, and doesn’t feel that robust to me - or at least it seems easier for a skeptical reader to dismiss that, and hence not think the rest of section 3 is well-founded. Maybe it’s worth adding another argument for why we probably can’t just use other AGIs to help with alignment, or at least that we don’t currently have good proposals for doing so that we’re confident will work (e.g. how do we know the other AGIs are aligned and are hence actually helping).
Also
Positive goals are unlikely to generalize well to larger scales, because without the constraint of obedience to humans, AGIs would have no reason to let us modify their goals to remove (what we see as) mistakes. So we’d need to train them such that, once they become capable enough to prevent us from modifying them, they’ll generalize high-level positive goals to very novel environments in desirable ways without ongoing corrections, which seems very difficult. Even humans often disagree greatly about what positive goals to aim for, and we should expect AGIs to generalize in much stranger ways than most humans.
seems to be saying that positive goals won’t generalise correctly because we need to get the positive goals exactly correct on the first try. I don’t know if that is exactly an argument for why positive goals won’t generalise correctly. It feels like this paragraph is trying to preempt the counterargument to this section that goes something like “Why wouldn’t we just interactively adjust the objective if we see bad behaviour?”, by justifying why we would need to get it right robustly and on the first try and throughout training, because the AGI will stop us doing this modification later on. Maybe it would be better to frame it that way if that was the intention.
Note that I agree with the document and I’m in favour of producing more ML-researcher-accessible descriptions of and motivations for the alignment problem, hence this effort to make the document more robust to skeptical ML researchers.
First condition: assess reasoning authenticity
To be able to do this step in the most general setting seems to capture the entire difficulty of interpretability: if we could assess whether a model’s outputs faithfully reflect its internal “thinking”, and hence that all of its reasoning is what we’re seeing, then that would be a huge jump forwards (and perhaps be equivalent to solving something like ELK). Given that that problem is known to be quite difficult, and we currently don’t have solutions for it, I’m uncertain whether this reduction of aligning a language model, into first verifying that all its visible reasoning is complete, correct and faithful, and then doing other steps (i.e. actively optimising against our measures of correct reasoning), is one that makes the problem easier. Do you think it’s meaningfully different (e.g. easier) to solve “assess reasoning authenticity” completely than to solve ELK, or another hard interpretability problem?
If, instead of using interpretability tools in the loss function, we merely use it as a ‘validation set’ instead of the training set (i.e. using it as a ‘mulligan’), we might have better chances of picking up dangerous cognition before it gets out of hand so we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much less tight, so it’d be harder to Goodhart.
While only using the interpretability-tool-based filter for model selection is much weaker optimisation pressure than using it in the loss function, and hence makes goodharting harder and slower, it’s not clear that this would solve the problem in the long run. If the interpretability-tool-based filter captures everything we currently know to look for, and we don’t get new insights during the iterated process of model training and model selection, then it’s possible we’ll eventually end up goodharting the model selection process in the same way as SGD would goodhart the interpretability tool in the loss function.
I think it’s likely that we would gain more insights or have more time if we were to use the interpretability tool as a mulligan, and it’s possible the way we as AI builders optimise producing a model that passes the interpretability filters is qualitatively different from the way SGD (or whatever training algorithm is being used) would optimise the interpretability-filter loss function. However, in the spirit of paranoia/security mindset/etc., it’s worth pointing out that using the tool as a model selection filter doesn’t guarantee that an AGI that passes the filter is safer than if we used the interpretability tool as a training signal, in the limit of iterating to pass the interpretability tool model selection filter.
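To make the contrast concrete, here’s a sketch of the two setups I’m comparing (every name here is a placeholder of mine, not a real tool or API):

```python
def train_with_interp_loss(model, data, task_loss, interp_penalty, opt, lam=1.0):
    """Setup A: the interpretability tool's score is a term in the loss, so SGD
    optimises directly against the tool at every step (tight loop, easy to goodhart)."""
    for x, y in data:
        loss = task_loss(model, x, y) + lam * interp_penalty(model, x)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def train_with_interp_mulligan(train_fn, interp_filter, max_attempts=10):
    """Setup B: ordinary training with no interpretability term; the tool is only an
    accept/reject filter at the end, so only we (not SGD) optimise against it, though
    iterating this selection loop can still goodhart the filter eventually."""
    for seed in range(max_attempts):
        model = train_fn(seed)
        if interp_filter(model):
            return model
    raise RuntimeError("no model passed the interpretability filter")
```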
Suppose that aligning an AGI requires 1000 person-years of research.
900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can’t get any of those four parts done in less than 25 years).
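Taking those numbers at face value (and assuming the four serial chunks can at least proceed concurrently with each other and with the parallelisable work), the calendar-time implication is roughly

$$
T_{\text{calendar}} \;\gtrsim\; \max\!\left(25~\text{years},\; \frac{900~\text{person-years}}{N_{\text{researchers}}}\right),
$$

i.e. adding researchers shrinks the second term but never the first, which is why the serial chunks dominate timelines in this model.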
Do you have a similar model for just building (unaligned) AGI? Or is the model meaningfully different? On a similar model for just building AGI, timelines would mostly be shortened by progressing through the serial research-person-years rather than the parallelisable research-person-years. If researchers who are advancing both capabilities and alignment are doing both in the parallelisable part, then this would be less worrying, as they’re not actually shortening timelines meaningfully.
Unfortunately I imagine you think that building (unaligned) AGI quite probably doesn’t have many more serial person-years of research required, if any. This is possibly another way of framing the prosaic AGI claim: “we expect we can get to AGI without any fundamentally new insights on intelligence, using (something like) current methods.”
I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven’t read the paper). However, the argument in the post is that even if we did scale up, we couldn’t solve the OOD generalisation problems.
Here we’re saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written “continual training” rather than specifically “continual fine-tuning”. If you could continually train online in the deployment environment then that would be better, and whether it’s enough is very related to whether online training is enough, which is one of the key open questions we mention.
The ability to go 1->4 or 2->5 by the behavioural-cloning approach would assume that the difficulty of interpreting all parts of the model is fairly similar, and it just takes time for the humans to interpret all the parts, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me), then I wouldn’t expect the behaviourally cloned interpretation agent to generalise to being able to correctly interpret the worst-case stuff.
Another point worth making here is why I haven’t separated out worst-case inspection transparency for deceptive models vs. worst-case training process transparency for deceptive models there. That’s because, while technically the latter is strictly more complicated than the former, I actually think that they’re likely to be equally difficult. In particular, I suspect that the only way that we might actually have a shot at understanding worst-case properties of deceptive models is through understanding how they’re trained.
I’d be curious to hear a bit more justification for this. It feels like resting on this intuition as a reason not to include worst-case inspection transparency for deceptive models as a separate node is a bit of a brittle choice (i.e. it makes it more likely the tech tree would change if we got new information). You write
That is, if our ability to understand training dynamics is good enough, we might be able to make it impossible for a deceptive model to evade us by always being able to see its planning for how to do so during training.
which to me is a justification that worst-case inspection transparency for deceptive models is solved if we solve worst-case training process transparency for deceptive models, but not a justification that that’s the only way to solve it.
This work looks super interesting, definitely keen to see more!
Will you open-source your code for running the experiments and producing the plots? I’d definitely be keen to play around with it. (They already did, here: https://github.com/adamjermyn/toy_model_interpretability - I just missed it. Thanks! Although it would be useful to have the plotting code as well, if that’s easy to share?)

I agree that N (true feature dimension) > d (observed dimension), and that sparsity will be high, but I’m uncertain whether the other part of the regime (which you don’t mention here), that k (model latent dimension) > N, is likely to be true. Do you think that is likely to be the case? As an analogy, I think the intermediate feature dimensions in MLP layers in transformers (analogously k) are of much lower dimension than the “true intrinsic dimension of features in natural language” (analogously N), even if they are larger than the input dimension (embedding dimension * num_tokens, analogously d). So I expect N > k > d, whereas in your regime k > N > d. Do you think you’d be able to find monosemantic networks for k < N? Did you try out this regime at all? (I don’t think I could find it in the paper.)
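For reference, the way I’m picturing the three dimensions (a schematic of my own, not necessarily exactly the paper’s setup, with arbitrary example sizes):

```python
import torch

N, d, k = 512, 64, 1024      # true features, observed dims, model latent dims (the paper's regime: k > N > d)
# the regime I'm asking about would instead have N > k > d, e.g. N, d, k = 512, 64, 256

sparsity = 0.95
proj = torch.randn(N, d) / d**0.5          # fixed projection: true features -> observations

def sample_batch(b=256):
    feats = torch.rand(b, N) * (torch.rand(b, N) > sparsity)   # sparse, non-negative true features
    return feats @ proj, feats                                 # (observations in R^d, targets in R^N)

model = torch.nn.Sequential(               # observations -> latent (k) -> reconstructed features
    torch.nn.Linear(d, k), torch.nn.ReLU(), torch.nn.Linear(k, N),
)
```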
In the paper you say that you weakly believe monosemantic and polysemantic network parametrisations are likely in different loss basins, given that they’re implementing very different algorithms. I think (given the size of your networks) it should be easy to test for at least linear mode connectivity with something like git re-basin (https://github.com/samuela/git-re-basin); a sketch of the kind of check I mean is below. Have you tried doing that? I think there are also algorithms for finding non-linear (e.g. quadratic) mode connectivity, although I’m less familiar with them. If it is the case that they’re in different basins, I’d be curious to see whether there are just two basins (poly vs mono), or a basin for each level of monosemanticity, or whether even within a level of polysemanticity there are multiple basins. If it’s one of the former cases, it’d be interesting to do something like the connectivity-based fine-tuning talked about here (https://openreview.net/forum?id=NZZoABNZECq - in effect, optimising for a new parametrisation that is linearly disconnected from the previous one), and see whether doing that from a polysemantic initialisation produces a more monosemantic network, or whether it just becomes polysemantic in a different way.
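The linear-connectivity check I have in mind is just interpolating between the two parametrisations and looking for a loss barrier (a rough sketch; it ignores the permutation alignment that git re-basin would do first, and assumes the two models share an architecture):

```python
import copy
import torch

@torch.no_grad()
def loss_barrier(model_a, model_b, loss_fn, data, steps=11):
    """Loss along the straight line between two parametrisations; a large bump in the
    middle relative to the endpoints suggests they sit in different (linear) basins."""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interp = copy.deepcopy(model_a)
        for p, pa, pb in zip(interp.parameters(), model_a.parameters(), model_b.parameters()):
            p.copy_((1 - alpha) * pa + alpha * pb)
        losses.append(sum(loss_fn(interp(x), y).item() for x, y in data))
    return losses   # compare max(losses) against losses[0] and losses[-1]
```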
You also mentioned that your initial attempts at sparsity through a hard-coded, initially-sparse matrix failed; I’d be very curious to see whether lottery-ticket-style iterative magnitude pruning could produce sparse matrices from the high-latent-dimension monosemantic networks that are still monosemantic, or more broadly how the LTH interacts with polysemanticity: are lottery tickets less polysemantic, or more, or do they not really change the level of monosemanticity?
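By lottery-ticket-style iterative magnitude pruning I mean the usual loop, roughly as below (a sketch of the generic procedure, not code from the LTH papers; the pruning here is per parameter tensor, and it assumes `train_fn` applies the masks during training):

```python
import torch

def iterative_magnitude_pruning(make_model, train_fn, prune_frac=0.2, rounds=5):
    """Train, prune the smallest-magnitude surviving weights, rewind survivors to their
    initial values, and repeat. Returns the rewound model and the final masks; one more
    call to train_fn(model, masks) then trains the sparse 'ticket'."""
    model = make_model()
    init_state = {k: v.clone() for k, v in model.state_dict().items()}
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)
        for n, p in model.named_parameters():
            alive = p[masks[n].bool()].abs()
            if alive.numel() == 0:
                continue
            thresh = alive.quantile(prune_frac)              # cut the smallest prune_frac of survivors
            masks[n] = masks[n] * (p.abs() > thresh).float()
        model.load_state_dict(init_state)                    # rewind to initialisation (classic LTH)
    return model, masks
```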
If my understanding of the bias decay method is correct, is a large initial part of training spent only reducing the bias (through weight decay) until certain neurons start firing? If that’s the case, could you calculate the maximum output in the latent dimension on the dataset at the start of training (say B), and then initialise the bias to be just below -B, so that you skip almost all of the portion of training that’s only moving the bias term? You could do this per neuron, or just max over neurons. Or is this portion of training relatively small compared to the rest, with the slower convergence due more to fewer neurons getting gradients even when some of them are outputting higher than the bias?
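Concretely, the per-neuron version of the initialisation I’m imagining (a sketch; `encoder` stands for whatever produces the pre-bias latent activations in your setup):

```python
import torch

@torch.no_grad()
def init_bias_just_below_max(encoder, bias, data_loader, eps=1e-3):
    """Set each latent neuron's bias slightly below minus the largest pre-bias activation
    it sees at initialisation, so training skips the phase that only decays the bias."""
    max_act = torch.full_like(bias, float("-inf"))
    for x in data_loader:
        pre = encoder(x)                                  # shape (batch, k), before adding the bias
        max_act = torch.maximum(max_act, pre.max(dim=0).values)
    bias.copy_(-(max_act + eps))
```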