Something in that direction, yeah
I found this clear and useful, thanks. Particularly the notes about compositional structure. For what it’s worth I’ll repeat here a comment from ILIAD, which is that there seems to be something in the direction of SAEs, approximate sufficient statistics / information bottleneck, the work of Achille-Soatto, and SLT (Section 5, iirc), which I had looked into after talking with Olah and Wattenberg about feature geometry but which isn’t currently a high priority for us. Somebody might want to pick that up.
I like the emphasis in this post on the role of patterns in the world in shaping behaviour, the fact that some of those patterns incentivise misaligned behaviour such as deception, and further that our best efforts at alignment and control are themselves patterns that could have this effect. I also like the idea that our control systems (even if obscured from the agent) can present as “errors” with respect to which the agent is therefore motivated to learn to “error correct”.
This post and the sharp left turn are among the most important high-level takes on the alignment problem for shaping my own views on where the deep roots of the problem are.
To be honest I had forgotten about this post, and therefore underestimated its influence on me, until performing this review (which caused me to update a recent article I wrote, the Queen’s Dilemma, which is clearly a kind of retelling of one aspect of this story, with an appropriate reference). Even so, I assess it to be a substantial influence on me.
I think this whole line of thought could be substantially developed, and with less reliance on stories, and that this would be useful.
This is an important topic, about which I find it hard to reason and on which I find the reasoning of others to be lower quality than I would like, given its significance. For that reason I find this post valuable. It would be great if there were longer, deeper takes on this issue available on LW.
This post and its precursor from 2018 present a strong and well-written argument for the centrality of mathematical theory to AI alignment. I think the learning-theoretic agenda, as well as Hutter’s work on ASI safety in the setting of AIXI, currently seems underrated and will rise in status. It is fashionable to talk about automating AI alignment research, but who is thinking hard about what those armies of researchers are supposed to do? Conceivably one of the main things they should do is solve the problems that Vanessa has articulated here.
The ideas of a frugal compositional language and infra-Bayesian logic seem very interesting. As Vanessa points out in Direction 2, it seems likely there are possibilities for interesting cross-fertilisation between LTA and SLT, especially in connection with Solomonoff-like ideas and inductive biases.
I have referred colleagues in mathematics interested in alignment to this post and have revisited it a few times myself.
I have been thinking about interpretability for neural networks seriously since mid-2023. The biggest early influences on me that I recall were Olah’s writings and a podcast that Nanda did. The third most important is perhaps this post, which I valued as an opposing opinion to help sharpen up my views.
I’m not sure it has aged well, in the sense that it’s no longer clear to me that I would direct someone to read this in 2025. I disagree with many of the object-level claims. However, given that some of the core mechanistic interpretability work is not being subjected to peer review, on balance I perhaps wish there were more sceptical writing like this.
It’s a question of resolution. Just looking at things for vibes is a pretty good way of separating wheat from chaff, but you don’t give scarce resources like jobs or grants to every grain of wheat that comes along. When I sit on a hiring committee, the discussions around the table are usually some mix of status markers and people having done the hard work of reading papers more or less carefully (the latter consuming time in greater-than-linear proportion to the distance from your own fields of expertise). Usually (unless nepotism is involved) someone who has done that homework can wield more power than they otherwise would at that table, because people respect strong arguments and understand that status markers aren’t everything.
Still, at the end of the day, an Annals paper is an Annals paper. It’s also true that to pass some of the early filters you need either (a) someone who speaks up strongly for you or (b) to pass the status-marker tests.
I am sometimes in a position these days of trying to bridge the academic status system and the Berkeley-centric AI safety status system, e.g. by arguing to a high-status mathematician that someone with illegible (to them) status is actually approximately equivalent in “worthiness of being paid attention to” to someone they know with legible status. Small increases in legibility can have outsize effects on how easy my life is in those conversations.
Otherwise it’s entirely down to me putting social capital on the table (“you think I’m serious, I think this person is very serious”). I’m happy to do this and continue doing this, but it’s not easily scalable, because it depends on my personal relationships.
To be clear, I am not arguing that evolution is an example of what I’m talking about. The analogy to thermodynamics in what I wrote is straightforwardly correct, no need to introduce KT-complexity and muddy the waters; what I am calling work is literally work.
There is a passage from Jung’s “Modern Man in Search of a Soul” that I think about fairly often on this point (p. 229 in my edition):
I know that the idea of proficiency is especially repugnant to the pseudo-moderns, for it reminds them unpleasantly of their deceits. This, however, cannot prevent us from taking it as our criterion of the modern man. We are even forced to do so, for unless he is proficient, the man who claims to be modern is nothing but an unscrupulous gambler. He must be proficient in the highest degree, for unless he can atone by creative ability for his break with tradition, he is merely disloyal to the past.
Is there a reason why the Pearson correlation coefficient of the data in Figure 14 is not reported? This correlation is referred to numerous times throughout the paper.
et al (!)
There’s no general theoretical reason that I am aware of to expect a relation between the L2 norm and the LLC. The LLC is the coefficient of the $\log n$ term in the asymptotic expansion of the free energy (the negative logarithm of the integral of the posterior over a local region, as a function of sample size $n$), while the L2 norm of the parameter shows up in the constant-order term of that same expansion, if you’re taking a Gaussian prior.
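To make the comparison explicit, here is the expansion schematically, in my notation (a local free energy over a neighbourhood $W^*$ of a parameter $w^*$, with $L_n$ the empirical loss and $\varphi$ the prior):

$$F_n(W^*) = -\log \int_{W^*} e^{-n L_n(w)}\,\varphi(w)\,dw \;\approx\; n L_n(w^*) + \lambda(w^*)\log n + O_p(\log\log n)\,.$$

With a Gaussian prior $\varphi(w) \propto e^{-\lVert w \rVert^2/2\sigma^2}$, the weight norm enters only through $-\log \varphi(w^*) = \lVert w^* \rVert^2/2\sigma^2 + \mathrm{const}$, which sits in the constant-order part of the expansion, whereas the LLC $\lambda(w^*)$ is the coefficient of $\log n$.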
It might be that in particular classes of neural networks there is some architecture-specific correlation between the L2 norm and the LLC, but I am not aware of any experimental or theoretical evidence for that.
For example, in the figure below from Hoogland et al. (2024) we see that there are later stages of training in a transformer trained to do in-context linear regression (blue shaded regions) where the LLC is decreasing but the L2 norm is increasing. So the model is moving towards a “simpler” parameter with larger weight norm.
My best current guess is that, in the grokking example, it just happens that the simpler solution has smaller weight norm. This could be true in many synthetic settings, for all I know; however, in general it is not the case that complexity (at least as far as SLT is concerned) and weight norm are correlated.
That simulation sounds cool. The talk certainly doesn’t contain any details and I don’t have a mathematical model to share at this point. One way to make this more concrete is to think through Maxwell’s demon as an LLM, for example in the context of Feynman’s Lectures on Computation. The literature on thermodynamics of computation (various experts, like Adam Shai and Paul Riechers, are around here and know more than me) implicitly or explicitly touches on relevant issues.
The analogous laws are just information theory.
Re: a model trained on random labels. This seems somewhat analogous to building a power plant out of dark matter: to derive physical work it isn’t enough to have some degrees of freedom somewhere that have a lot of energy; one also needs a chain of couplings between those degrees of freedom and the degrees of freedom you want to act on. Similarly, if I want to use a model to reduce my uncertainty about something, I need to construct a chain of random variables with nonzero mutual information linking the question in my head to the predictive distribution of the model.
To take a concrete example: suppose I am thinking about a chemistry question with four choices A, B, C, D. Without any information other than these letters, the model cannot reduce my uncertainty (say I begin with equal belief in all four options). However, if I provide a prompt describing the question, and the model has been trained on chemistry, then this information sets up a correspondence between this distribution over four letters and something the model knows about; its answer may then reduce my distribution to being equally uncertain between A and B while knowing C and D are wrong (a change of 1 bit in my entropy).
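Spelled out, the arithmetic in that example is just

$$H_{\text{before}} = \log_2 4 = 2 \text{ bits}, \qquad H_{\text{after}} = \log_2 2 = 1 \text{ bit}, \qquad \Delta H = 1 \text{ bit}.$$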
Since language models are good general compressors, this seems to work in reasonable generality.
Ideally we would like the model to push our distribution towards true answers, but it doesn’t necessarily know true answers, only some approximation; thus the work being done is nontrivially directed, and has a systematic overall effect due to the nature of the model’s biases.
I don’t know about evolution. I think it’s right that the perspective has limits and can become empty slogans outside of careful usage. I don’t know how useful it is for actual technical reasoning about AI safety at scale, but it’s a fun idea to play around with.
Marcus Hutter on AIXI and ASI safety
Yes this seems like an important question but I admit I don’t have anything coherent to say yet. A basic intuition from thermodynamics is that if you can measure the change in the internal energy between two states, and the heat transfer, you can infer how much work was done even if you’re not sure how it was done. So maybe the problem is better thought of as learning to measure enough other quantities that one can infer how much cognitive work is being done.
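In symbols this is just the first law (with the convention that $W$ is the work done by the system):

$$\Delta U = Q - W \quad\Longrightarrow\quad W = Q - \Delta U,$$

so measuring $\Delta U$ and $Q$ determines $W$ without knowing the mechanism by which the work was done.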
For all I know there is a developed thermodynamic theory of learning agents out there which already does this, but I haven’t found it yet...
The description of love at the conclusion of Gene Wolfe’s The Wizard gets at something important, if you read it as something that both parties are simultaneously doing.
Thanks Jesse, Ben. I agree with the vision you’ve laid out here.
I’ve spoken with a few mathematicians about my experience using Claude Sonnet and o1, o1-Pro for doing research, and there’s an anecdote I have shared a few times which gets across one of the modes of interaction that I find most useful. Since these experiences inform my view on the proper institutional form of research automation, I thought I might share the anecdote here.
Sometime in November 2024 I had a striking experience with Claude Sonnet 3.5. At the end of a workday I regularly paste in the LaTeX for the paper I’m working on and ask for its opinion, related work I might be missing, and techniques it thinks I might find useful. I finish by asking it to speculate on how the research could be extended. Usually this produces enthusiastic and superficially interesting ideas, which are, however, useless.
On this particular occasion, however, the model proceeded to elaborate a fascinating and far-reaching vision of the future of theoretical computer science. In fact I recognised the vision, because it was the vision that led me to write the document. However, none of that was explicitly in the LaTeX file. What the model could see was some of the initial technical foundations for that vision, but the fancy ideas were only latent. In fact, I have several graduate students working with me on the project and I think none of them saw what the model saw (or at least not as clearly).
I was impressed, but not astounded, since I had already thought the thoughts. But one day soon, I will ask a model to speculate and it will come up with something that is both fantastic and new to me.
Note that Claude Sonnet 3.5/3.6 would, in my judgement, be incapable of delivering on that vision. o1-Pro is going to get a bit further. However, Sonnet in particular has a broad vision and “good taste”, and has a remarkable knack for “surfing the vibes” around a set of ideas. A significant chunk of cutting-edge research comes from just being familiar at a “bones-deep” level with a large set of ideas and tools, and knowing what to use, and where, in the Right Way. Then there is technical mastery to actually execute when you’ve found the way; put the vibe surfing and technical mastery together and you have a researcher.
In my opinion the current systems have the vibe surfing, now we’re just waiting for the execution to catch up.