Outtakes.
Here’s some stuff I cut from the main piece that nevertheless seems worth saying.
The mech interp bubble.
Hot take: The mech interp bubble is real. I’ve lived in it. To be clear, I’m not saying that all MI researchers live in a bubble, nor that MI is disconnected from adjacent fields. But there seems to be an insidious tendency for MI researchers to subtly self-select into working on increasingly esoteric things that make sense within the MI paradigms but have little to do with the important problems and developments outside of them.
As a mech interp researcher, your skills and interests mainly resonate with other mech interp researchers. Your work becomes embedded in specific paradigms that aren’t always legible to the broader ML or AI safety community. Concrete claim: If you work on SAEs outside a big lab, chances are that your work is not of interest outside the mech interp community. (This doesn’t mean it’s not valuable. Just that it’s niche.)
(Speaking from personal experience:) Mech interp folk also have a tendency to primarily think about interp-based solutions to problems. About interp-related implications of new developments. About interp-based methods as cool demos of interp, not as practical solutions to problems.
Tl;dr Mech interp can easily become the lens through which you see the world. This lens is sometimes useful, sometimes not. You have to know when you should take it off. A pretty good antidote is talking to people who do not work on interp.
Does Mech Interp Yield Transferable Skills?
This is admittedly based on a sample size of 1. But I’m not sure that mech interp has given me much knowledge in the way of training, fine-tuning, or evaluating language models. It also hasn’t taught me much about designing scalable experiments or writing scalable, high-performance research code.
My impression is that mech interp has taught me ‘don’t fool yourself’, a ‘hacker mindset’, and ‘sprint to the graph’. All of these are really valuable, but they aren’t concrete object-level skills or tacit knowledge in the way that the skills above are.
What is Intelligence Even, Anyway?
What if, as Gwern proposes, intelligence is simply “search over the space of Turing machines”, i.e. AIXI? Currently this definition feels closest to the empirical realities of ML capabilities: ‘expert knowledge’ and ‘inductive bias’ have largely lost out to the cold realities of scaling compute and data.
All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we’re searching over more and longer Turing machines, and we are applying them in each specific case.
Otherwise, there is no general master algorithm. There is no special intelligence fluid. It’s just a tremendous number of special cases that we learn and we encode into our brains.
If this turns out to be correct, as opposed to something more ‘principled’ or ‘organic’ like natural abstractions or shard theory, why would we expect to be able to understand it (mechanistically or otherwise) at all? In this world we should be focusing much more on scary demos or evals or other things that seem robustly good for reducing X-risk.
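For concreteness, this is the formal object being gestured at. A minimal sketch in standard notation (Hutter’s AIXI with the Solomonoff prior; the symbols U, ℓ(q), a_k, o_k, r_k are standard textbook notation, not anything from this post): every program q for a universal machine U is weighted by 2^(−ℓ(q)), and the agent does expectimax planning over that mixture of environments.

```latex
% Solomonoff prior over observation strings x: sum over programs q whose
% output on the universal prefix machine U starts with x, weighted by length.
M(x) \;=\; \sum_{q \,:\, U(q) = x*} 2^{-\ell(q)}

% AIXI action selection at step k with horizon m (Hutter's formulation):
% expectimax over future actions and percepts, with environments weighted
% by the same 2^{-\ell(q)} program-length prior.
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl( r_k + \cdots + r_m \bigr)
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

This is also the sense in which AIXI has a strong built-in inductive bias: shorter programs (lower Kolmogorov complexity) get exponentially more weight.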
I don’t think it can literally be AIXI/search over Turing machines, because that’s an extremely unrealistic model of how future AIs work, but I do think a related claim is true: inductive biases mattered a lot less than we thought in the 2000s to early 2010s, and this matters.
The pitch for natural abstractions is that the compute limits of real AGIs/ASIs force abstraction rather than brute-force simulation of the territory, combined with the hope that abstractions are closer to discrete than continuous, and the hope that other minds naturally learn these abstractions in pursuit of capabilities (I see LLMs as evidence that natural abstractions are relevant). But yes, I think a mechanistic understanding of how AIs work is likely not to exist in time, if at all, so I am indeed somewhat bearish on mech interp.
This is why I tend to favor direct alignment approaches like altering the data over approaches that rely on interpretability.
That said, while I’m confident that literally learning special cases and just searching/look-up tables isn’t how current AIs work, there is an important degree of truth to this in general, in that we are just searching over more and larger objects (though in my case it’s not restricted to Turing machines, but to any set defined formally using ZFC + Tarski’s Axiom at minimum; links below):
And more importantly, the maximal generalization of learning/intelligence is just that we are learning ever larger look-up tables, and optimal intelligences look like look-up tables containing other look-up tables once you weaken your assumptions enough.
https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory
https://en.wikipedia.org/wiki/Tarski%E2%80%93Grothendieck_set_theory
I view the no-free-lunch theorems as essentially asserting that there exists only one method that learns in the worst case, which is the highly inefficient look-up table, and that in the general case there are no shortcuts to learning a look-up table: you must pay the full exponential cost of storage and time (in finite domains).
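To make the ‘full exponential cost’ concrete, a standard counting argument (a minimal sketch, notation mine): over n-bit inputs there are doubly-exponentially many Boolean functions, and with no structural assumptions, specifying one of them just is a look-up table with one entry per possible input.

```latex
% Number of distinct Boolean functions on n-bit inputs:
\left|\{\, f : \{0,1\}^n \to \{0,1\} \,\}\right| \;=\; 2^{2^n}

% Storage for an arbitrary such f with no further assumptions:
% one output bit per possible input, i.e. a look-up table of size
2^n \ \text{bits}
```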
Does it make sense to say there is no inductive bias at work in modern ML models? It seems clear that literally brute-force searching ALL THE ALGORITHMS would still be unfeasible no matter how much compute you throw at it. Our models are very general, but when e.g. we use a diffusion model for images, it exploits (and is biased towards) the kind of local structure we expect of images; when we use a transformer for text, it exploits (and is biased towards) the kind of sequential pair-correlation you see in natural language; etc.
I agree with the claim that a little inductive bias has to be there, solely because AIXI is an utterly unrealistic model of what future AIs will look like, and even AIXI-tl is very infeasible, but I think the closer claim is that the data matters way more than the architectural bias, which does turn out to be true.
One example is that attention turns out to be more or less replaceable by MLP mixtures (I have heard this from Gwern, but can’t verify it), or see the link below:
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
This is relevant for AI alignment and AI capabilities.
That sounds more like my intuition, though obviously there still have to be differences given that we keep using self-attention (quadratic in N) instead of MLPs (linear in N).
In the limit of infinite scaling, the fact that MLPs are universal function approximators is a guarantee that you can do anything with them. But obviously we still would rather have something that can actually work with less-than-infinite amounts of compute.
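For reference, the per-layer costs behind the ‘quadratic in N vs. linear in N’ contrast (standard big-O accounting for sequence length N and model width d; the exact terms depend on the implementation):

```latex
% Self-attention layer: N x N attention matrix plus the linear projections.
\text{cost}_{\text{attention}} \;=\; O(N^2 d + N d^2)

% Per-token MLP layer: the same MLP applied independently at each position.
\text{cost}_{\text{MLP}} \;=\; O(N d^2)
```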
I didn’t say there’s no inductive bias at work in models. Merely that trying to impose your own inductive biases on models is probably doomed for reasons best stated in the Bitter Lesson. The effectiveness of pretraining / scaling suggests that inductive biases work best when they are arrived at organically by very expressive models training on very large amounts of data.
Sure, but it’s unclear whether these inductive biases are necessary.
Concrete example: randomly initialised CNN weights can extract sufficiently good features to linearly classify MNIST. Another example: randomly initialised transformers with only embeddings learnable can do modular addition.
Edit: On reflection, the two examples above actually support the idea that the inductive bias of the architecture is immensely helpful to solving tasks (to the extent that the weights themselves don’t really matter). A better example of my point is that MLP-mixer models can be as good as CNNs on vision tasks despite having much smaller architectural inductive biases towards vision.
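For anyone who wants to poke at the first example themselves, a minimal sketch (my own illustrative code, not from the thread; it assumes torch and torchvision are installed, and the architecture and hyperparameters are arbitrary choices): freeze a randomly initialised CNN, train only a linear probe on its features, and measure MNIST test accuracy.

```python
# Sketch: how well does a linear probe on *frozen, random* CNN features do on MNIST?
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen, randomly initialised convolutional feature extractor (never trained).
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),  # -> 64 * 4 * 4 = 1024 features
).to(device)
for p in feature_extractor.parameters():
    p.requires_grad_(False)

# The only trainable part: a linear probe on top of the frozen random features.
probe = nn.Linear(64 * 4 * 4, 10).to(device)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tfm = transforms.ToTensor()
train = DataLoader(datasets.MNIST("data", train=True, download=True, transform=tfm),
                   batch_size=256, shuffle=True)
test = DataLoader(datasets.MNIST("data", train=False, download=True, transform=tfm),
                  batch_size=1024)

for epoch in range(3):
    for x, y in train:
        x, y = x.to(device), y.to(device)
        loss = loss_fn(probe(feature_extractor(x)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

with torch.no_grad():
    correct = 0
    for x, y in test:
        pred = probe(feature_extractor(x.to(device))).argmax(-1)
        correct += (pred == y.to(device)).sum().item()
print(f"Linear probe on frozen random CNN features: {correct / len(test.dataset):.1%} test accuracy")
```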
My understanding is that they’re necessary even in principle, since there are an unbounded number of functions that fit any finite set of points. Even AIXI has a strong inductive bias, toward programs with the lowest Kolmogorov complexity.
Yeah, I agree that some inductive bias is probably necessary. But not all inductive biases are equal; some are much more general than others, and in particular I claim that ‘narrow’ inductive biases (e.g. specializing architectures to match domains) probably have net ~zero benefit compared to those learned from data.
Interesting. But CNNs were originally developed for a reason, and MLP-mixer does mention a rather specific architecture as well as “modern regularization techniques”. I’d say all of that counts as baking some inductive biases into the model, though I agree it’s a very light touch.
AIXI is just search over Turing machines, only computationally unbounded. I am pretty sure that Gwern had AIXI in mind when he spoke about search over Turing machines.
I think that you can’t say anything “more principled” than AIXI if you don’t account for the reality you are already in. Our reality is not generated by a random Turing machine; it is generated by a very specific program that creates 3+1 space-time with certain space-time symmetries and certain particle fields, etc., and “intelligence” here is “how good you are at approximating the algorithm for optimal problem-solving in this environment”.
Can you elaborate on what ‘sprint to the graph’ is?
Upon reflection, I agree with this (that AIXI is just computationally unbounded search over Turing machines).
I also agree that you can’t get anything more ‘principled’ than AIXI without accounting for the reality you are in, and I think this reinforces the point I was attempting to make in that comment: ‘search over Turing machines’ is so general as to yield very little insight, and any further assumptions risk being invalid.
‘Sprint to the graph’ refers to the mentality of ‘research sprints should aim to reach the first somewhat-shareable research result ASAP’, usually a single graph.
Referring to the section “What is Intelligence Even, Anyway?”:
I think AIXI is fairly described as a search over the space of Turing machines. Why do you think otherwise? Or maybe are you making a distinction at a more granular level?
Upon consideration I think you are right, and I should edit the post to reflect that. But I think the claim still holds (if you expect intelligence to look like AIXI, then it seems quite unlikely that you should expect to be able to understand it without further priors).