Outtakes.
Here’s some stuff I cut from the main piece that nevertheless seems worth saying.
The mech interp bubble.
Hot take: The mech interp bubble is real. I’ve lived in it. To be clear, I’m not saying that all MI researchers live in a bubble, nor that MI is disconnected from adjacent fields. But there seems to be an insidious tendency for MI researchers to subtly self-select into working on increasingly esoteric things that make sense within the MI paradigms but have little to do with the important problems and developments outside of them.
As a mech interp researcher, your skills and interests mainly resonate with other mech interp researchers. Your work becomes embedded in specific paradigms that aren’t always legible to the broader ML or AI safety community. Concrete claim: If you work on SAEs outside a big lab, chances are that your work is not of interest outside the mech interp community. (This doesn’t mean it’s not valuable. Just that it’s niche.)
(Speaking from personal experience:) Mech interp folk also have a tendency to primarily think about interp-based solutions to problems. About interp-related implications of new developments. About interp-based methods as cool demos of interp, not as practical solutions to problems.
Tl;dr Mech interp can easily become the lens through which you see the world. This lens is sometimes useful, sometimes not. You have to know when you should take it off. A pretty good antidote is talking to people who do not work on interp.
Does Mech Interp Yield Transferable Skills?
This is admittedly based on a sample size of 1. But I’m not sure that mech interp has given me much knowledge in the way of training, fine-tuning, or evaluating language models. It also hasn’t taught me much about designing scalable experiments or writing scalable, high-performance research code.
My impression is that mech interp has taught me ‘don’t fool yourself’, a ‘hacker mindset’, and ‘sprint to the graph’. All of these are really valuable, but they aren’t concrete object-level skills or tacit knowledge in the way that the skills above are.
What is Intelligence Even, Anyway?
What if, as Gwern proposes, intelligence is simply “search over the space of Turing machines”, i.e. AIXI? Currently this definition feels closest to the empirical realities of ML capabilities: ‘expert knowledge’ and ‘inductive bias’ have largely lost out to the cold realities of scaling compute and data.
All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we’re searching over more and longer Turing machines, and we are applying them in each specific case.
Otherwise, there is no general master algorithm. There is no special intelligence fluid. It’s just a tremendous number of special cases that we learn and we encode into our brains.
If this turns out to be correct, as opposed to something more ‘principled’ or ‘organic’ like natural abstractions or shard theory, why would we expect to be able to understand it (mechanistically or otherwise) at all? In this world we should be focusing much more on scary demos or evals or other things that seem robustly good for reducing X-risk.
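For concreteness, this is the formal object being gestured at. A minimal sketch in standard notation (Hutter’s AIXI with the Solomonoff prior; the symbols U, ℓ(q), a_k, o_k, r_k are standard textbook notation, not anything from this post): every program q for a universal machine U is weighted by 2^(−ℓ(q)), and the agent does expectimax planning over that mixture of environments.

```latex
% Solomonoff prior over observation strings x: sum over programs q whose
% output on the universal prefix machine U starts with x, weighted by length.
M(x) \;=\; \sum_{q \,:\, U(q) = x*} 2^{-\ell(q)}

% AIXI action selection at step k with horizon m (Hutter's formulation):
% expectimax over future actions and percepts, with environments weighted
% by the same 2^{-\ell(q)} program-length prior.
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl( r_k + \cdots + r_m \bigr)
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

This is also the sense in which AIXI has a strong built-in inductive bias: shorter programs (lower Kolmogorov complexity) get exponentially more weight.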
I don’t think it can literally be AIXI/search over Turing machines, because that’s an extremely unrealistic model of how future AIs work, but I do think a related claim is true: inductive biases mattered a lot less than we thought in the 2000s to early 2010s, and this matters.
The pitch for natural abstractions is that the compute limits of real AGIs/ASIs force abstraction rather than brute-force simulation of the territory, combined with the hope that abstractions are closer to discrete than continuous, and the hope that other minds naturally learn these abstractions in pursuit of capabilities (I see LLMs as evidence that natural abstractions are relevant). But yes, I think a mechanistic understanding of how AIs work is likely not to exist in time, if at all, so I am indeed somewhat bearish on mech interp.
This is why I tend to favor direct alignment approaches like altering the data over approaches that rely on interpretability.
That said, while I’m confident that literally learning special cases and just searching/look-up tables isn’t how current AIs work, there is an important degree of truth to this in general, in that we are just searching over more and larger objects (though in my case it’s not restricted to Turing machines, but to any set defined formally using ZFC + Tarski’s Axiom at minimum; links below):
And more importantly, the maximal generalization of learning/intelligence is just that we are learning ever larger look-up tables, and optimal intelligences look like look-up tables containing other look-up tables once you weaken your assumptions enough.
https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory
https://en.wikipedia.org/wiki/Tarski%E2%80%93Grothendieck_set_theory
I view the no-free-lunch theorems as essentially asserting that there exists only one method that learns in the worst case, which is the highly inefficient look-up table, and that in the general case there are no shortcuts to learning a look-up table: you must pay the full exponential cost of storage and time (in finite domains).
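To make the ‘full exponential cost’ concrete, a standard counting argument (a minimal sketch, notation mine): over n-bit inputs there are doubly-exponentially many Boolean functions, and with no structural assumptions, specifying one of them just is a look-up table with one entry per possible input.

```latex
% Number of distinct Boolean functions on n-bit inputs:
\left|\{\, f : \{0,1\}^n \to \{0,1\} \,\}\right| \;=\; 2^{2^n}

% Storage for an arbitrary such f with no further assumptions:
% one output bit per possible input, i.e. a look-up table of size
2^n \ \text{bits}
```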
Does it make sense to say there is no inductive bias at work in modern ML models? It seems clear that literally brute-force searching ALL THE ALGORITHMS would still be unfeasible no matter how much compute you throw at it. Our models are very general, but when e.g. we use a diffusion model for images, it exploits (and is biased towards) the kind of local structure we expect of images; when we use a transformer for text, it exploits (and is biased towards) the kind of sequential pair-correlation you see in natural language; etc.
I agree with the claim that a little inductive bias has to be there, solely because AIXI is an utterly unrealistic model of what future AIs will look like, and even AIXI-tl is very infeasible, but I think the closer claim is that the data matters way more than the architectural bias, which does turn out to be true.
One example is that attention turns out to be more or less replaceable by MLP mixtures (I have heard this from Gwern, but can’t verify it), or see the link below:
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
This is relevant for AI alignment and AI capabilities.
That sounds more like my intuition, though obviously there still have to be differences given that we keep using self-attention (quadratic in N) instead of MLPs (linear in N).
In the limit of infinite scaling, the fact that MLPs are universal function approximators is a guarantee that you can do anything with them. But obviously we still would rather have something that can actually work with less-than-infinite amounts of compute.
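For reference, the per-layer costs behind the ‘quadratic in N vs. linear in N’ contrast (standard big-O accounting for sequence length N and model width d; the exact terms depend on the implementation):

```latex
% Self-attention layer: N x N attention matrix plus the linear projections.
\text{cost}_{\text{attention}} \;=\; O(N^2 d + N d^2)

% Per-token MLP layer: the same MLP applied independently at each position.
\text{cost}_{\text{MLP}} \;=\; O(N d^2)
```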
I didn’t say there’s no inductive bias at work in models. Merely that trying to impose your own inductive biases on models is probably doomed for reasons best stated in the Bitter Lesson. The effectiveness of pretraining / scaling suggests that inductive biases work best when they are arrived at organically by very expressive models training on very large amounts of data.
Sure, but it’s unclear whether these inductive biases are necessary.
Concrete example: randomly initialised CNN weights can extract sufficiently good features to linearly classify MNIST. Another example: randomly initialised transformers with only embeddings learnable can do modular addition.
Edit: On reflection, the two examples above actually support the idea that the inductive bias of the architecture is immensely helpful to solving tasks (to the extent that the weights themselves don’t really matter). A better example of my point is that MLP-mixer models can be as good as CNNs on vision tasks despite having much smaller architectural inductive biases towards vision.
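For anyone who wants to poke at the first example themselves, a minimal sketch (my own illustrative code, not from the thread; it assumes torch and torchvision are installed, and the architecture and hyperparameters are arbitrary choices): freeze a randomly initialised CNN, train only a linear probe on its features, and measure MNIST test accuracy.

```python
# Sketch: how well does a linear probe on *frozen, random* CNN features do on MNIST?
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen, randomly initialised convolutional feature extractor (never trained).
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),  # -> 64 * 4 * 4 = 1024 features
).to(device)
for p in feature_extractor.parameters():
    p.requires_grad_(False)

# The only trainable part: a linear probe on top of the frozen random features.
probe = nn.Linear(64 * 4 * 4, 10).to(device)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tfm = transforms.ToTensor()
train = DataLoader(datasets.MNIST("data", train=True, download=True, transform=tfm),
                   batch_size=256, shuffle=True)
test = DataLoader(datasets.MNIST("data", train=False, download=True, transform=tfm),
                  batch_size=1024)

for epoch in range(3):
    for x, y in train:
        x, y = x.to(device), y.to(device)
        loss = loss_fn(probe(feature_extractor(x)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

with torch.no_grad():
    correct = 0
    for x, y in test:
        pred = probe(feature_extractor(x.to(device))).argmax(-1)
        correct += (pred == y.to(device)).sum().item()
print(f"Linear probe on frozen random CNN features: {correct / len(test.dataset):.1%} test accuracy")
```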
My understanding is that they’re necessary even in principle, since there are an unbounded number of functions that fit any finite set of points. Even AIXI has a strong inductive bias, toward programs with the lowest Kolmogorov complexity.
Yeah, I agree that some inductive bias is probably necessary. But not all inductive biases are equal; some are much more general than others, and in particular I claim that ‘narrow’ inductive biases (e.g. specializing architectures to match domains) probably have net ~zero benefit compared to those learned from data.
Interesting. But CNNs were originally developed for a reason, and MLP-mixer does mention a rather specific architecture as well as “modern regularization techniques”. I’d say all of that counts as baking some inductive biases into the model, though I agree it’s a very light touch.
AIXI is just search over Turing machines, only computationally unbounded. I am pretty sure that Gwern had AIXI in mind when he spoke about search over Turing machines.
I think that you can’t say anything “more principled” than AIXI if you don’t account for the reality you are already in. Our reality is not generated by a random Turing machine; it is generated by a very specific program that creates 3+1 space-time with certain space-time symmetries and certain particle fields, etc., and “intelligence” here is “how good you are at approximating the algorithm for optimal problem-solving in this environment”.
Can you elaborate on what ‘sprint to the graph’ is?
Upon reflection, I agree with this (that AIXI is just computationally unbounded search over Turing machines).
I also agree that you can’t get anything more ‘principled’ than AIXI without accounting for the reality you are in, and I think this reinforces the point I was attempting to make in that comment: ‘search over Turing machines’ is so general as to yield very little insight, and any further assumptions risk being invalid.
‘Sprint to the graph’ refers to the mentality of ‘research sprints should aim to reach the first somewhat-shareable research result ASAP’, usually a single graph.
Referring to the section “What is Intelligence Even, Anyway?”:
I think AIXI is fairly described as a search over the space of Turing machines. Why do you think otherwise? Or maybe are you making a distinction at a more granular level?
Upon consideration I think you are right, and I should edit the post to reflect that. But I think the claim still holds (if you expect intelligence to look like AIXI, then it seems quite unlikely that you should expect to be able to understand it without further priors).