This is great work, even though you weren’t able to understand the memorization mechanistically.
I agree that a big part of the reason to be pessimistic about ambitious mechanistic interp is that even very large neural networks perform some amount of pure memorization. For example, existing LMs can often regurgitate canary strings, which really seems like a case without any macrofeatures (to use your phrase). Consequently, as you discuss in both posts 3 and 4, it’s not clear that there even should be a good mechanistic understanding of how neural networks implement factual recall. In the pathological worst case, the hash-and-lookup algorithm, there are no nontrivial, interpretable features to recover at all.
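(To make that worst case concrete, here’s a toy sketch of what I mean by hash-and-lookup; the hashing scheme and names are just my own illustration, not anything from the posts:)

```python
# Hypothetical illustration (mine, not from the posts) of the pathological
# "hash and lookup" algorithm: the only intermediate quantity is an opaque
# hash of the input, so there are no meaningful features for interp to find.
import hashlib

# Arbitrary memorized facts: prompt -> completion.
FACTS = {
    "Michael Jordan plays the sport of": "basketball",
    "The capital of France is": "Paris",
}

# Lookup table keyed by an opaque digest rather than by anything semantic.
TABLE = {hashlib.sha256(k.encode()).hexdigest(): v for k, v in FACTS.items()}

def recall(prompt: str):
    """Recall a memorized completion via hash-and-lookup.

    The digest carries no interpretable structure: related prompts map to
    unrelated keys, so there is no "sport" or "basketball" feature anywhere
    in the intermediate computation.
    """
    return TABLE.get(hashlib.sha256(prompt.encode()).hexdigest())

print(recall("Michael Jordan plays the sport of"))  # -> basketball
```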
One hope here is that no “interesting” cognition depends in large part on uninterpretable memorization[1]; maybe understanding the circuits is sufficient for the kind of understanding we need. It might be that, for any dangerous capability a model could implement, we can’t really understand how the facts it uses are mechanistically built up, or even necessarily what all of those facts are, but we can at least recognize the circuits building on top of the factual representations, and do something useful with that level of understanding.
I also agree that SAEs are probably not a silver bullet for this problem. (Or at least, it’s not clear that they are.) For common names like “Michael Jordan”, it seems likely that a sufficiently wide SAE would recover that feature (it’s a macrofeature, to use the terminology from post 4). But I’m not sure how an SAE would help in one-off cases without internal structure, like predicting canary strings.
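(For reference, the kind of setup I’m imagining; this is my own minimal sketch of a standard SAE with arbitrary sizes, not anyone’s actual training code. The point is that a recurring macrofeature can plausibly end up as a dictionary direction, while a one-off memorized string has no comparable reusable direction for the dictionary to learn:)

```python
# Minimal sparse autoencoder sketch (my own; sizes and coefficients arbitrary).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 8192, l1_coeff: float = 1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        # Non-negative sparse codes over an overcomplete dictionary.
        codes = torch.relu(self.enc(acts))
        recon = self.dec(codes)
        # Reconstruction loss plus an L1 sparsity penalty on the codes.
        loss = (recon - acts).pow(2).mean() + self.l1_coeff * codes.abs().sum(-1).mean()
        return recon, codes, loss

# Usage on a batch of residual-stream activations (random stand-ins here).
sae = SparseAutoencoder()
acts = torch.randn(64, 512)
recon, codes, loss = sae(acts)
loss.backward()
```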
Absent substantial conceptual breakthroughs, my guess is that the majority of my hope for ambitious mechanistic interp lies in use cases that don’t require understanding factual recall. Given the negative results in this work despite significant effort, my best guess for how I’d study this problem would be to look at more toy models of memorization, perhaps building on Anthropic’s prior work on the subject. If it were cheap and I had more familiarity with the SOTA on SAEs, I’d probably also just throw some SAEs at the problem, to confirm that the obvious uses of SAEs wouldn’t help.
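Concretely, the first experiment I’d run might look something like the sketch below (my own setup, with arbitrary sizes and hyperparameters; not a description of anything in the posts): memorize arbitrary labels for random integer pairs, then check whether the hidden activations show any sparsity worth handing to an SAE.

```python
# Toy memorization sketch (mine; hyperparameters arbitrary): memorize random
# (int, int) -> label pairs, then probe how sparse the hidden layer is.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_PAIRS, N_INTS, N_LABELS, D_HIDDEN = 2000, 100, 50, 256

# Labels are random, so there is nothing to generalize -- only memorize.
pairs = torch.randint(0, N_INTS, (N_PAIRS, 2))
labels = torch.randint(0, N_LABELS, (N_PAIRS,))

model = nn.Sequential(
    nn.Embedding(N_INTS, 64),   # shared embedding for both integers
    nn.Flatten(),               # (batch, 2, 64) -> (batch, 128)
    nn.Linear(128, D_HIDDEN),
    nn.ReLU(),
    nn.Linear(D_HIDDEN, N_LABELS),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3000):
    loss = nn.functional.cross_entropy(model(pairs), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Crude interpretability probe: how sparse are the post-ReLU activations?
with torch.no_grad():
    hidden = model[:4](pairs)
    print("train accuracy:", (model(pairs).argmax(-1) == labels).float().mean().item())
    print("fraction of active hidden units:", (hidden > 0).float().mean().item())
```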
Also, some comments on specific quotes from this post:
First, I’m curious why you think this is true:
intuitively, the number of facts known by GPT-4 vs GPT-3.5 scales superlinearly in the number of neurons, let alone the residual stream dimension.
Why specifically do you think this is intuitively true? (I think this is plausible, but don’t think it’s necessarily obvious.)
Second, a nitpick: in this post, you say about post 4:
In post 4, we also studied a toy model mapping pairs of integers to arbitrary labels where we knew all the data and could generate as much as we liked, and didn’t find the toy model any easier to interpret, in terms of finding internal sparsity or meaningful intermediate states.
However, I’m not seeing any mention of trained models in post 4 -- is it primarily intended as a thought experiment to clarify the problem of memorization, or was part of the post missing?
(EDIT Jan 5 2024: in private correspondence with the authors, they’ve clarified that they have indeed done several toy experiments finding those results, but did not include them in post 4 because the results were uniformly negative.)
Hope that top-level reasoning stays dominant on the default AI development path
Currently, it seems like most AI systems’ consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo’s performance is mostly explained by MCTS and the way it’s trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.
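(As a caricature of the distinction: the top-level search loop below is a few lines of legible code, even though the evaluator it calls is a black box. This is my own simplified rollout search, not AlphaGo’s actual MCTS.)

```python
# Caricature of top-level vs. subsystem reasoning (my own sketch, not AlphaGo):
# the search loop is legible; the learned evaluator it calls is opaque.
import random

def value_net(state):
    # Stand-in for an uninterpretable learned evaluator.
    return random.random()

def legal_moves(state):
    # Stand-in for the game rules; fully legible.
    return [0, 1, 2]

def apply_move(state, move):
    return state + (move,)

def search(state, n_rollouts: int = 100, depth: int = 5):
    """Top-level reasoning: pick the first move whose random rollouts,
    as scored by the black-box value net, look best on average."""
    totals = {m: 0.0 for m in legal_moves(state)}
    for move in legal_moves(state):
        for _ in range(n_rollouts):
            s = apply_move(state, move)
            for _ in range(depth):
                s = apply_move(s, random.choice(legal_moves(s)))
            totals[move] += value_net(s) / n_rollouts
    return max(totals, key=totals.get)

print(search(state=()))
```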
This reminds me of an old MIRI post distinguishing between interpretable top-level reasoning and uninterpretable subsystem reasoning. While they imagined MCTS and SGD as examples of top-level reasoning (as opposed to the uninterpretable algorithms learned inside a neural network), this hope is similar to one of their paths to aligned AI: