I’d say mechanistic interpretability is trending toward a field which cares about & researches the problems you mention. For example, the doppelganger problem is a fairly standard criticism of the sparse autoencoder work, diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time, or inductive biases research, especially with a focus on phase changes (a growing focus area), and though I’m having a hard time parsing your creativity post (an indictment of me, not of you, as I didn’t spend too long with it), it seems the kind of thing which would come from the study of in-context-learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
ETA: An argument could be that though these problems will come up, ultimately the field will prioritize hacky fixes in order to deal with them, which only sweep the problems under the rug. I think many in MI will prioritize such limited fixes, but also that some won’t, and due to the benefits of such problems becoming empirical, such people will be able to prove the value of their theoretical work & methodology by convincing MI people with their practical applications, and money will get diverted to such theoretical work & methodology by DL-theory-traumatized grantmakers.
the doppelganger problem is a fairly standard criticism of the sparse autoencoder work,
And what’s the response to the criticism, or a/the hoped approach?
diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time
Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far, that I might consider approaching some core alignment difficulties?
it seems the kind of thing which would come from the study of in-context-learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).
Generally what you’re bringing up sounds like “yes these are problems and MI would like to think about them… later”. Which is understandable, but yeah, that’s what streetlighting looks like.
Maybe an implicit justification of current work is like:
There’s these more important, more difficult problems. We want to deal with them, but they are too hard right now, so we will try in the future. Right now we’ll deal with simpler things. By dealing with simpler things, we’ll build up knowledge, skills, tools, and surrounding/supporting orientation (e.g. explaining weird phenomena that are actually due to already-understandable stuff, so that later we don’t get distracted). This will make it easier to deal with the hard stuff in the future.
This makes a lot of sense—it’s both empathizandable, and seems probably somewhat true. However:
Again, it still isn’t in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I don’t know of any clear progress on your interests yet. My argument was about the trajectory MI is on, which I think is largely pointed in the right direction. We can argue about the speed at which it gets to the hard problems, whether it’s fast enough, and how to make it faster, though. So you seem to have understood me well.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I think I’m more agnostic than you are about this, and also about how “deeply” flawed MI’s intuitions are. If you’re right, once the field progresses to nontrivial dynamics, we should expect those operating at a higher level of analysis—conceptual MI—to discover more than those operating at a lower level, right?
If, hypothetically, we were doing MI on minds, then I would predict that MI will pick some low hanging fruit and then hit walls where their methods will stop working, and it will be more difficult to develop new methods that work. The new methods that work will look more and more like reflecting on one’s own thinking, discovering new ways of understanding one’s own thinking, and then going and looking for something like that in the in-vitro mind. IDK how far that could go. But then this will completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.
However, another dealbreaker problem with current and current-trajectory MI is that it isn’t studying minds.