All young people and other newcomers should be made aware that on-paradigm AI safety/alignment—while being more tractable, feedbacked, well-resourced, and populated compared to theory—is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect.
Half-agree. I think there’s scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification). I do think there is a lot of diversity in people working in these more legible areas and that means there are now many people who haven’t engaged with or understood the alignment problem well enough to realise where we might be suffering from the street light effect.
Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn’t trying to avoid, for example, the Doppelgänger problem or to at all handle diasystemic novelty or the ex quo of a mind’s creativity. [ETA: actually ELK I think addresses the Doppelgänger problem in its problem statement, if not in any proposed solutions.]
Meta:
I think there’s scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification).
You hedged your statement so much that it became true and also not very relevant. Here are the hedges:
“scope”: some research could be interpreted as trying to get to some other research, or as having a mission statement that includes some other research
“within field[s]”: some people / some research—or maybe no actual people or research, but possible research that would fit with the genre of the field
“closer to”: but maybe not close to, in an absolute sense
“or at least touch on”: if an academic philosopher wrote this about their work, you’d immediately recognize it as cope
“alignment agendas”: there aren’t any alignment agendas. There are alignment agendas in the sense that “we can start a colony around Proxima Centauri in the following way: 1. make a go-really-fast-er. 2. use the go-really-fast-er to go really fast towards Proxima Centauri” is an agenda to get to Proxima Centauri. If you make no mention of the part where you have to also slow down, and the part about steering, and the part where you have to shield from cosmic rays, and make a self-sustaining habitat on the ship, and the part about whether any of the planets around Proxima Centauri are remotely habitable… is this really an agenda?
Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn’t trying to avoid, for example, the Doppelgänger problem
I haven’t seen anyone do such interpretability research yet but I see no particular reason to think this is the sort of thing that can’t be studied empirically rather than the sort of thing that hasn’t been studied empirically. We have, for example, vision transformers and language transformers. I would be very surprised if there was a pure 1:1 mapping between the learned features in those two types of transformer models.
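To gesture at what a first empirical stab could look like (a sketch of mine, not something anyone has run): extract features from a vision transformer and a language transformer on paired inputs, e.g. images and their captions, and measure how close the two representations are to a linear correspondence with something like linear CKA. The feature matrices below are random placeholders standing in for real extracted activations.

```python
# Minimal sketch: linear CKA between two (n_samples, dim) representation matrices.
# The "features" here are random placeholders; in a real version they would be
# activations extracted from a vision transformer and a language transformer on
# paired inputs (e.g. images and their captions).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two (n_samples, dim) matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(numerator / denominator)

rng = np.random.default_rng(0)
n_pairs = 512                                        # hypothetical paired inputs
vision_feats = rng.normal(size=(n_pairs, 768))       # placeholder ViT features
language_feats = rng.normal(size=(n_pairs, 1024))    # placeholder LM features

print(f"CKA(vision, language) = {linear_cka(vision_feats, language_feats):.3f}")
# Near 1: close to a linear correspondence between the feature spaces.
# Near 0: little shared linear structure. Neither extreme by itself settles the
# Doppelgänger worry, but it is at least a measurement one can make.
```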
Well, empirically, when people try to study it empirically, instead they do something else. Surely that’s empirical evidence that it can’t be studied empirically? (I’m a little bit trolling but also not.)
I’d say mechanistic interpretability is trending toward a field which cares about & researches the problems you mention. For example, the doppelganger problem is a fairly standard criticism of the sparse autoencoder work, and diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time, or inductive biases research, especially with a focus on phase changes (a growing focus area). And though I’m having a hard time parsing your creativity post (an indictment of me, not of you, as I didn’t spend too long with it), it seems the kind of thing which would come from the study of in-context learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
ETA: An argument could be that though these problems will come up, ultimately the field will prioritize hacky fixes in order to deal with them, which only sweep the problems under the rug. I think many in MI will prioritize such limited fixes, but also that some won’t, and due to the benefits of such problems becoming empirical, such people will be able to prove the value of their theoretical work & methodology by convincing MI people with their practical applications, and money will get diverted to such theoretical work & methodology by DL-theory-traumatized grantmakers.
the doppelganger problem is a fairly standard criticism of the sparse autoencoder work,
And what’s the response to the criticism, or a/the hoped approach?
diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time
Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far, that I might consider approaching some core alignment difficulties?
it seems the kind of thing which would come from the study of in-context learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).
Generally what you’re bringing up sounds like “yes these are problems and MI would like to think about them… later”. Which is understandable, but yeah, that’s what streetlighting looks like.
Maybe an implicit justification of current work is like:
There’s these more important, more difficult problems. We want to deal with them, but they are too hard right now, so we will try in the future. Right now we’ll deal with simpler things. By dealing with simpler things, we’ll build up knowledge, skills, tools, and surrounding/supporting orientation (e.g. explaining weird phenomena that are actually due to already-understandable stuff, so that later we don’t get distracted). This will make it easier to deal with the hard stuff in the future.
This makes a lot of sense—it’s both empathizandable, and seems probably somewhat true. However:
Again, it still isn’t in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I don’t know of any clear progress on your interests yet. My argument was about the trajectory MI is on, which I think is largely pointed in the right direction. We can argue about the speed at which it gets to the hard problems, whether it’s fast enough, and how to make it faster, though. So you seem to have understood me well.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I think I’m more agnostic than you are about this, and also about how “deeply” flawed MI’s intuitions are. If you’re right, once the field progresses to nontrivial dynamics, we should expect those operating at a higher level of analysis—conceptual MI—to discover more than those operating at a lower level, right?
If, hypothetically, we were doing MI on minds, then I would predict that MI will pick some low hanging fruit and then hit walls where their methods will stop working, and it will be more difficult to develop new methods that work. The new methods that work will look more and more like reflecting on one’s own thinking, discovering new ways of understanding one’s own thinking, and then going and looking for something like that in the in-vitro mind. IDK how far that could go. But then this will completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.
However, another dealbreaker problem with current and current-trajectory MI is that it isn’t studying minds.
I mean my impression is that there are something on the order of 100-1000 people in the world working on ML interpretability as their day job, and maybe 1k-10k people who dabble in their free time. No research in the field will get done unless one of that small number of people makes a specific decision to tackle that particular research question instead of one of the countless other ones they could choose to tackle.
I don’t know what you’re trying to do in this thread (e.g. what question you’re trying to answer).
To be explicit, that was a response to
Well, empirically, when people try to study it empirically, instead they do something else
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
In other words, I suspect it’s not “when someone starts to study this phenomenon, some mysterious process causes them to study something else instead”. I think it’s “the surface area of the field is large and there aren’t many people in it, so I doubt anyone has even gotten to the part where they start to study this phenomenon.”
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-”streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
Edit 2: I misunderstood what point you were making as “prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions” (which is a perspective I disagree with pretty strongly) rather than “I think empirical research shouldn’t be the only game in town” (which I agree with) and “we should fund outsiders to go do stuff without much interaction with or feedback from the community to hopefully develop new ideas that are not contaminated with the current community biases” (I think this would be worth doing if resources were unlimited, not sure as things actually stand).
As a concrete note, I suspect work that demonstrates that philosophical or mathematical approaches can yield predictions about empirical questions is more likely to be funded. For example, in your post you say
In programming, adding a function definition would be endosystemic; refactoring the code into a functional style rather than an object-oriented style, or vice versa, in a way that reveals underlying structure, is diasystemic novelty.
Could that be operationalized as a prediction of the form
If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
(I expect that’s not a correct operationalization but something of that shape)
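For what it’s worth, here is a minimal sketch (mine, and probably also not the right operationalization) of just the train/eval split that prediction describes, to make the held-out cell explicit; all task and paradigm names are hypothetical placeholders.

```python
# Rough sketch of the train/eval split described above. All task and paradigm
# names are hypothetical placeholders; the only point is to make the held-out
# cell (complex task x functional code) explicit.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Task:
    name: str        # e.g. "next_token", "fill_in_middle", "find_syntax_error", "write_docs"
    paradigm: str    # "functional" or "object_oriented"
    complexity: str  # "simple" or "complex"

SIMPLE_TASKS = ["next_token", "fill_in_middle", "find_syntax_error"]
COMPLEX_TASKS = ["write_docs"]
PARADIGMS = ["functional", "object_oriented"]

train_set = (
    [Task(t, p, "simple") for t, p in product(SIMPLE_TASKS, PARADIGMS)]  # simple tasks: both paradigms
    + [Task(t, "object_oriented", "complex") for t in COMPLEX_TASKS]     # complex task: OO only
)
eval_set = [Task(t, "functional", "complex") for t in COMPLEX_TASKS]     # the prediction is about this cell

print("train:", [f"{t.name}/{t.paradigm}" for t in train_set])
print("held-out eval:", [f"{t.name}/{t.paradigm}" for t in eval_set])
```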
Here’s the convo according to me:
Bloom:
I think there’s scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas
BT:
Object level: ontology identification, in the sense that is studied empirically, is pretty useless.
sname:
I haven’t seen anyone do such interpretability research yet but I see no particular reason to think this is the sort of thing that can’t be studied empirically rather than the sort of thing that hasn’t been studied empirically.
BT:
Well, empirically, when people try to study it empirically, instead they do something else
sname:
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
BT:
ah, sname is talking about conceptual Doppelgängers specifically, as ze indicated in a previous comment that I now understand
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
“prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions”
Right, I’m not saying exactly this. But I am saying:
Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
(*): studying LLMs, which are not minds; trying to recognize [stuff we mostly conceptually understand] within systems rather than trying to come to conceptually understand [the stuff we’d need to be able to recognize/design in a mind, in order to determine the mind’s effects].
(I think this would be worth doing if resources were unlimited, not sure as things actually stand).
Well, you’ve agreed with a defanged version of my statements. The toothful version, which I do think: Insofar as this is even possible, we should allocate a lot more resources toward funding any high-caliber smart/creative/interesting/promising/motivated youngsters/newcomers who want to take a crack at independently approaching the core difficulties of AGI alignment, even if that means reallocating a lot of resources away from existing on-paradigm research.
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-”streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
This seems like a good thing to do. But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff. The absence of nice paths would really suck if it’s the case, and it’s hard to see how anyone could be justifiedly really confident that there are no nice paths. But yes, I’m saying that it looks like there aren’t nice paths, or at least there aren’t enough nice paths that we seem likely to find them by continuing to sample from the same distribution we’ve been sampling from; and I have some arguments and reasons supporting this belief, which seem true; and I would guess that a substantial fraction (though not most) of current alignment researchers would agree with a fairly strong version of “very few or no nice paths”.
Could that be operationalized as a prediction of the form
If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
I don’t think that’s a good operationalization, as you predict. I think it’s trying to be an operationalization related to my claim above:
ontology identification, in the sense that is studied empirically, is pretty useless. It [..] AFAIK isn’t trying to [...] at all handle diasystemic novelty [...].
But it sort of sounds like you’re trying to extract a prediction about capability generalization or something? Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.) I don’t necessarily recommend this sort of study, though; I favor theory.
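To make that slightly more concrete, here is a toy sketch of such an interpreter in a fully simulated setting (mine, and offered only as illustration, given that I don’t especially recommend this kind of study): “modules” of a fake mind drift at different rates from a crude heuristic toward Laplace-rule updating on a biased-coin task, and the interpreter, which only ever sees a random half of the modules and never an explicit representation of Bayes, flags when Bayesian-looking behavior has become widespread.

```python
# Toy sketch, fully synthetic. "Modules" of a fake mind predict the next flip of a
# biased coin; over "training" each module drifts, at its own rate, from a crude
# heuristic (always 0.5) toward Laplace-rule updating. The interpreter sees only a
# random half of the modules at each checkpoint ("drop out"), never an explicit
# representation of Bayes, and just asks which hypothesis class best explains each
# visible module's behavior, flagging when Bayesian-looking behavior is widespread.
import numpy as np

rng = np.random.default_rng(0)
N_MODULES, N_CHECKPOINTS, N_EPISODES, FLIPS = 40, 30, 200, 10
rates = rng.uniform(0.02, 0.15, size=N_MODULES)   # each module "learns" at its own rate

def module_outputs(alpha):
    """Return (module outputs, Bayesian answers, naive answers) on shared episodes."""
    theta = rng.uniform(0, 1, size=N_EPISODES)     # true coin biases
    heads = rng.binomial(FLIPS, theta)             # observed head counts
    bayes = (heads + 1) / (FLIPS + 2)              # Laplace rule of succession
    naive = np.full(N_EPISODES, 0.5)               # crude heuristic
    out = (1 - alpha[:, None]) * naive + alpha[:, None] * bayes
    return out + rng.normal(0, 0.02, size=out.shape), bayes, naive

for t in range(N_CHECKPOINTS):
    alpha = 1 - np.exp(-rates * t)                 # drift toward Bayesian behavior
    outputs, bayes, naive = module_outputs(alpha)
    visible = rng.choice(N_MODULES, size=N_MODULES // 2, replace=False)  # restricted access
    err_bayes = ((outputs[visible] - bayes) ** 2).mean(axis=1)
    err_naive = ((outputs[visible] - naive) ** 2).mean(axis=1)
    frac_bayesian = (err_bayes < 0.5 * err_naive).mean()  # better explained by Bayes
    if frac_bayesian > 0.5:
        print(f"checkpoint {t}: ~{frac_bayesian:.0%} of sampled modules look implicitly "
              f"Bayesian; predict an explicitization event soon")
        break
```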
Seems about right.
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
Is there a particular reason you expect there to be exactly one hard part of the problem, and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
If I were a prosaic alignment researcher, I probably would choose to prioritize which problems I worked on a bit differently than those currently in the field. However, I expect that the research that ends up being the most useful will not be that research which looked most promising before someone started doing it, but rather research that stemmed from someone trying something extremely simple and getting an unexpected result, and going “huh, that’s funny, I should investigate further”. I think that the process of looking at lots of things and trying to get feedback from reality as quickly as possible is promising, even if I don’t have a strong expectation that any one specific one of those things is promising to look at.
But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff
Certainly reality doesn’t owe us a path like that, but it would be pretty undignified if reality did in fact give us a path like that and we failed to find it because we didn’t even look.
Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.)
Interesting. I would be pretty interested to see research along these lines (although the scope of the above is probably still a bit large for a pilot project).
I don’t necessarily recommend this sort of study, though; I favor theory.
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It’s not like pointing to a little ball in ideaspace and being like “the answer is somewhere in here”. Rather it’s like cutting out a halfspace and saying “everything on this side of this plane is doomed, we’d have to be somewhere in the other half”, or like pointing out a manifold that all research is on and saying “anything on this manifold is doomed, we’d have to figure out how to move somewhat orthogonalward”.
research that stemmed from someone trying something extremely simple and getting an unexpected result
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don’t care if you have an army of people who all agree on taking a stance that seems to imply that there’s not much relevant difference between LLMs and future AGI systems that might kill everyone.)
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
I think you (and everyone else) don’t know how to ask this question properly. For example, “on whether your theory describes the world as it is” is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don’t mean meditation, I mean thinking and also thinking about thinking/wanting/acting—aka some kinds of philosophy and math.)
Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
I think the appropriate analogy is someone trying to strategize about “the hard part of climbing that mountain over there” before they have even reached base camp or seriously attempted to summit any other mountains. There are a bunch of parts that might end up being hard, and one can come up with some reasonable guesses as to what those parts might be, but the bits that look hard from a distance and the bits that end up being hard when you’re on the face of the mountain may be different parts.
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance
Doomed to irrelevance, or doomed to not being a complete solution in and of itself? The point of a lot of research is to look at a piece of the world and figure out how it ticks. Research to figure out how a piece of the world ticks won’t usually directly allow you to make it tock instead, but can be a useful stepping stone. Concrete example: dictionary learning vs Golden Gate Claude.
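To keep that example concrete, here is a minimal sparse autoencoder of the sort used in the dictionary-learning work, sketched by me on synthetic “activations”; real SAE work trains on a model’s actual residual-stream activations and adds many refinements.

```python
# Minimal sparse autoencoder, of the sort used in dictionary-learning
# interpretability work, trained here on synthetic "activations". Real work
# trains on a model's actual hidden states and adds many refinements.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(features)           # reconstruction from the dictionary
        return recon, features

d_model, d_dict = 64, 512                         # overcomplete dictionary
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                   # sparsity pressure

acts = torch.randn(4096, d_model)                 # placeholder for model activations

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, features = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of the decoder weight is a candidate "feature direction".
# Going from such a dictionary to something like Golden Gate Claude is the further
# step of interpreting those directions and steering the model along them.
print("decoder dictionary shape:", tuple(sae.decoder.weight.shape))  # (d_model, d_dict)
```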
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant.
I think one significant crux is “to what extent are LLMs doing the same sort of thing that human brains do / the same sorts of things that future, more powerful AIs will do?” It sounds like you think the answer is “they’re completely different and you won’t learn much about one by studying the other”. Is that an accurate characterization?
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans
Agreed, though with quibbles
and the best access you have to minds is introspection.
In my experience, my brain is a dirty lying liar that lies to me at every opportunity—another crux might be how faithful one expects their memory of their thought processes to be to the actual reality of those thought processes.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
my brain is a dirty lying liar that lies to me at every opportunity
So then it isn’t easy. But it’s feedback. Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
If you think that current mech interp work is currently trying to directly climb the mountain, rather than trying to build and test a set of techniques that might be helpful on a summit attempt, I can see why you’d be frustrated and discouraged at the lack of progress.
Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
I don’t have much hope in the former being feasible, though I do support having a nonzero number of people try it because sometimes things I don’t think are feasible end up working.
Look… Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn’t help with the right track and isn’t on track to get on the right track or to help with the right track.
Ok, so I’m telling you that this hypothetically possible situation seems to me like the reality. And then you’re, I don’t know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it’s not helpful to try to say that some big swath of research is doomed? If this is what’s happening, then I think that what you in particular are doing here is a bad thing to do here.
Maybe we can have a phone call if you’d like to discuss further.
Maybe we can have a phone call if you’d like to discuss further.
I doubt it’s worth it—I’m not a major funder in this space and don’t expect to become one in the near future, and my impression is that there is no imminent danger of you shutting down research that looks promising to me and unpromising to you. As such, I think the discussion ended up getting into the weeds in a way that probably wasn’t a great use of either of our time, and I doubt spending more time on it would change that.
That said, I appreciated your clarity of thought, and in particular your restatement of how the conversation looked to you. I will probably be stealing that technique.