I mean my impression is that there are on the order of 100-1000 people in the world working on ML interpretability as their day job, and maybe 1k-10k people who dabble in their free time. No research in the field will get done unless one of that small number of people makes a specific decision to tackle that particular research question instead of one of the countless other ones they could choose to tackle.
> Well, empirically, when people try to study it empirically, instead they do something else
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
In other words, I suspect it’s not “when someone starts to study this phenomenon, some mysterious process causes them to study something else instead”. I think it’s “the surface area of the field is large and there aren’t many people in it, so I doubt anyone has even gotten to the part where they start to study this phenomenon.”
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-“streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
Edit 2: I misunderstood what point you were making as “prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions” (which is a perspective I disagree with pretty strongly) rather than “I think empirical research shouldn’t be the only game in town” (which I agree with) and “we should fund outsiders to go do stuff without much interaction with or feedback from the community to hopefully develop new ideas that are not contaminated with the current community biases” (I think this would be worth doing if resources were unlimited, not sure as things actually stand).
As a concrete note, I suspect work that demonstrates that philosophical or mathematical approaches can yield predictions about empirical questions is more likely to be funded. For example, in your post you say
> In programming, adding a function definition would be endosystemic; refactoring the code into a functional style rather than an object-oriented style, or vice versa, in a way that reveals underlying structure, is diasystemic novelty.
Could that be operationalized as a prediction of the form
> If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
(I expect that’s not a correct operationalization but something of that shape)
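For concreteness, here is a rough sketch of the shape that experiment could take. Everything in it is a placeholder: `build_corpus`, `train_on`, and `doc_score` stand in for real data collection, a real training loop, and a real documentation-quality metric, and the task names are illustrative. It is meant to pin down the shape of the protocol, not to be an implementation.

```python
"""Shape-of-the-experiment sketch; all helpers are placeholders, not a real pipeline."""

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyModel:
    """Stand-in for whatever code model would actually be trained."""
    seen: List[Tuple[Tuple[str, ...], str]] = field(default_factory=list)


def build_corpus(styles: Tuple[str, ...], task: str) -> Dict:
    # Placeholder: would gather functional / object-oriented code for `task`.
    return {"styles": styles, "task": task}


def train_on(model: ToyModel, corpus: Dict) -> ToyModel:
    # Placeholder: would run an actual training loop.
    model.seen.append((corpus["styles"], corpus["task"]))
    return model


def doc_score(model: ToyModel, corpus: Dict) -> float:
    # Placeholder: would measure documentation quality on held-out libraries.
    return 0.0


def run_experiment() -> float:
    model = ToyModel()
    both = ("functional", "object_oriented")

    # Phase 1: simple tasks over BOTH code styles.
    for task in ("next_token", "fill_in_missing_token", "find_syntax_errors"):
        model = train_on(model, build_corpus(both, task))

    # Phase 2: the complex task, object-oriented code ONLY.
    model = train_on(model, build_corpus(("object_oriented",), "write_library_docs"))

    # Phase 3: compare documentation quality across styles.
    oo = doc_score(model, build_corpus(("object_oriented",), "write_library_docs"))
    fn = doc_score(model, build_corpus(("functional",), "write_library_docs"))
    return oo - fn


if __name__ == "__main__":
    print(run_experiment())
```

The prediction then cashes out as the returned gap being large: the model documents object-oriented libraries well but functional ones poorly.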
Here’s the convo according to me:

Bloom:
> I think there’s scope within a field like interp to focus on things that are closer to the hard part of the problem, or at least touch on robust bottlenecks for alignment agendas.

BT:
> Object level: ontology identification, in the sense that is studied empirically, is pretty useless.

sname:
> I haven’t seen anyone do such interpretability research yet but I see no particular reason to think this is the sort of thing that can’t be studied empirically rather than the sort of thing that hasn’t been studied empirically.

BT:
> Well, empirically, when people try to study it empirically, instead they do something else

sname:
> I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.

BT:
> ah, sname is talking about conceptual Doppelgängers specifically, as ze indicated in a previous comment that I now understand
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
> “prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions”
Right, I’m not saying exactly this. But I am saying:
Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
(*): studying LLMs, which are not minds; trying to recognize [stuff we mostly conceptually understand] within systems rather than trying to come to conceptually understand [the stuff we’d need to be able to recognize/design in a mind, in order to determine the mind’s effects].
> (I think this would be worth doing if resources were unlimited, not sure as things actually stand).
Well, you’ve agreed with a defanged version of my statements. The toothful version, which I do think: Insofar as this is even possible, we should allocate a lot more resources toward funding any high-caliber smart/creative/interesting/promising/motivated youngsters/newcomers who want to take a crack at independently approaching the core difficulties of AGI alignment, even if that means reallocating a lot of resources away from existing on-paradigm research.
> Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-“streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
This seems like a good thing to do. But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff. The absence of nice paths would really suck if it’s the case, and it’s hard to see how anyone could be justifiedly really confident that there are no nice paths. But yes, I’m saying that it looks like there aren’t nice paths, or at least there aren’t enough nice paths that we seem likely to find them by continuing to sample from the same distribution we’ve been sampling from; and I have some arguments and reasons supporting this belief, which seem true; and I would guess that a substantial fraction (though not most) of current alignment researchers would agree with a fairly strong version of “very few or no nice paths”.
> Could that be operationalized as a prediction of the form
>
> If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
I don’t think that’s a good operationalization, as you predict. I think it’s trying to be an operationalization related to my claim above:
> ontology identification, in the sense that is studied empirically, is pretty useless. It [..] AFAIK isn’t trying to [...] at all handle diasystemic novelty [...].
But it sort of sounds like you’re trying to extract a prediction about capability generalization or something? Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large-scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.) I don’t necessarily recommend this sort of study, though; I favor theory.
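For concreteness, here is a minimal toy in the direction of that study. Every specific choice in it (a GRU, coin flips with a uniform prior over the bias, a closed-form linear probe onto the analytic posterior mean, the hyperparameters) is an illustrative assumption rather than anything proposed above, and it is “easy mode” in exactly the sense just noted, since the probe target is the explicit Bayesian quantity. The interpreter-flavored signal is the probe fit rising across training checkpoints even though the network is only ever trained on next-flip prediction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

SEQ_LEN, BATCH, HIDDEN = 20, 256, 32


def make_batch(batch: int = BATCH, seq_len: int = SEQ_LEN):
    """Episodes of coin flips whose bias is resampled per episode from Uniform(0, 1)."""
    bias = torch.rand(batch, 1)
    flips = (torch.rand(batch, seq_len) < bias).float()
    # Analytic Bayesian posterior mean of the bias after each prefix (Beta(1, 1) prior).
    heads = flips.cumsum(dim=1)
    n_seen = torch.arange(1, seq_len + 1).float()
    posterior_mean = (heads + 1.0) / (n_seen + 2.0)
    return flips, posterior_mean


class CoinGRU(nn.Module):
    """Tiny recurrent net trained only to predict the next flip."""

    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, flips):
        states, _ = self.gru(flips.unsqueeze(-1))      # (batch, seq, hidden)
        return self.head(states).squeeze(-1), states   # next-flip logits, hidden states


def probe_r2(states, target):
    """R^2 of a closed-form linear probe from hidden states to the posterior mean."""
    x = states.reshape(-1, states.shape[-1])
    x = torch.cat([x, torch.ones(len(x), 1)], dim=1)   # bias column
    y = target.reshape(-1, 1)
    w = torch.linalg.lstsq(x, y).solution
    resid = ((x @ w - y) ** 2).sum()
    total = ((y - y.mean()) ** 2).sum()
    return (1.0 - resid / total).item()


model = CoinGRU()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1001):
    flips, posterior = make_batch()
    logits, states = model(flips[:, :-1])
    # The net only ever sees next-flip prediction; it is never told about posteriors.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, flips[:, 1:])
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 200 == 0:  # "checkpoints" available to the would-be interpreter
        with torch.no_grad():
            eval_flips, eval_posterior = make_batch()
            _, eval_states = model(eval_flips[:, :-1])
        print(step, round(probe_r2(eval_states, eval_posterior[:, :-1]), 3))
```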
> When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
Is there a particular reason you expect there to be exactly one hard part of the problem, and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
> Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
If I were a prosaic alignment researcher, I probably would choose to prioritize which problems I worked on a bit differently than those currently in the field. However, I expect that the research that ends up being the most useful will not be that research which looked most promising before someone started doing it, but rather research that stemmed from someone trying something extremely simple and getting an unexpected result, and going “huh, that’s funny, I should investigate further”. I think that the process of looking at lots of things and trying to get feedback from reality as quickly as possible is promising, even if I don’t have a strong expectation that any one specific one of those things is promising to look at.
> But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff.
Certainly reality doesn’t owe us a path like that, but it would be pretty undignified if reality did in fact give us a path like that and we failed to find it because we didn’t even look.
> Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large-scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.)
Interesting. I would be pretty interested to see research along these lines (although the scope of the above is probably still a bit large for a pilot project).
> I don’t necessarily recommend this sort of study, though; I favor theory.
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
> Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
> and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It’s not like pointing to a little ball in ideaspace and being like “the answer is somewhere in here”. Rather it’s like cutting out a halfspace and saying “everything on this side of this plane is doomed, we’d have to be somewhere in the other half”, or like pointing out a manifold that all research is on and saying “anything on this manifold is doomed, we’d have to figure out how to move somewhat orthogonalward”.
> research that stemmed from someone trying something extremely simple and getting an unexpected result
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don’t care if you have an army of people who all agree on taking a stance that seems to imply that there’s not much relevant difference between LLMs and future AGI systems that might kill everyone.)
> What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
I think you (and everyone else) don’t know how to ask this question properly. For example, “on whether your theory describes the world as it is” is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don’t mean meditation, I mean thinking and also thinking about thinking/wanting/acting—aka some kinds of philosophy and math.)
> Is there a particular reason you expect there to be exactly one hard part of the problem,
>
> Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
I think the appropriate analogy is someone trying to strategize about “the hard part of climbing that mountain over there” before they have even reached base camp or seriously attempted to summit any other mountains. There are a bunch of parts that might end up being hard, and one can come up with some reasonable guesses as to what those parts might be, but the bits that look hard from a distance and the bits that end up being hard when you’re on the face of the mountain may be different parts.
> We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance
Doomed to irrelevance, or doomed to not being a complete solution in and of itself? The point of a lot of research is to look at a piece of the world and figure out how it ticks. Research to figure out how a piece of the world ticks won’t usually directly allow you to make it tock instead, but can be a useful stepping stone. Concrete example: dictionary learning vs Golden Gate Claude.
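For readers who haven’t seen the “dictionary learning” half of that example, here is a minimal sparse-autoencoder sketch of it. The widths, the L1 coefficient, and the training loop are arbitrary illustrative choices, not the actual setup behind Golden Gate Claude; the “make it tock” half (adding a learned feature direction back into the activations to steer the model) is exactly what this sketch stops short of.

```python
import torch
import torch.nn as nn

D_MODEL, D_DICT = 512, 4096   # activation width and dictionary size (arbitrary)


class SparseAutoencoder(nn.Module):
    """Decompose activations into a sparse combination of learned dictionary features."""

    def __init__(self, d_model: int = D_MODEL, d_dict: int = D_DICT):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        return self.decoder(features), features    # reconstruction, features


sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
L1_COEFF = 1e-3


def train_step(acts: torch.Tensor) -> float:
    """One dictionary-learning step on a batch of captured activations."""
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + L1_COEFF * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Stand-in data; real use would feed activations recorded from a model's residual stream.
print(train_step(torch.randn(1024, D_MODEL)))
```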
> I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant.
I think one significant crux is “to what extent are LLMs doing the same sort of thing that human brains do / the same sorts of things that future, more powerful AIs will do?” It sounds like you think the answer is “they’re completely different and you won’t learn much about one by studying the other”. Is that an accurate characterization?
> To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans
Agreed, though with quibbles
> and the best access you have to minds is introspection.
In my experience, my brain is a dirty lying liar that lies to me at every opportunity—another crux might be how faithful one expects their memory of their thought processes to be to the actual reality of those thought processes.
> Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
> my brain is a dirty lying liar that lies to me at every opportunity
So then it isn’t easy. But it’s feedback. Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
> Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
>
> Doomed to not be trying to go to and then climb the mountain.
If you think that current mech interp work is currently trying to directly climb the mountain, rather than trying to build and test a set of techniques that might be helpful on a summit attempt, I can see why you’d be frustrated and discouraged at the lack of progress.
> Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
I don’t have much hope in the former being feasible, though I do support having a nonzero number of people try it because sometimes things I don’t think are feasible end up working.
Look… Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn’t help with the right track and isn’t on track to get on the right track or to help with the right track.
Ok, so I’m telling you that this hypothetically possible situation seems to me like the reality. And then you’re, I don’t know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it’s not helpful to try to say that some big swath of research is doomed? If this is what’s happening, then I think that what you in particular are doing here is a bad thing to do here.
Maybe we can have a phone call if you’d like to discuss further.
> Maybe we can have a phone call if you’d like to discuss further.
I doubt it’s worth it—I’m not a major funder in this space and don’t expect to become one in the near future, and my impression is that there is no imminent danger of you shutting down research that looks promising to me and unpromising to you. As such, I think the discussion ended up getting into the weeds in a way that probably wasn’t a great use of either of our time, and I doubt spending more time on it would change that.
That said, I appreciated your clarity of thought, and in particular your restatement of how the conversation looked to you. I will probably be stealing that technique.